Age | Commit message (Collapse) | Author |
|
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
"Core:
- Introduction of iommu-pages infrastructure to consolitate
page-table allocation code among hardware drivers. This is
ground-work for more generalization in the future
- Remove IOMMU_DEV_FEAT_SVA and IOMMU_DEV_FEAT_IOPF feature flags
- Convert virtio-iommu to domain_alloc_paging()
- KConfig cleanups
- Some small fixes for possible overflows and race conditions
Intel VT-d driver:
- Restore WO permissions on second-level paging entries
- Use ida to manage domain id
- Miscellaneous cleanups
AMD-Vi:
- Make sure notifiers finish running before module unload
- Add support for HTRangeIgnore feature
- Allow matching ACPI HID devices without matching UIDs
ARM-SMMU:
- SMMUv2:
- Recognise the compatible string for SAR2130P MDSS in the
Qualcomm driver, as this device requires an identity domain
- Fix Adreno stall handling so that GPU debugging is more robust
and doesn't e.g. result in deadlock
- SMMUv3:
- Fix ->attach_dev() error reporting for unrecognised domains
- IO-pgtable:
- Allow clients (notably, drivers that process requests from
userspace) to silence warnings when mapping an already-mapped
IOVA
S390:
- Add support for additional table regions
Mediatek:
- Add support for MT6893 MM IOMMU
And some smaller fixes and improvements in various other drivers"
* tag 'iommu-updates-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (75 commits)
iommu/vt-d: Restore context entry setup order for aliased devices
iommu/mediatek: Fix compatible typo for mediatek,mt6893-iommu-mm
iommu/arm-smmu-qcom: Make set_stall work when the device is on
iommu/arm-smmu: Move handing of RESUME to the context fault handler
iommu/arm-smmu-qcom: Enable threaded IRQ for Adreno SMMUv2/MMU500
iommu/io-pgtable-arm: Add quirk to quiet WARN_ON()
iommu: Clear the freelist after iommu_put_pages_list()
iommu/vt-d: Change dmar_ats_supported() to return boolean
iommu/vt-d: Eliminate pci_physfn() in dmar_find_matched_satc_unit()
iommu/vt-d: Replace spin_lock with mutex to protect domain ida
iommu/vt-d: Use ida to manage domain id
iommu/vt-d: Restore WO permissions on second-level paging entries
iommu/amd: Allow matching ACPI HID devices without matching UIDs
iommu: make inclusion of arm/arm-smmu-v3 directory conditional
iommu: make inclusion of riscv directory conditional
iommu: make inclusion of amd directory conditional
iommu: make inclusion of intel directory conditional
iommu: remove duplicate selection of DMAR_TABLE
iommu/fsl_pamu: remove trailing space after \n
iommu/arm-smmu-qcom: Add SAR2130P MDSS compatible
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping updates from Marek Szyprowski:
"New two step DMA mapping API, which is is a first step to a long path
to provide alternatives to scatterlist and to remove hacks, abuses and
design mistakes related to scatterlists.
This new approach optimizes some calls to DMA-IOMMU layer and cache
maintenance by batching them, reduces memory usage as it is no need to
store mapped DMA addresses to unmap them, and reduces some function
call overhead. It is a combination effort of many people, lead and
developed by Christoph Hellwig and Leon Romanovsky"
* tag 'dma-mapping-6.16-2025-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
docs: core-api: document the IOVA-based API
dma-mapping: add a dma_need_unmap helper
dma-mapping: Implement link/unlink ranges API
iommu/dma: Factor out a iommu_dma_map_swiotlb helper
dma-mapping: Provide an interface to allow allocate IOVA
iommu: add kernel-doc for iommu_unmap_fast
iommu: generalize the batched sync after map interface
dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
PCI/P2PDMA: Refactor the p2pdma mapping helpers
|
|
The commit below reworked how iommu_put_pages_list() worked to not do
list_del() on every entry. This was done expecting all the callers to
already re-init the list so doing a per-item deletion is not
efficient.
It was missed that fq_ring_free_locked() re-uses its list after calling
iommu_put_pages_list() and so the leftover list reaches free'd struct
pages and will crash or WARN/BUG/etc.
Reinit the list to empty in fq_ring_free_locked() after calling
iommu_put_pages_list().
Audit to see if any other callers of iommu_put_pages_list() need the list
to be empty:
- iommu_dma_free_fq_single() and iommu_dma_free_fq_percpu() immediately
frees the memory
- iommu_v1_map_pages(), v1_free_pgtable(), domain_exit(),
riscv_iommu_map_pages() uses a stack variable which goes out of scope
- intel_iommu_tlb_sync() uses a gather in a iotlb_sync() callback, the
caller re-inits the gather
Fixes: 13f43d7cf3e0 ("iommu/pages: Formalize the freelist API")
Reported-by: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
Closes: https://lore.kernel.org/r/SJ1PR11MB61292CE72D7BE06B8810021CB997A@SJ1PR11MB6129.namprd11.prod.outlook.com
Tested-by: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-7d4dfa6140f7+11f04-iommu_freelist_init_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Introduce new DMA APIs to perform DMA linkage of buffers
in layers higher than DMA.
In proposed API, the callers will perform the following steps.
In map path:
if (dma_can_use_iova(...))
dma_iova_alloc()
for (page in range)
dma_iova_link_next(...)
dma_iova_sync(...)
else
/* Fallback to legacy map pages */
for (all pages)
dma_map_page(...)
In unmap path:
if (dma_can_use_iova(...))
dma_iova_destroy()
else
for (all pages)
dma_unmap_page(...)
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
Split the iommu logic from iommu_dma_map_page into a separate helper.
This not only keeps the code neatly separated, but will also allow for
reuse in another caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
The existing .map_pages() callback provides both allocating of IOVA
and linking DMA pages. That combination works great for most of the
callers who use it in control paths, but is less effective in fast
paths where there may be multiple calls to map_page().
These advanced callers already manage their data in some sort of
database and can perform IOVA allocation in advance, leaving range
linkage operation to be in fast path.
Provide an interface to allocate/deallocate IOVA and next patch
link/unlink DMA ranges to that specific IOVA.
In the new API a DMA mapping transaction is identified by a
struct dma_iova_state, which holds some recomputed information
for the transaction which does not change for each page being
mapped, so add a check if IOVA can be used for the specific
transaction.
The API is exported from dma-iommu as it is the only implementation
supported, the namespace is clearly different from iommu_* functions
which are not allowed to be used. This code layout allows us to save
function call per API call used in datapath as well as a lot of boilerplate
code.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
To support the upcoming non-scatterlist mapping helpers, we need to go
back to have them called outside of the DMA API. Thus move them out of
dma-map-ops.h, which is only for DMA API implementations to pci-p2pdma.h,
which is for driver use.
Note that the core helper is still not exported as the mapping is
expected to be done only by very highlevel subsystem code at least for
now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
The current scheme with a single helper to determine the P2P status
and map a scatterlist segment force users to always use the map_sg
helper to DMA map, which we're trying to get away from because they
are very cache inefficient.
Refactor the code so that there is a single helper that checks the P2P
state for a page, including the result that it is not a P2P page to
simplify the callers, and a second one to perform the address translation
for a bus mapped P2P transfer that does not depend on the scatterlist
structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
This converts the remaining places using list of pages to the new API.
The Intel free path was shared with its gather path, so it is converted at
the same time.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
cocci warnings:
drivers/iommu/dma-iommu.c:1788:2-3: Unneeded semicolon
so remove unneeded semicolon to fix cocci warnings.
Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/tencent_73EEE47E6ECCF538229C9B9E6A0272DA2B05@qq.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Two significant new items:
- Allow reporting IOMMU HW events to userspace when the events are
clearly linked to a device.
This is linked to the VIOMMU object and is intended to be used by a
VMM to forward HW events to the virtual machine as part of
emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
mechanism. Like the existing fault events the data is delivered
through a simple FD returning event records on read().
- PASID support in VFIO.
The "Process Address Space ID" is a PCI feature that allows the
device to tag all PCI DMA operations with an ID. The IOMMU will
then use the ID to select a unique translation for those DMAs. This
is part of Intel's vIOMMU support as VT-D HW requires the
hypervisor to manage each PASID entry.
The support is generic so any VFIO user could attach any
translation to a PASID, and the support should work on ARM SMMUv3
as well. AMD requires additional driver work.
Some minor updates, along with fixes:
- Prevent using nested parents with fault's, no driver support today
- Put a single "cookie_type" value in the iommu_domain to indicate
what owns the various opaque owner fields"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (49 commits)
iommufd: Test attach before detaching pasid
iommufd: Fix iommu_vevent_header tables markup
iommu: Convert unreachable() to BUG()
iommufd: Balance veventq->num_events inc/dec
iommufd: Initialize the flags of vevent in iommufd_viommu_report_event()
iommufd/selftest: Add coverage for reporting max_pasid_log2 via IOMMU_HW_INFO
iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasid
vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
ida: Add ida_find_first_range()
iommufd/selftest: Add coverage for iommufd pasid attach/detach
iommufd/selftest: Add test ops to test pasid attach/detach
iommufd/selftest: Add a helper to get test device
iommufd/selftest: Add set_dev_pasid in mock iommu
iommufd: Allow allocating PASID-compatible domain
iommu/vt-d: Add IOMMU_HWPT_ALLOC_PASID support
iommufd: Enforce PASID-compatible domain for RID
iommufd: Support pasid attach/replace
iommufd: Enforce PASID-compatible domain in PASID path
iommufd/device: Add pasid_attach array to track per-PASID attach
...
|
|
Bare unreachable() should be avoided as it generates undefined behavior,
e.g. falling through to the next function. Use BUG() instead so the
error is defined.
Fixes the following warnings:
drivers/iommu/dma-iommu.o: warning: objtool: iommu_dma_sw_msi+0x92: can't find jump dest instruction at .text+0x54d5
vmlinux.o: warning: objtool: iommu_dma_get_msi_page() falls through to next function __iommu_dma_unmap()
Link: https://patch.msgid.link/r/0c801ae017ec078cacd39f8f0898fc7780535f85.1743053325.git.jpoimboe@kernel.org
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Closes: https://lore.kernel.org/314f8809-cd59-479b-97d7-49356bf1c8d1@infradead.org
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Closes: https://lore.kernel.org/5dd1f35e-8ece-43b7-ad6d-86d02d2718f6@paulmck-laptop
Fixes: 6aa63a4ec947 ("iommu: Sort out domain user data")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
There are only two sw_msi implementations in the entire system, thus it's
not very necessary to have an sw_msi pointer.
Instead, check domain->cookie_type to call the two sw_msi implementations
directly from the core code.
Link: https://patch.msgid.link/r/7ded87c871afcbaac665b71354de0a335087bf0f.1742871535.git.nicolinc@nvidia.com
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
When DMA/MSI cookies were made first-class citizens back in commit
46983fcd67ac ("iommu: Pull IOVA cookie management into the core"), there
was no real need to further expose the two different cookie types.
However, now that IOMMUFD wants to add a third type of MSI-mapping
cookie, we do have a nicely compelling reason to properly dismabiguate
things at the domain level beyond just vaguely guessing from the domain
type.
Meanwhile, we also effectively have another "cookie" in the form of the
anonymous union for other user data, which isn't much better in terms of
being vague and unenforced. The fact is that all these cookie types are
mutually exclusive, in the sense that combining them makes zero sense
and/or would be catastrophic (iommu_set_fault_handler() on an SVA
domain, anyone?) - the only combination which *might* be reasonable is
perhaps a fault handler and an MSI cookie, but nobody's doing that at
the moment, so let's rule it out as well for the sake of being clear and
robust. To that end, we pull DMA and MSI cookies apart a little more,
mostly to clear up the ambiguity at domain teardown, then for clarity
(and to save a little space), move them into the union, whose ownership
we can then properly describe and enforce entirely unambiguously.
[nicolinc: rebase on latest tree; use prefix IOMMU_COOKIE_; merge unions
in iommu_domain; add IOMMU_COOKIE_IOMMUFD for iommufd_hwpt]
Link: https://patch.msgid.link/r/1ace9076c95204bbe193ee77499d395f15f44b23.1742871535.git.nicolinc@nvidia.com
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
This reverts commit ac9a5d522bb80be50ea84965699e1c8257d745ce.
iommu_dma_init_domain() is now only called under the group mutex, as it
should be given that that the default domain belongs to the group, so
the additional internal locking is no longer needed.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/a943d4c198e6a1fffe998337d577dc3aa7f660a9.1740585469.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Currently, IRQ_MSI_IOMMU is selected if DMA_IOMMU is available to provide
an implementation for iommu_dma_prepare/compose_msi_msg(). However, it'll
make more sense for irqchips that call prepare/compose to select it, and
that will trigger all the additional code and data to be compiled into
the kernel.
If IRQ_MSI_IOMMU is selected with no IOMMU side implementation, then the
prepare/compose() will be NOP stubs.
If IRQ_MSI_IOMMU is not selected by an irqchip, then the related code on
the iommu side is compiled out.
Link: https://patch.msgid.link/r/a2620f67002c5cdf974e89ca3bf905f5c0817be6.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
SW_MSI supports IOMMU to translate an MSI message before the MSI message
is delivered to the interrupt controller. On such systems, an iommu_domain
must have a translation for the MSI message for interrupts to work.
The IRQ subsystem will call into IOMMU to request that a physical page be
set up to receive MSI messages, and the IOMMU then sets an IOVA that maps
to that physical page. Ultimately the IOVA is programmed into the device
via the msi_msg.
Generalize this by allowing iommu_domain owners to provide implementations
of this mapping. Add a function pointer in struct iommu_domain to allow a
domain owner to provide its own implementation.
Have dma-iommu supply its implementation for IOMMU_DOMAIN_DMA types during
the iommu_get_dma_cookie() path. For IOMMU_DOMAIN_UNMANAGED types used by
VFIO (and iommufd for now), have the same iommu_dma_sw_msi set as well in
the iommu_get_msi_cookie() path.
Hold the group mutex while in iommu_dma_prepare_msi() to ensure the domain
doesn't change or become freed while running. Races with IRQ operations
from VFIO and domain changes from iommufd are possible here.
Replace the msi_prepare_lock with a lockdep assertion for the group mutex
as documentation. For the dmau_iommu.c each iommu_domain is unique to a
group.
Link: https://patch.msgid.link/r/4ca696150d2baee03af27c4ddefdb7b0b0280e7b.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The two-step process to translate the MSI address involves two functions,
iommu_dma_prepare_msi() and iommu_dma_compose_msi_msg().
Previously iommu_dma_compose_msi_msg() needed to be in the iommu layer as
it had to dereference the opaque cookie pointer. Now, the previous patch
changed the cookie pointer into an integer so there is no longer any need
for the iommu layer to be involved.
Further, the call sites of iommu_dma_compose_msi_msg() all follow the same
pattern of setting an MSI message address_hi/lo to non-translated and then
immediately calling iommu_dma_compose_msi_msg().
Refactor iommu_dma_compose_msi_msg() into msi_msg_set_addr() that directly
accepts the u64 version of the address and simplifies all the callers.
Move the new helper to linux/msi.h since it has nothing to do with iommu.
Aside from refactoring, this logically prepares for the next patch, which
allows multiple implementation options for iommu_dma_prepare_msi(). So, it
does not make sense to have the iommu_dma_compose_msi_msg() in dma-iommu.c
as it no longer provides the only iommu_dma_prepare_msi() implementation.
Link: https://patch.msgid.link/r/eda62a9bafa825e9cdabd7ddc61ad5a21c32af24.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The IOMMU translation for MSI message addresses has been a 2-step process,
separated in time:
1) iommu_dma_prepare_msi(): A cookie pointer containing the IOVA address
is stored in the MSI descriptor when an MSI interrupt is allocated.
2) iommu_dma_compose_msi_msg(): this cookie pointer is used to compute a
translated message address.
This has an inherent lifetime problem for the pointer stored in the cookie
that must remain valid between the two steps. However, there is no locking
at the irq layer that helps protect the lifetime. Today, this works under
the assumption that the iommu domain is not changed while MSI interrupts
being programmed. This is true for normal DMA API users within the kernel,
as the iommu domain is attached before the driver is probed and cannot be
changed while a driver is attached.
Classic VFIO type1 also prevented changing the iommu domain while VFIO was
running as it does not support changing the "container" after starting up.
However, iommufd has improved this so that the iommu domain can be changed
during VFIO operation. This potentially allows userspace to directly race
VFIO_DEVICE_ATTACH_IOMMUFD_PT (which calls iommu_attach_group()) and
VFIO_DEVICE_SET_IRQS (which calls into iommu_dma_compose_msi_msg()).
This potentially causes both the cookie pointer and the unlocked call to
iommu_get_domain_for_dev() on the MSI translation path to become UAFs.
Fix the MSI cookie UAF by removing the cookie pointer. The translated IOVA
address is already known during iommu_dma_prepare_msi() and cannot change.
Thus, it can simply be stored as an integer in the MSI descriptor.
The other UAF related to iommu_get_domain_for_dev() will be addressed in
patch "iommu: Make iommu_dma_prepare_msi() into a generic operation" by
using the IOMMU group mutex.
Link: https://patch.msgid.link/r/a4f2cd76b9dc1833ee6c1cf325cba57def22231c.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Commit b5c58b2fdc42 ("dma-mapping: direct calls for dma-iommu") switched
to use direct calls to dma-iommu, but missed the dma_vmap_noncontiguous,
dma_vunmap_noncontiguous and dma_mmap_noncontiguous behavior keyed off the
presence of the alloc_noncontiguous method.
Fix this by removing the now unused alloc_noncontiguous and
free_noncontiguous methods and moving the vmapping and mmaping of the
noncontiguous allocations into the iommu code, as it is the only provider
of actually noncontiguous allocations.
Fixes: b5c58b2fdc42 ("dma-mapping: direct calls for dma-iommu")
Reported-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
Tested-by: Xi Ruoyao <xry111@xry111.site>
|
|
Directly call into dma-iommu just like we have been doing for dma-direct
for a while. This avoids the indirect call overhead for IOMMU ops and
removes the need to have DMA ops entirely for many common configurations.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- reduce duplicate swiotlb pool lookups (Michael Kelley)
- minor small fixes (Yicong Yang, Yang Li)
* tag 'dma-mapping-6.11-2024-07-19' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix kernel-doc description for swiotlb_del_transient
swiotlb: reduce swiotlb pool lookups
dma-mapping: benchmark: Don't starve others when doing the test
|
|
With CONFIG_SWIOTLB_DYNAMIC enabled, each round-trip map/unmap pair
in the swiotlb results in 6 calls to swiotlb_find_pool(). In multiple
places, the pool is found and used in one function, and then must
be found again in the next function that is called because only the
tlb_addr is passed as an argument. These are the six call sites:
dma_direct_map_page:
1. swiotlb_map -> swiotlb_tbl_map_single -> swiotlb_bounce
dma_direct_unmap_page:
2. dma_direct_sync_single_for_cpu -> is_swiotlb_buffer
3. dma_direct_sync_single_for_cpu -> swiotlb_sync_single_for_cpu ->
swiotlb_bounce
4. is_swiotlb_buffer
5. swiotlb_tbl_unmap_single -> swiotlb_del_transient
6. swiotlb_tbl_unmap_single -> swiotlb_release_slots
Reduce the number of calls by finding the pool at a higher level, and
passing it as an argument instead of searching again. A key change is
for is_swiotlb_buffer() to return a pool pointer instead of a boolean,
and then pass this pool pointer to subsequent swiotlb functions.
There are 9 occurrences of is_swiotlb_buffer() used to test if a buffer
is a swiotlb buffer before calling a swiotlb function. To reduce code
duplication in getting the pool pointer and passing it as an argument,
introduce inline wrappers for this pattern. The generated code is
essentially unchanged.
Since is_swiotlb_buffer() no longer returns a boolean, rename some
functions to reflect the change:
* swiotlb_find_pool() becomes __swiotlb_find_pool()
* is_swiotlb_buffer() becomes swiotlb_find_pool()
* is_xen_swiotlb_buffer() becomes xen_swiotlb_find_pool()
With these changes, a round-trip map/unmap pair requires only 2 pool
lookups (listed using the new names and wrappers):
dma_direct_unmap_page:
1. dma_direct_sync_single_for_cpu -> swiotlb_find_pool
2. swiotlb_tbl_unmap_single -> swiotlb_find_pool
These changes come from noticing the inefficiencies in a code review,
not from performance measurements. With CONFIG_SWIOTLB_DYNAMIC,
__swiotlb_find_pool() is not trivial, and it uses an RCU read lock,
so avoiding the redundant calls helps performance in a hot path.
When CONFIG_SWIOTLB_DYNAMIC is *not* set, the code size reduction
is minimal and the perf benefits are likely negligible, but no
harm is done.
No functional change is intended.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Petr Tesarik <petr@tesarici.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Somewhere amongst previous refactorings, the pgprot value in
__iommu_dma_alloc_noncontiguous() became entirely unused, and the one
used in iommu_dma_alloc_remap() can be computed locally rather than by
its one remaining caller. Clean 'em up.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/c2a81b72df59a71a13f8bad94f834e627c4c93dd.1717504749.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Despite carefully rewording the kerneldoc to describe the new direct
interaction with dma_range_map, it seems I managed to confuse myself in
removing the redundant force_aperture check and ended up making the code
not do that at all. This led to dma_range_maps inadvertently being able
to set iovad->start_pfn = 0, and all the nonsensical chaos which ensues
from there. Restore the correct behaviour of constraining base_pfn to
the domain aperture regardless of dma_range_map, and not trying to apply
dma_range_map constraints to the basic IOVA domain since they will be
properly handled with reserved regions later.
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Reported-by: Jerry Snitselaar <jsnitsel@redhat.com>
Fixes: ad4750b07d34 ("iommu/dma: Make limit checks self-contained")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Link: https://lore.kernel.org/r/721fa6baebb0924aa40db0b8fb86bcb4538434af.1716232484.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- optimize DMA sync calls when they are no-ops (Alexander Lobakin)
- fix swiotlb padding for untrusted devices (Michael Kelley)
- add documentation for swiotb (Michael Kelley)
* tag 'dma-mapping-6.10-2024-05-20' of git://git.infradead.org/users/hch/dma-mapping:
dma: fix DMA sync for drivers not calling dma_set_mask*()
xsk: use generic DMA sync shortcut instead of a custom one
page_pool: check for DMA sync shortcut earlier
page_pool: don't use driver-set flags field directly
page_pool: make sure frag API fields don't span between cachelines
iommu/dma: avoid expensive indirect calls for sync operations
dma: avoid redundant calls for sync operations
dma: compile-out DMA sync op calls when not used
iommu/dma: fix zeroing of bounce buffer padding used by untrusted devices
swiotlb: remove alloc_size argument to swiotlb_tbl_map_single()
Documentation/core-api: add swiotlb documentation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
"The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.
Notable series include:
- Lucas Stach has provided some page-mapping cleanup/consolidation/
maintainability work in the series "mm/treewide: Remove pXd_huge()
API".
- In the series "Allow migrate on protnone reference with
MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
one test.
- In their series "Memory allocation profiling" Kent Overstreet and
Suren Baghdasaryan have contributed a means of determining (via
/proc/allocinfo) whereabouts in the kernel memory is being
allocated: number of calls and amount of memory.
- Matthew Wilcox has provided the series "Various significant MM
patches" which does a number of rather unrelated things, but in
largely similar code sites.
- In his series "mm: page_alloc: freelist migratetype hygiene"
Johannes Weiner has fixed the page allocator's handling of
migratetype requests, with resulting improvements in compaction
efficiency.
- In the series "make the hugetlb migration strategy consistent"
Baolin Wang has fixed a hugetlb migration issue, which should
improve hugetlb allocation reliability.
- Liu Shixin has hit an I/O meltdown caused by readahead in a
memory-tight memcg. Addressed in the series "Fix I/O high when
memory almost met memcg limit".
- In the series "mm/filemap: optimize folio adding and splitting"
Kairui Song has optimized pagecache insertion, yielding ~10%
performance improvement in one test.
- Baoquan He has cleaned up and consolidated the early zone
initialization code in the series "mm/mm_init.c: refactor
free_area_init_core()".
- Baoquan has also redone some MM initializatio code in the series
"mm/init: minor clean up and improvement".
- MM helper cleanups from Christoph Hellwig in his series "remove
follow_pfn".
- More cleanups from Matthew Wilcox in the series "Various
page->flags cleanups".
- Vlastimil Babka has contributed maintainability improvements in the
series "memcg_kmem hooks refactoring".
- More folio conversions and cleanups in Matthew Wilcox's series:
"Convert huge_zero_page to huge_zero_folio"
"khugepaged folio conversions"
"Remove page_idle and page_young wrappers"
"Use folio APIs in procfs"
"Clean up __folio_put()"
"Some cleanups for memory-failure"
"Remove page_mapping()"
"More folio compat code removal"
- David Hildenbrand chipped in with "fs/proc/task_mmu: convert
hugetlb functions to work on folis".
- Code consolidation and cleanup work related to GUP's handling of
hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
- Rick Edgecombe has developed some fixes to stack guard gaps in the
series "Cover a guard gap corner case".
- Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
series "mm/ksm: fix ksm exec support for prctl".
- Baolin Wang has implemented NUMA balancing for multi-size THPs.
This is a simple first-cut implementation for now. The series is
"support multi-size THP numa balancing".
- Cleanups to vma handling helper functions from Matthew Wilcox in
the series "Unify vma_address and vma_pgoff_address".
- Some selftests maintenance work from Dev Jain in the series
"selftests/mm: mremap_test: Optimizations and style fixes".
- Improvements to the swapping of multi-size THPs from Ryan Roberts
in the series "Swap-out mTHP without splitting".
- Kefeng Wang has significantly optimized the handling of arm64's
permission page faults in the series
"arch/mm/fault: accelerate pagefault when badaccess"
"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
- GUP cleanups from David Hildenbrand in "mm/gup: consistently call
it GUP-fast".
- hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
path to use struct vm_fault".
- selftests build fixes from John Hubbard in the series "Fix
selftests/mm build without requiring "make headers"".
- Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
series "Improved Memory Tier Creation for CPUless NUMA Nodes".
Fixes the initialization code so that migration between different
memory types works as intended.
- David Hildenbrand has improved follow_pte() and fixed an errant
driver in the series "mm: follow_pte() improvements and acrn
follow_pte() fixes".
- David also did some cleanup work on large folio mapcounts in his
series "mm: mapcount for large folios + page_mapcount() cleanups".
- Folio conversions in KSM in Alex Shi's series "transfer page to
folio in KSM".
- Barry Song has added some sysfs stats for monitoring multi-size
THP's in the series "mm: add per-order mTHP alloc and swpout
counters".
- Some zswap cleanups from Yosry Ahmed in the series "zswap
same-filled and limit checking cleanups".
- Matthew Wilcox has been looking at buffer_head code and found the
documentation to be lacking. The series is "Improve buffer head
documentation".
- Multi-size THPs get more work, this time from Lance Yang. His
series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
optimizes the freeing of these things.
- Kemeng Shi has added more userspace-visible writeback
instrumentation in the series "Improve visibility of writeback".
- Kemeng Shi then sent some maintenance work on top in the series
"Fix and cleanups to page-writeback".
- Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
the series "Improve anon_vma scalability for anon VMAs". Intel's
test bot reported an improbable 3x improvement in one test.
- SeongJae Park adds some DAMON feature work in the series
"mm/damon: add a DAMOS filter type for page granularity access recheck"
"selftests/damon: add DAMOS quota goal test"
- Also some maintenance work in the series
"mm/damon/paddr: simplify page level access re-check for pageout"
"mm/damon: misc fixes and improvements"
- David Hildenbrand has disabled some known-to-fail selftests ni the
series "selftests: mm: cow: flag vmsplice() hugetlb tests as
XFAIL".
- memcg metadata storage optimizations from Shakeel Butt in "memcg:
reduce memory consumption by memcg stats".
- DAX fixes and maintenance work from Vishal Verma in the series
"dax/bus.c: Fixups for dax-bus locking""
* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
selftests: cgroup: add tests to verify the zswap writeback path
mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
mm/damon/core: fix return value from damos_wmark_metric_value
mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
selftests: cgroup: remove redundant enabling of memory controller
Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
Docs/mm/damon/design: use a list for supported filters
Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
selftests/damon: classify tests for functionalities and regressions
selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
selftests/damon: add a test for DAMOS quota goal
...
|
|
into next
|
|
When IOMMU is on, the actual synchronization happens in the same cases
as with the direct DMA. Advertise %DMA_F_CAN_SKIP_SYNC in IOMMU DMA to
skip sync ops calls (indirect) for non-SWIOTLB buffers.
perf profile before the patch:
18.53% [kernel] [k] gq_rx_skb
14.77% [kernel] [k] napi_reuse_skb
8.95% [kernel] [k] skb_release_data
5.42% [kernel] [k] dev_gro_receive
5.37% [kernel] [k] memcpy
<*> 5.26% [kernel] [k] iommu_dma_sync_sg_for_cpu
4.78% [kernel] [k] tcp_gro_receive
<*> 4.42% [kernel] [k] iommu_dma_sync_sg_for_device
4.12% [kernel] [k] ipv6_gro_receive
3.65% [kernel] [k] gq_pool_get
3.25% [kernel] [k] skb_gro_receive
2.07% [kernel] [k] napi_gro_frags
1.98% [kernel] [k] tcp6_gro_receive
1.27% [kernel] [k] gq_rx_prep_buffers
1.18% [kernel] [k] gq_rx_napi_handler
0.99% [kernel] [k] csum_partial
0.74% [kernel] [k] csum_ipv6_magic
0.72% [kernel] [k] free_pcp_prepare
0.60% [kernel] [k] __napi_poll
0.58% [kernel] [k] net_rx_action
0.56% [kernel] [k] read_tsc
<*> 0.50% [kernel] [k] __x86_indirect_thunk_r11
0.45% [kernel] [k] memset
After patch, lines with <*> no longer show up, and overall
cpu usage looks much better (~60% instead of ~72%):
25.56% [kernel] [k] gq_rx_skb
9.90% [kernel] [k] napi_reuse_skb
7.39% [kernel] [k] dev_gro_receive
6.78% [kernel] [k] memcpy
6.53% [kernel] [k] skb_release_data
6.39% [kernel] [k] tcp_gro_receive
5.71% [kernel] [k] ipv6_gro_receive
4.35% [kernel] [k] napi_gro_frags
4.34% [kernel] [k] skb_gro_receive
3.50% [kernel] [k] gq_pool_get
3.08% [kernel] [k] gq_rx_napi_handler
2.35% [kernel] [k] tcp6_gro_receive
2.06% [kernel] [k] gq_rx_prep_buffers
1.32% [kernel] [k] csum_partial
0.93% [kernel] [k] csum_ipv6_magic
0.65% [kernel] [k] net_rx_action
iavf yields +10% of Mpps on Rx. This also unblocks batched allocations
of XSk buffers when IOMMU is active.
Co-developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
iommu_dma_map_page() allocates swiotlb memory as a bounce buffer when an
untrusted device wants to map only part of the memory in an granule. The
goal is to disallow the untrusted device having DMA access to unrelated
kernel data that may be sharing the granule. To meet this goal, the
bounce buffer itself is zeroed, and any additional swiotlb memory up to
alloc_size after the bounce buffer end (i.e., "post-padding") is also
zeroed.
However, as of commit 901c7280ca0d ("Reinstate some of "swiotlb: rework
"fix info leak with DMA_FROM_DEVICE"""), swiotlb_tbl_map_single() always
initializes the contents of the bounce buffer to the original memory.
Zeroing the bounce buffer is redundant and probably wrong per the
discussion in that commit. Only the post-padding needs to be zeroed.
Also, when the DMA min_align_mask is non-zero, the allocated bounce
buffer space may not start on a granule boundary. The swiotlb memory
from the granule boundary to the start of the allocated bounce buffer
might belong to some unrelated bounce buffer. So as described in the
"second issue" in [1], it can't be zeroed to protect against untrusted
devices. But as of commit af133562d5af ("swiotlb: extend buffer
pre-padding to alloc_align_mask if necessary"), swiotlb_tbl_map_single()
allocates pre-padding slots when necessary to meet min_align_mask
requirements, making it possible to zero the pre-padding area as well.
Finally, iommu_dma_map_page() uses the swiotlb for untrusted devices
and also for certain kmalloc() memory. Current code does the zeroing
for both cases, but it is needed only for the untrusted device case.
Fix all of this by updating iommu_dma_map_page() to zero both the
pre-padding and post-padding areas, but not the actual bounce buffer.
Do this only in the case where the bounce buffer is used because
of an untrusted device.
[1] https://lore.kernel.org/all/20210929023300.335969-1-stevensd@google.com/
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Petr Tesarik <petr@tesarici.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Currently swiotlb_tbl_map_single() takes alloc_align_mask and
alloc_size arguments to specify an swiotlb allocation that is larger
than mapping_size. This larger allocation is used solely by
iommu_dma_map_single() to handle untrusted devices that should not have
DMA visibility to memory pages that are partially used for unrelated
kernel data.
Having two arguments to specify the allocation is redundant. While
alloc_align_mask naturally specifies the alignment of the starting
address of the allocation, it can also implicitly specify the size
by rounding up the mapping_size to that alignment.
Additionally, the current approach has an edge case bug.
iommu_dma_map_page() already does the rounding up to compute the
alloc_size argument. But swiotlb_tbl_map_single() then calculates the
alignment offset based on the DMA min_align_mask, and adds that offset to
alloc_size. If the offset is non-zero, the addition may result in a value
that is larger than the max the swiotlb can allocate. If the rounding up
is done _after_ the alignment offset is added to the mapping_size (and
the original mapping_size conforms to the value returned by
swiotlb_max_mapping_size), then the max that the swiotlb can allocate
will not be exceeded.
In view of these issues, simplify the swiotlb_tbl_map_single() interface
by removing the alloc_size argument. Most call sites pass the same value
for mapping_size and alloc_size, and they pass alloc_align_mask as zero.
Just remove the redundant argument from these callers, as they will see
no functional change. For iommu_dma_map_page() also remove the alloc_size
argument, and have swiotlb_tbl_map_single() compute the alloc_size by
rounding up mapping_size after adding the offset based on min_align_mask.
This has the side effect of fixing the edge case bug but with no other
functional change.
Also add a sanity test on the alloc_align_mask. While IOMMU code
currently ensures the granule is not larger than PAGE_SIZE, if that
guarantee were to be removed in the future, the downstream effect on the
swiotlb might go unnoticed until strange allocation failures occurred.
Tested on an ARM64 system with 16K page size and some kernel test-only
hackery to allow modifying the DMA min_align_mask and the granule size
that becomes the alloc_align_mask. Tested these combinations with a
variety of original memory addresses and sizes, including those that
reproduce the edge case bug:
* 4K granule and 0 min_align_mask
* 4K granule and 0xFFF min_align_mask (4K - 1)
* 16K granule and 0xFFF min_align_mask
* 64K granule and 0xFFF min_align_mask
* 64K granule and 0x3FFF min_align_mask (16K - 1)
With the changes, all combinations pass.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Petr Tesarik <petr@tesarici.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
It's somewhat hard to see, but arm64's arch_setup_dma_ops() should only
ever call iommu_setup_dma_ops() after a successful iommu_probe_device(),
which means there should be no harm in achieving the same order of
operations by running it off the back of iommu_probe_device() itself.
This then puts it in line with the x86 and s390 .probe_finalize bodges,
letting us pull it all into the main flow properly. As a bonus this lets
us fold in and de-scope the PCI workaround setup as well.
At this point we can also then pull the call up inside the group mutex,
and avoid having to think about whether iommu_group_store_type() could
theoretically race and free the domain if iommu_setup_dma_ops() ran just
*before* iommu_device_use_default_domain() claims it... Furthermore we
replace one .probe_finalize call completely, since the only remaining
implementations are now one which only needs to run once for the initial
boot-time probe, and two which themselves render that path unreachable.
This leaves us a big step closer to realistically being able to unpick
the variety of different things that iommu_setup_dma_ops() has been
muddling together, and further streamline iommu-dma into core API flows
in future.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> # For Intel IOMMU
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/bebea331c1d688b34d9862eefd5ede47503961b8.1713523152.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
It's now easy to retrieve the device's DMA limits if we want to check
them against the domain aperture, so do that ourselves instead of
relying on them being passed through the callchain.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/e28a114243d1e79eb3609aded034f8529521333f.1713523152.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
After redefining alloc_pages, all uses of that name are being replaced.
Change the conflicting names to prevent preprocessor from replacing them
when it's not intended.
Link: https://lkml.kernel.org/r/20240321163705.3067592-18-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Gary Guo <gary@garyguo.net>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Free the IOMMU page tables via iommu_put_pages_list(). The page tables
were allocated via iommu_alloc_* functions in architecture specific
places, but are released in dma-iommu if the freelist is gathered during
map/unmap operations into iommu_iotlb_gather data structure.
Currently, only iommu/intel that does that.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Acked-by: David Rientjes <rientjes@google.com>
Link: https://lore.kernel.org/r/20240413002522.1101315-3-pasha.tatashin@soleen.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
"This has a set of swiotlb alignment fixes for sometimes very long
standing bugs from Will. We've been discussion them for a while and
they should be solid now"
* tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE
iommu/dma: Force swiotlb_max_mapping_size on an untrusted device
swiotlb: Fix alignment checks when both allocation and DMA masks are present
swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()
swiotlb: Enforce page alignment in swiotlb_alloc()
swiotlb: Fix double-allocation of slots due to broken alignment handling
|
|
The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that
an NVME device can map a size between 300KB~512KB, which certainly failed
the swiotlb mappings, though the default pool of swiotlb has many slots:
systemd[1]: Started Journal Service.
=> nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
note: journal-offline[392] exited with irqs disabled
note: journal-offline[392] exited with preempt_count 1
Call trace:
[ 3.099918] swiotlb_tbl_map_single+0x214/0x240
[ 3.099921] iommu_dma_map_page+0x218/0x328
[ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
[ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
[ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
[ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
[ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
[ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
[ 3.103463] __submit_bio+0x44/0x220
[ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
[ 3.103470] submit_bio_noacct+0x180/0x6c8
[ 3.103471] submit_bio+0x34/0x130
[ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
[ 3.104766] mpage_submit_folio+0xa0/0x100
[ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
[ 3.104771] ext4_do_writepages+0x6a0/0xd78
[ 3.105615] ext4_writepages+0x80/0x118
[ 3.105616] do_writepages+0x90/0x1e8
[ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
[ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
[ 3.106656] file_write_and_wait_range+0x84/0x120
[ 3.106658] ext4_sync_file+0x7c/0x4c0
[ 3.106660] vfs_fsync_range+0x3c/0xa8
[ 3.106663] do_fsync+0x44/0xc0
Since untrusted devices might go down the swiotlb pathway with dma-iommu,
these devices should not map a size larger than swiotlb_max_mapping_size.
To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to
take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from
the iommu_dma_opt_mapping_size().
Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
Link: https://lore.kernel.org/r/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
[will: Drop redundant is_swiotlb_active(dev) check]
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
iommu-dma does not explicitly reference min_align_mask since we already
assume that will be less than or equal to any typical IOVA granule.
We wouldn't realistically expect to see the case where it is larger, and
that would be non-trivial to support, however for the sake of reasoning
(particularly around the interaction with SWIOTLB), let's clearly
enforce the assumption.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/dbb4d2d8e5d1691ac9a6c67e9758904e6c447ba5.1709553942.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core changes:
- Fix race conditions in device probe path
- Retire IOMMU bus_ops
- Support for passing custom allocators to page table drivers
- Clean up Kconfig around IOMMU_SVA
- Support for sharing SVA domains with all devices bound to a mm
- Firmware data parsing cleanup
- Tracing improvements for iommu-dma code
- Some smaller fixes and cleanups
ARM-SMMU drivers:
- Device-tree binding updates:
- Add additional compatible strings for Qualcomm SoCs
- Document Adreno clocks for Qualcomm's SM8350 SoC
- SMMUv2:
- Implement support for the ->domain_alloc_paging() callback
- Ensure Secure context is restored following suspend of Qualcomm
SMMU implementation
- SMMUv3:
- Disable stalling mode for the "quiet" context descriptor
- Minor refactoring and driver cleanups
Intel VT-d driver:
- Cleanup and refactoring
AMD IOMMU driver:
- Improve IO TLB invalidation logic
- Small cleanups and improvements
Rockchip IOMMU driver:
- DT binding update to add Rockchip RK3588
Apple DART driver:
- Apple M1 USB4/Thunderbolt DART support
- Cleanups
Virtio IOMMU driver:
- Add support for iotlb_sync_map
- Enable deferred IO TLB flushes"
* tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (66 commits)
iommu: Don't reserve 0-length IOVA region
iommu/vt-d: Move inline helpers to header files
iommu/vt-d: Remove unused vcmd interfaces
iommu/vt-d: Remove unused parameter of intel_pasid_setup_pass_through()
iommu/vt-d: Refactor device_to_iommu() to retrieve iommu directly
iommu/sva: Fix memory leak in iommu_sva_bind_device()
dt-bindings: iommu: rockchip: Add Rockchip RK3588
iommu/dma: Trace bounce buffer usage when mapping buffers
iommu/arm-smmu: Convert to domain_alloc_paging()
iommu/arm-smmu: Pass arm_smmu_domain to internal functions
iommu/arm-smmu: Implement IOMMU_DOMAIN_BLOCKED
iommu/arm-smmu: Convert to a global static identity domain
iommu/arm-smmu: Reorganize arm_smmu_domain_add_master()
iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Master cannot be NULL in arm_smmu_write_strtab_ent()
iommu/arm-smmu-v3: Add a type for the STE
iommu/arm-smmu-v3: disable stall for quiet_cd
iommu/qcom: restore IOMMU state if needed
iommu/arm-smmu-qcom: Add QCM2290 MDSS compatible
iommu/arm-smmu-qcom: Add missing GMU entry to match table
...
|
|
commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.
To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.
Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When commit 82612d66d51d ("iommu: Allow the dma-iommu api to
use bounce buffers") was introduced, it did not add the logic
for tracing the bounce buffer usage from iommu_dma_map_page().
All of the users of swiotlb_tbl_map_single() trace their bounce
buffer usage, except iommu_dma_map_page(). This makes it difficult
to track SWIOTLB usage from that function. Thus, trace bounce buffer
usage from iommu_dma_map_page().
Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers")
Cc: stable@vger.kernel.org # v5.15+
Cc: Tom Murphy <murphyt7@tcd.ie>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Saravana Kannan <saravanak@google.com>
Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com>
Link: https://lore.kernel.org/r/20231208234141.2356157-1-isaacmanjarres@google.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Flush queues currently use a fixed compile time size of 256 entries.
This being a power of 2 allows the compiler to use shift and mask
instead of more expensive modulo operations. With per-CPU flush queues
larger queue sizes would hit per-CPU allocation limits, with a single
flush queue these limits do not apply however. Also with single queues
being particularly suitable for virtualized environments with expensive
IOTLB flushes these benefit especially from larger queues and thus fewer
flushes.
To this end re-order struct iova_fq so we can use a dynamic array and
introduce the flush queue size and timeouts as new options in the
iommu_dma_options struct. So as not to lose the shift and mask
optimization, use a power of 2 for the length and use explicit shift and
mask instead of letting the compiler optimize this.
A large queue size and 1 second timeout is then set for the shadow on
flush case set by s390 paged memory guests. This then brings performance
on par with the previous s390 specific DMA API implementation.
Acked-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> #s390
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20230928-dma_iommu-v13-6-9e5fc4dacc36@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
In some virtualized environments, including s390 paged memory guests,
IOTLB flushes are used to update IOMMU shadow tables. Due to this, they
are much more expensive than in typical bare metal environments or
non-paged s390 guests. In addition they may parallelize poorly in
virtualized environments. This changes the trade off for flushing IOVAs
such that minimizing the number of IOTLB flushes trumps any benefit of
cheaper queuing operations or increased paralellism.
In this scenario per-CPU flush queues pose several problems. Firstly
per-CPU memory is often quite limited prohibiting larger queues.
Secondly collecting IOVAs per-CPU but flushing via a global timeout
reduces the number of IOVAs flushed for each timeout especially on s390
where PCI interrupts may not be bound to a specific CPU.
Let's introduce a single flush queue mode that reuses the same queue
logic but only allocates a single global queue. This mode is selected by
dma-iommu if a newly introduced .shadow_on_flush flag is set in struct
dev_iommu. As a first user the s390 IOMMU driver sets this flag during
probe_device. With the unchanged small FQ size and timeouts this setting
is worse than per-CPU queues but a follow up patch will make the FQ size
and timeout variable. Together this allows the common IOVA flushing code
to more closely resemble the global flush behavior used on s390's
previous internal DMA API implementation.
Link: https://lore.kernel.org/all/9a466109-01c5-96b0-bf03-304123f435ee@arm.com/
Acked-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> #s390
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Link: https://lore.kernel.org/r/20230928-dma_iommu-v13-5-9e5fc4dacc36@linux.ibm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
Per the reasoning in commit 4bf7fda4dce2 ("iommu/dma: Add config for
PCI SAC address trick") and its subsequent revert, this mechanism no
longer serves its original purpose, but now only works around broken
hardware/drivers in a way that is unfortunately too impactful to remove.
This does not, however, prevent us from solving the performance impact
which that workaround has on large-scale systems that don't need it.
Once the 32-bit IOVA space fills up and a workload starts allocating and
freeing on both sides of the boundary, the opportunistic SAC allocation
can then end up spending significant time hunting down scattered
fragments of free 32-bit space, or just reestablishing max32_alloc_size.
This can easily be exacerbated by a change in allocation pattern, such
as by changing the network MTU, which can increase pressure on the
32-bit space by leaving a large quantity of cached IOVAs which are now
the wrong size to be recycled, but also won't be freed since the
non-opportunistic allocations can still be satisfied from the whole
64-bit space without triggering the reclaim path.
However, in the context of a workaround where smaller DMA addresses
aren't simply a preference but a necessity, if we get to that point at
all then in fact it's already the endgame. The nature of the allocator
is currently such that the first IOVA we give to a device after the
32-bit space runs out will be the highest possible address for that
device, ever. If that works, then great, we know we can optimise for
speed by always allocating from the full range. And if it doesn't, then
the worst has already happened and any brokenness is now showing, so
there's little point in continuing to try to hide it.
To that end, implement a flag to refine the SAC business into a
per-device policy that can automatically get itself out of the way if
and when it stops being useful.
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/b8502b115b915d2a3fabde367e099e39106686c8.1681392791.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core changes:
- iova_magazine_alloc() optimization
- Make flush-queue an IOMMU driver capability
- Consolidate the error handling around device attachment
AMD IOMMU changes:
- AVIC Interrupt Remapping Improvements
- Some minor fixes and cleanups
Intel VT-d changes from Lu Baolu:
- Small and misc cleanups
ARM-SMMU changes from Will Deacon:
- Device-tree binding updates:
- Add missing clocks for SC8280XP and SA8775 Adreno SMMUs
- Add two new Qualcomm SMMUs in SDX75 and SM6375
- Workarounds for Arm MMU-700 errata:
- 1076982: Avoid use of SEV-based cmdq wakeup
- 2812531: Terminate command batches with a CMD_SYNC
- Enforce single-stage translation to avoid nesting-related errata
- Set the correct level hint for range TLB invalidation on teardown
.. and some other minor fixes and cleanups (including Freescale PAMU
and virtio-iommu changes)"
* tag 'iommu-updates-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (50 commits)
iommu/vt-d: Remove commented-out code
iommu/vt-d: Remove two WARN_ON in domain_context_mapping_one()
iommu/vt-d: Handle the failure case of dmar_reenable_qi()
iommu/vt-d: Remove unnecessary (void*) conversions
iommu/amd: Remove extern from function prototypes
iommu/amd: Use BIT/BIT_ULL macro to define bit fields
iommu/amd: Fix DTE_IRQ_PHYS_ADDR_MASK macro
iommu/amd: Fix compile error for unused function
iommu/amd: Improving Interrupt Remapping Table Invalidation
iommu/amd: Do not Invalidate IRT when IRTE caching is disabled
iommu/amd: Introduce Disable IRTE Caching Support
iommu/amd: Remove the unused struct amd_ir_data.ref
iommu/amd: Switch amd_iommu_update_ga() to use modify_irte_ga()
iommu/arm-smmu-v3: Set TTL invalidation hint better
iommu/arm-smmu-v3: Document nesting-related errata
iommu/arm-smmu-v3: Add explicit feature for nesting
iommu/arm-smmu-v3: Document MMU-700 erratum 2812531
iommu/arm-smmu-v3: Work around MMU-600 erratum 1076982
dt-bindings: arm-smmu: Add SDX75 SMMU compatible
dt-bindings: arm-smmu: Add SM6375 GPU SMMU
...
|
|
Similarly to the direct DMA, bounce small allocations as they may have
originated from a kmalloc() cache not safe for DMA. Unlike the direct
DMA, iommu_dma_map_sg() cannot call iommu_dma_map_sg_swiotlb() for all
non-coherent devices as this would break some cases where the iova is
expected to be contiguous (dmabuf). Instead, scan the scatterlist for
any small sizes and only go the swiotlb path if any element of the list
needs bouncing (note that iommu_dma_map_page() would still only bounce
those buffers which are not DMA-aligned).
To avoid scanning the scatterlist on the 'sync' operations, introduce an
SG_DMA_SWIOTLB flag set by iommu_dma_map_sg_swiotlb(). The
dev_use_swiotlb() function together with the newly added
dev_use_sg_swiotlb() now check for both untrusted devices and unaligned
kmalloc() buffers (suggested by Robin Murphy).
Link: https://lkml.kernel.org/r/20230612153201.554742-16-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jerry Snitselaar <jsnitsel@redhat.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
sg_is_dma_bus_address() is inconsistent with the naming pattern of its
corresponding setters and its own kerneldoc, so take the majority vote and
rename it sg_dma_is_bus_address() (and fix up the missing underscores in
the kerneldoc too). This gives us a nice clear pattern where SG DMA flags
are SG_DMA_<NAME>, and the helpers for acting on them are
sg_dma_<action>_<name>().
Link: https://lkml.kernel.org/r/20230612153201.554742-14-catalin.marinas@arm.com
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Link: https://lore.kernel.org/r/fa2eca2862c7ffc41b50337abffb2dfd2864d3ea.1685036694.git.robin.murphy@arm.com
Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It remains really handy to have distinct DMA domain types within core
code for the sake of default domain policy selection, but we can now
hide that detail from drivers by using the new capability instead.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Jerry Snitselaar <jsnitsel@redhat.com> # amd, intel, smmu-v3
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1c552d99e8ba452bdac48209fa74c0bdd52fd9d9.1683233867.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
|