Age | Commit message | Author
2025-05-02iommu: Cleanup comments for dev_enable/disable_featLu Baolu
The dev_enable/disable_feat ops have been removed by commit f984fb09e60e ("iommu: Remove iommu_dev_enable/disable_feature()"). Clean up the stale comments that still refer to them. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20250430025249.2371751-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28iommu: Protect against overflow in iommu_pgsize()Jason Gunthorpe
On a 32 bit system calling: iommu_map(0, 0x40000000) when using the AMD V1 page table type with a domain->pgsize of 0xfffff000 causes iommu_pgsize() to miscalculate a result of size=0x40000000, count=2; count should be 1. This completely corrupts the mapping process. The final test that adjusts the page size malfunctions when the addition overflows. Use check_add_overflow() to prevent this. Fixes: b1d99dc5f983 ("iommu: Hook up '->unmap_pages' driver callback") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Link: https://lore.kernel.org/r/0-v1-3ad28fc2e3a3+163327-iommu_overflow_pgsize_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
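A minimal sketch of the pattern the fix relies on, assuming a hypothetical helper; the simplified logic stands in for the real iommu_pgsize() body:

    #include <linux/overflow.h>

    /* Compute the end of [iova, iova + size) safely. On 32-bit the plain
     * addition can wrap past ULONG_MAX, making a later "addr + pgsize <= end"
     * test pass when it should not; check_add_overflow() makes the wrap
     * explicit instead of silently producing a small bogus end address.
     */
    static bool range_end_ok(unsigned long iova, size_t size, unsigned long *end)
    {
            return !check_add_overflow(iova, (unsigned long)size, end);
    }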
2025-04-28iommu: Handle yet another race around registrationRobin Murphy
Next up on our list of race windows to close is another one during iommu_device_register() - it's now OK again for multiple instances to run their bus_iommu_probe() in parallel, but an iommu_probe_device() can still also race against a running bus_iommu_probe(). As Johan has managed to prove, this has now become a lot more visible on DT platforms with driver_async_probe, where a client driver attempts to probe in parallel with its IOMMU driver - although commit b46064a18810 ("iommu: Handle race with default domain setup") resolves this from the client driver's point of view, that is not before of_iommu_configure() has had the chance to attempt to "replay" a probe that the bus walk hasn't even tried yet, and so can still cause the out-of-order group allocation behaviour that we're trying to clean up (and now warn about). The most reliable thing to do here is to explicitly keep track of the "iommu_device_register() is still running" state, so we can then special-case the ops lookup for the replay path (based on dev->iommu again) to let that think it's still waiting for the IOMMU driver to appear at all. This still leaves the longstanding theoretical case of iommu_bus_notifier() being triggered during bus_iommu_probe(), but it's not so simple to defer a notifier, and nobody's ever reported it as a visible issue, so let's quietly kick that can down the road for now... Reported-by: Johan Hovold <johan@kernel.org> Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path") Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/88d54c1b48fed8279aa47d30f3d75173685bb26a.1745516488.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28iommu: Allow attaching static domains in iommu_attach_device_pasid()Lu Baolu
The idxd driver attaches the default domain to a PASID of the device to perform kernel DMA using that PASID. The domain is attached to the device's PASID through iommu_attach_device_pasid(), which checks if the domain->owner matches the iommu_ops retrieved from the device. If they do not match, it returns a failure: if (ops != domain->owner || pasid == IOMMU_NO_PASID) return -EINVAL; The static identity domain implemented by the intel iommu driver doesn't specify the domain owner. Therefore, kernel DMA with PASID doesn't work for the idxd driver if the device translation mode is set to passthrough. Generally the owner field of static domains is not set, because they are already part of the iommu ops. Add a helper domain_iommu_ops_compatible() that checks if a domain is compatible with the device's iommu ops. This helper explicitly allows the static blocked and identity domains associated with the device's iommu_ops to be considered compatible. Fixes: 2031c469f816 ("iommu/vt-d: Add support for static identity domain") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220031 Cc: stable@vger.kernel.org Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/linux-iommu/20250422191554.GC1213339@ziepe.ca/ Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20250424034123.2311362-1-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
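A hedged sketch of the helper this describes; the field names follow the upstream struct iommu_ops, which carries pointers to its static identity/blocked domains, but this approximates rather than reproduces the patch:

    static bool domain_iommu_ops_compatible(const struct iommu_ops *ops,
                                            struct iommu_domain *domain)
    {
            if (domain->owner == ops)
                    return true;

            /* Static domains don't set owner; match them through the ops. */
            return domain == ops->identity_domain ||
                   domain == ops->blocked_domain;
    }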
2025-04-28iommu/io-pgtable-arm: dynamically allocate selftest device structArnd Bergmann
In general a 'struct device' is way too large to be put on the kernel stack. Apparently something just caused it to grow slightly larger, which pushed the arm_lpae_do_selftests() function over the warning limit in some configurations:

drivers/iommu/io-pgtable-arm.c:1423:19: error: stack frame size (1032) exceeds limit (1024) in 'arm_lpae_do_selftests' [-Werror,-Wframe-larger-than]
 1423 | static int __init arm_lpae_do_selftests(void)
      |                   ^

Change the function to use a dynamically allocated faux_device instead of the on-stack device structure. Fixes: ca25ec247aad ("iommu/io-pgtable-arm: Remove iommu_dev==NULL special case") Link: https://lore.kernel.org/all/ab75a444-22a1-47f5-b3c0-253660395b5a@arm.com/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/20250423164826.2931382-1-arnd@kernel.org Signed-off-by: Joerg Roedel <jroedel@suse.de>
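A sketch of the shape of the change, assuming the faux device API (faux_device_create()/faux_device_destroy(); exact arguments hedged, and the selftest body is reduced to the allocation pattern):

    #include <linux/device/faux.h>

    static int __init lpae_selftest_sketch(void)
    {
            struct faux_device *fdev;

            /* One heap-backed dummy device instead of ~1KB of struct device
             * on the init task's stack.
             */
            fdev = faux_device_create("io-pgtable-test", NULL, NULL);
            if (!fdev)
                    return -ENOMEM;

            /* ... run the page table selftests against &fdev->dev ... */

            faux_device_destroy(fdev);
            return 0;
    }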
2025-04-28iommu: Hide ops.domain_alloc behind CONFIG_FSL_PAMUJason Gunthorpe
fsl_pamu is the last user of domain_alloc(), and it is using it to create something weird that doesn't really fit into the iommu subsystem architecture. It is not a paging domain, since it doesn't have any map/unmap ops; it may be some special kind of identity domain. For now just leave it as is. Wrap its definition in CONFIG_FSL_PAMU to discourage any new drivers from attempting to use it. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/5-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
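The shape of the wrap, as an illustrative excerpt rather than the full struct iommu_ops definition:

    struct iommu_domain;
    struct device;

    struct iommu_ops_excerpt {
    #ifdef CONFIG_FSL_PAMU
            /* Legacy allocator; fsl_pamu is the only remaining user. */
            struct iommu_domain *(*domain_alloc)(unsigned int iommu_domain_type);
    #endif
            struct iommu_domain *(*domain_alloc_paging)(struct device *dev);
    };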
2025-04-28iommu: Do not call domain_alloc() in iommu_sva_domain_alloc()Jason Gunthorpe
No driver implements SVA under domain_alloc() anymore; this is dead code. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/4-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28iommu/virtio: Move to domain_alloc_paging()Jason Gunthorpe
virtio has the complication that it sometimes wants to return a paging domain for IDENTITY which makes this conversion a little different than other drivers. Add a viommu_domain_alloc_paging() that combines viommu_domain_alloc() and viommu_domain_finalise() to always return a fully initialized and finalized paging domain. Use viommu_domain_alloc_identity() to implement the special non-bypass IDENTITY flow by calling viommu_domain_alloc_paging() then viommu_domain_map_identity(). Remove support for deferred finalize and the vdomain->mutex. Remove core support for domain_alloc() IDENTITY as virtio was the last driver using it. Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/3-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28iommu: Add domain_alloc_identity()Jason Gunthorpe
virtio-iommu has a mode where the IDENTITY domain is actually a paging domain with an identity mapping covering some of the system address space manually created. To support this add a new domain_alloc_identity() op that accepts the struct device so that virtio can allocate and fully finalize a paging domain to return. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/2-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28iommu/virtio: Break out bypass identity support into a global staticJason Gunthorpe
To make way for a domain_alloc_paging conversion add the typical global static IDENTITY domain. This supports VMMs that have a VIRTIO_IOMMU_F_BYPASS_CONFIG config. If the VMM does not have support then the domain_alloc path is still used, which creates an IDENTITY domain out of a paging domain. Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/1-v4-ff5fb6b03bd1+288-iommu_virtio_domains_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
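The "typical global static IDENTITY domain" follows the pattern other drivers use; a hedged sketch with names modelled on the driver's prefix, not necessarily the exact patch:

    static int viommu_attach_identity(struct iommu_domain *domain,
                                      struct device *dev)
    {
            /* With VIRTIO_IOMMU_F_BYPASS_CONFIG, ask the VMM to put the
             * endpoint into bypass rather than building identity mappings.
             */
            return 0; /* sketch: issue the attach request with bypass set */
    }

    static const struct iommu_domain_ops viommu_identity_ops = {
            .attach_dev = viommu_attach_identity,
    };

    static struct iommu_domain viommu_identity_domain = {
            .type = IOMMU_DOMAIN_IDENTITY,
            .ops  = &viommu_identity_ops,
    };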
2025-04-28iommu: Remove iommu_dev_enable/disable_feature()Lu Baolu
No external drivers use these interfaces anymore. Furthermore, no existing iommu drivers implement anything in the callbacks. Remove them to avoid dead code. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Link: https://lore.kernel.org/r/20250418080130.1844424-9-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28iommufd: Remove unnecessary IOMMU_DEV_FEAT_IOPFLu Baolu
The iopf enablement has been moved to the iommu drivers. It is unnecessary for iommufd to handle iopf enablement. Remove the iopf enablement logic to avoid duplication. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/r/20250418080130.1844424-8-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28uacce: Remove unnecessary IOMMU_DEV_FEAT_IOPFLu Baolu
None of the drivers implement anything for IOMMU_DEV_FEAT_IOPF anymore, remove it to avoid dead code. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Zhangfei Gao <zhangfei.gao@linaro.org> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/r/20250418080130.1844424-7-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28dmaengine: idxd: Remove unnecessary IOMMU_DEV_FEAT_IOPFLu Baolu
The IOMMU_DEV_FEAT_IOPF implementation in the iommu driver is just a no-op. It will also be removed from the iommu driver in the subsequent patch. Remove it to avoid dead code. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Acked-by: Vinod Koul <vkoul@kernel.org> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/r/20250418080130.1844424-6-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28iommufd/selftest: Put iopf enablement in domain attach pathLu Baolu
Update iopf enablement in the iommufd mock device driver to use the new method, similar to the arm-smmu-v3 driver. Enable iopf support when any domain with an iopf_handler is attached, and disable it when the domain is removed. Add a refcount in the mock device state structure to keep track of the number of domains set to the device and PASIDs that require iopf. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/r/20250418080130.1844424-5-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28iommu/vt-d: Put iopf enablement in domain attach pathLu Baolu
Update iopf enablement in the driver to use the new method, similar to the arm-smmu-v3 driver. Enable iopf support when any domain with an iopf_handler is attached, and disable it when the domain is removed. Place all the logic for controlling the PRI and iopf queue in the domain set/remove/replace paths. Keep track of the number of domains set to the device and PASIDs that require iopf. When the first domain requiring iopf is attached, add the device to the iopf queue and enable PRI. When the last domain is removed, remove it from the iopf queue and disable PRI. Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/r/20250418080130.1844424-4-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
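The refcount scheme reads roughly like this hedged sketch (illustrative names, not the actual vt-d symbols):

    struct dev_pri_state {
            unsigned int iopf_refs;  /* attached domains/PASIDs needing iopf */
    };

    static int iopf_for_domain_set(struct dev_pri_state *s,
                                   struct iommu_domain *domain)
    {
            if (!domain || !domain->iopf_handler)
                    return 0;
            if (s->iopf_refs++ == 0) {
                    /* first user: add device to the iopf queue, enable PRI */
            }
            return 0;
    }

    static void iopf_for_domain_remove(struct dev_pri_state *s,
                                       struct iommu_domain *domain)
    {
            if (!domain || !domain->iopf_handler)
                    return;
            if (--s->iopf_refs == 0) {
                    /* last user: disable PRI, remove from the iopf queue */
            }
    }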
2025-04-28iommu: Remove IOMMU_DEV_FEAT_SVAJason Gunthorpe
None of the drivers implement anything here anymore, remove the dead code. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/r/20250418080130.1844424-3-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-28iommu/arm-smmu-v3: Put iopf enablement in the domain attach pathJason Gunthorpe
SMMUv3 co-mingles FEAT_IOPF and FEAT_SVA behaviors so that fault reporting doesn't work unless both are enabled. This is not correct and causes problems for iommufd, which does not enable FEAT_SVA for its fault-capable domains. These APIs are both obsolete; update SMMUv3 to use the new method, like AMD's implementation. A driver should enable iopf support when a domain with an iopf_handler is attached, and disable iopf support when the domain is removed. Move the fault support logic to sva domain allocation and to domain attach, refusing to create or attach fault-capable domains if the HW doesn't support it. Move all the logic for controlling the iopf queue under arm_smmu_attach_prepare(). Keep track of the number of domains on the master (over all the SSIDs) that require iopf. When the first domain requiring iopf is attached, create the iopf queue; when the last domain is detached, destroy it. Turn FEAT_IOPF and FEAT_SVA into no-ops. Remove the sva_lock; this is all protected by the group mutex. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250418080130.1844424-2-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu: Split out and tidy up Arm KconfigRobin Murphy
There are quite a lot of options for the Arm drivers, still all buried in the top-level Kconfig. For ease of use and consistency with all the other subdirectories, break these out into drivers/iommu/arm. For similar clarity and self-consistency, also tweak the ARM_SMMU sub-options to use "if" instead of "depends", to match ARM_SMMU_V3. Lastly, clean up the slightly messy description of ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT as highlighted by Geert - by now we really shouldn't need commentary on v4.x kernel behaviour anyway - and downgrade it to EXPERT as the first step in the 6-year-old threat to remove it entirely. Cc: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Link: https://lore.kernel.org/r/a614ec86ba78c09cd16e348f633f6bb38793391f.1742480488.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu: Avoid introducing more racesRobin Murphy
Although the lock-juggling is only a temporary workaround, we don't want it to make things avoidably worse. Jason was right to be nervous, since bus_iommu_probe() doesn't care *which* IOMMU instance it's probing for, so it probably is possible for one walk to finish a probe which a different walk started; thus we do want to check for that. Also there's no need to drop the lock just to have of_iommu_configure() do nothing when a fwspec already exists; check that directly and avoid opening a window at all in that (still somewhat likely) case. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/09d901ad11b3a410fbb6e27f7d04ad4609c3fe4a.1741706365.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/vt-d: Remove iommu_alloc_pages_node()Jason Gunthorpe
Intel is the only remaining user of this; convert to the size versions, trying to avoid PAGE_SHIFT. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/23-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/amd: Use iommu_alloc_pages_node_sz() for the IRTJason Gunthorpe
Use the actual size of the irq_table allocation, limiting it to 128 bytes due to the HW alignment needs. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/22-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/pages: Remove iommu_alloc_page_node()Jason Gunthorpe
Use iommu_alloc_pages_node_sz() instead. AMD and Intel are both using 4K pages for these structures since those drivers only work on 4K PAGE_SIZE. riscv is also spec'd to use SZ_4K. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/21-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
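The conversion is mechanical; a hedged before/after sketch, assuming the _sz helper takes an explicit byte size:

    static void *alloc_pt_page(int nid)
    {
            /* was: iommu_alloc_page_node(nid, GFP_KERNEL);
             * the 4K that used to be implied by "one page" is now spelled
             * out, matching what the HW actually requires.
             */
            return iommu_alloc_pages_node_sz(nid, GFP_KERNEL, SZ_4K);
    }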
2025-04-17iommu/pages: Remove iommu_alloc_page/pages()Jason Gunthorpe
A few small changes to the remaining drivers using these will allow them to be removed:

 - Exynos wants to allocate fixed 16K/8K allocations
 - Rockchip already has a define SPAGE_SIZE which is used by the dma_map immediately following, using SPAGE_ORDER which is a lg2size
 - tegra has size constants already for its two allocations

Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu: Update various drivers to pass in lg2sz instead of order to iommu pagesJason Gunthorpe
Convert most of the places calling get_order() as an argument to the iommu-pages allocator into order_base_2() or the _sz flavour instead. These places already have an exact size, there is no particular reason to use order here. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/19-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
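The distinction in one worked example: get_order() speaks in CPU pages, order_base_2() in bytes, so only the latter can express a sub-page size.

    /* On a 4K-page kernel:
     *   get_order(2048)    == 0    (rounds a 2K request up to one whole page)
     *   order_base_2(2048) == 11   (exact lg2 of the byte size)
     * so size/lg2size interfaces preserve sub-page intent that order lost.
     */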
2025-04-17iommu/riscv: Update to use iommu_alloc_pages_node_lg2()Jason Gunthorpe
One part of RISCV already has a computed size, however the queue allocation must be aligned to 4k. The other objects are 4k by spec. Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/18-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/amd: Use roundup_pow_of_two() instead of get_order()Jason Gunthorpe
If x >= PAGE_SIZE then: 1 << (get_order(x) + PAGE_SHIFT) == roundup_pow_of_two(x) Inline this into the only caller, and compute the size of the HW device table in terms of 4K pages, which matches the HW definition. Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/17-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
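A worked instance of the identity, on a 4K-page kernel (PAGE_SHIFT = 12):

    /*   x = 0x3000 (12 KiB, >= PAGE_SIZE)
     *   get_order(0x3000)          == 2       (needs four 4K pages)
     *   1 << (2 + 12)              == 0x4000
     *   roundup_pow_of_two(0x3000) == 0x4000  (same result)
     */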
2025-04-17iommu/amd: Change rlookup, irq_lookup, and alias to use kvmalloc_array()Jason Gunthorpe
This is just CPU memory used by the driver to track things, it doesn't need to use iommu-pages. All of them are indexed by devid and devid is bounded by pci_seg->last_bdf or we are already out of bounds on the page allocation. Switch them to use some version of kvmalloc_array() and drop the now unused constants and remove the tbl_size() round up to PAGE_SIZE multiples logic. Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/16-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
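A hedged sketch of what such a conversion looks like (illustrative helper; the entry count is bounded by last_bdf as the commit notes):

    static void *alloc_devid_table(u16 last_bdf, size_t entry_size)
    {
            /* was roughly: iommu_alloc_pages(GFP_KERNEL, get_order(tbl_size));
             * plain (possibly vmalloc-backed) CPU memory is all this needs.
             */
            return kvcalloc(last_bdf + 1, entry_size, GFP_KERNEL);
    }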
2025-04-17iommu/pages: Allow sub page sizes to be passed into the allocatorJason Gunthorpe
Generally drivers have a specific idea what their HW structure size should be. In a lot of cases this is related to PAGE_SIZE, but not always. ARM64, for example, allows a 4K IO page table size on a 64K CPU page table system. Currently we don't have any good support for sub page allocations, but make the API accommodate this by accepting a sub page size from the caller and rounding up internally. This is done by moving away from order as the size input and using size: size == 1 << (order + PAGE_SHIFT) Following patches convert drivers away from using order and try to specify allocation sizes independent of PAGE_SIZE. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/15-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/pages: Move the __GFP_HIGHMEM checks into the common codeJason Gunthorpe
The entire allocator API is built around using the kernel virtual address, it is illegal to pass GFP_HIGHMEM in as a GFP flag. Block it in the common code. Remove the duplicated checks from drivers. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/14-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
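The guard amounts to something like this hedged sketch of the common-code check:

    static void *iommu_alloc_checked(gfp_t gfp, unsigned int order)
    {
            /* The API hands back kernel virtual addresses; highmem pages
             * have no permanent kernel mapping, so reject the flag early.
             */
            if (WARN_ON(gfp & __GFP_HIGHMEM))
                    return NULL;
            return (void *)__get_free_pages(gfp, order);
    }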
2025-04-17iommu/pages: Move from struct page to struct ioptdesc and folioJason Gunthorpe
This brings the iommu page table allocator into the modern world of having its own private page descriptor and not re-using fields from struct page for its own purpose. It follows the basic pattern of struct ptdesc which did this transformation for the CPU page table allocator. Currently iommu-pages is pretty basic so this isn't a huge benefit, however I see a coming need for features that CPU allocator has, like sub PAGE_SIZE allocations, and RCU freeing. This provides the base infrastructure to implement those cleanly. Remove numa_node_id() calls from the inlines and instead use NUMA_NO_NODE which will get switched to numa_mem_id(), which seems to be the right ID to use for memory allocations. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/13-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/pages: Remove iommu_put_pages_list_old and the _GenericJason Gunthorpe
Nothing uses the old list_head path now, remove it. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/12-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu: Change iommu_iotlb_gather to use iommu_page_listJason Gunthorpe
This converts the remaining places using list of pages to the new API. The Intel free path was shared with its gather path, so it is converted at the same time. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/11-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/amd: Convert to use struct iommu_pages_listJason Gunthorpe
Change the internal freelist to use struct iommu_pages_list. AMD uses the freelist to batch free the entire table during domain destruction, and to replace table levels with leafs during map. Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/10-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/riscv: Convert to use struct iommu_pages_listJason Gunthorpe
Change the internal freelist to use struct iommu_pages_list. riscv uses this page list to free page table levels that are replaced with leaf ptes. Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/9-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/pages: Formalize the freelist APIJason Gunthorpe
We want to get rid of struct page references outside the internal allocator implementation. The free list has the driver open code something like: list_add_tail(&virt_to_page(ptr)->lru, freelist); Move the above into a small inline and make the freelist into a wrapper type 'struct iommu_pages_list' so that the compiler can help check all the conversions. This struct has also proven helpful for some future ideas: converting to a singly linked list to gain an extra pointer in the struct page, and signalling that the pages should be freed with RCU. Use a temporary _Generic so we don't need to rename the free function as the patches progress. Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/8-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
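From the description, the wrapper is roughly as follows (hedged; the real names and fields may differ slightly):

    struct iommu_pages_list {
            struct list_head pages;
    };

    static inline void iommu_pages_list_add(struct iommu_pages_list *list,
                                            void *virt)
    {
            /* replaces the open-coded driver pattern:
             *   list_add_tail(&virt_to_page(ptr)->lru, freelist);
             */
            list_add_tail(&virt_to_page(virt)->lru, &list->pages);
    }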
2025-04-17iommu/pages: De-inline the substantial functionsJason Gunthorpe
These are called in a lot of places and are not trivial. Move them to the core module. Tidy some of the comments and function arguments, fold __iommu_alloc_account() into its only caller, change __iommu_free_account() into __iommu_free_page() to remove some duplication. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/7-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/pages: Remove iommu_free_page()Jason Gunthorpe
Use iommu_free_pages() instead. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/6-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/pages: Remove the order argument to iommu_free_pages()Jason Gunthorpe
Now that we have a folio under the allocation, iommu_free_pages() can know the order of the original allocation and do the correct thing to free it. The next patch will rename iommu_free_page() to iommu_free_pages(), so we have naming consistency with iommu_alloc_pages_node(). Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/5-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/pages: Make iommu_put_pages_list() work with high order allocationsJason Gunthorpe
alloc_pages_node(, order) needs to be paired with __free_pages(, order) to free all the allocated pages. For order != 0 the return from alloc_pages_node() is just a page list, it hasn't been formed into a folio. However iommu_put_pages_list() just calls put_page() on the head page of an allocation, which will end up leaking the tail pages if order != 0. Fix this by using __GFP_COMP to create a high order folio and then always use put_page() to free the full high order folio. __iommu_free_account() can get the order of the allocation via folio_order(), which corrects the accounting of high order allocations in iommu_put_pages_list(). This is the same technique slub uses. As far as I can tell, none of the places using high order allocations are also using the free list, so this is not a current bug. Fixes: 06c375053cef ("iommu/vt-d: add wrapper functions for page allocations") Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/4-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
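The pattern of the fix, as a hedged sketch:

    static void *iommu_alloc_hi(int nid, gfp_t gfp, unsigned int order)
    {
            /* __GFP_COMP makes the high-order allocation a folio, tying the
             * tail pages to the head so one put suffices to free them all.
             */
            struct page *p = alloc_pages_node(nid, gfp | __GFP_COMP, order);

            return p ? page_address(p) : NULL;
    }

    static void iommu_free_hi(void *virt)
    {
            struct folio *folio = virt_to_folio(virt);

            /* folio_order(folio) recovers the allocation size for accounting */
            folio_put(folio);
    }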
2025-04-17iommu/pages: Remove __iommu_alloc_pages()/__iommu_free_pages()Jason Gunthorpe
These were only used by tegra-smmu and leaked the struct page out of the API. Delete them since tegra-smmu has been converted to the other APIs. In the process, flatten the call tree so we have fewer one-line functions calling other one-line functions. iommu_alloc_pages_node() is the real allocator and everything else can just call it directly. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/3-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/tegra: Do not use struct page as the handle for ptsJason Gunthorpe
Instead use the virtual address and dma_map_single(), as as->pd does. Introduce a small struct tegra_pt instead of void * to make it clearer what is using this API, and to add compile safety during the conversion. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-17iommu/tegra: Do not use struct page as the handle for as->pd memoryJason Gunthorpe
Instead use the virtual address. Change from dma_map_page() to dma_map_single() which works directly on a KVA. Add a type for the pd table level for clarity. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2025-04-13Linux 6.15-rc2v6.15-rc2Linus Torvalds
2025-04-13Merge tag 'erofs-for-6.15-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofsLinus Torvalds
Pull erofs fixes from Gao Xiang:
 - Properly handle errors when file-backed I/O fails
 - Fix compilation issues on ARM platform (arm-linux-gnueabi)
 - Fix parsing of encoded extents
 - Minor cleanup

* tag 'erofs-for-6.15-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: remove duplicate code
  erofs: fix encoded extents handling
  erofs: add __packed annotation to union(__le16..)
  erofs: set error to bio if file-backed IO fails
2025-04-13Merge tag 'ext4_for_linus-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds
Pull ext4 fixes from Ted Ts'o:
 "A few more miscellaneous ext4 bug fixes and cleanups, including some syzbot failures and a fix for a case where a stale file handle referencing an inode previously used as a regular file, but since deleted and reused as an ea_inode, would result in ext4 erroneously considering this a case of fs corruption"

* tag 'ext4_for_linus-6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix off-by-one error in do_split
  ext4: make block validity check resistant to sb bh corruption
  ext4: avoid -Wflex-array-member-not-at-end warning
  Documentation: ext4: Add fields to ext4_super_block documentation
  ext4: don't treat fhandle lookup of ea_inode as FS corruption
2025-04-13Merge tag 'fixes-2025-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblockLinus Torvalds
Pull memblock fix from Mike Rapoport:
 "Fix build of memblock tests: add missing stubs for mutex and free_reserved_area() to memblock tests"

* tag 'fixes-2025-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock tests: Fix mutex related build error
2025-04-12ext4: fix off-by-one error in do_splitArtem Sadovnikov
Syzkaller detected a use-after-free issue in ext4_insert_dentry that was caused by out-of-bounds access due to incorrect splitting in do_split.

BUG: KASAN: use-after-free in ext4_insert_dentry+0x36a/0x6d0 fs/ext4/namei.c:2109
Write of size 251 at addr ffff888074572f14 by task syz-executor335/5847
CPU: 0 UID: 0 PID: 5847 Comm: syz-executor335 Not tainted 6.12.0-rc6-syzkaller-00318-ga9cda7c0ffed #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:377 [inline]
 print_report+0x169/0x550 mm/kasan/report.c:488
 kasan_report+0x143/0x180 mm/kasan/report.c:601
 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189
 __asan_memcpy+0x40/0x70 mm/kasan/shadow.c:106
 ext4_insert_dentry+0x36a/0x6d0 fs/ext4/namei.c:2109
 add_dirent_to_buf+0x3d9/0x750 fs/ext4/namei.c:2154
 make_indexed_dir+0xf98/0x1600 fs/ext4/namei.c:2351
 ext4_add_entry+0x222a/0x25d0 fs/ext4/namei.c:2455
 ext4_add_nondir+0x8d/0x290 fs/ext4/namei.c:2796
 ext4_symlink+0x920/0xb50 fs/ext4/namei.c:3431
 vfs_symlink+0x137/0x2e0 fs/namei.c:4615
 do_symlinkat+0x222/0x3a0 fs/namei.c:4641
 __do_sys_symlink fs/namei.c:4662 [inline]
 __se_sys_symlink fs/namei.c:4660 [inline]
 __x64_sys_symlink+0x7a/0x90 fs/namei.c:4660
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
 </TASK>

The following loop is located right above the 'if' statement:

	for (i = count-1; i >= 0; i--) {
		/* is more than half of this entry in 2nd half of the block? */
		if (size + map[i].size/2 > blocksize/2)
			break;
		size += map[i].size;
		move++;
	}

'i' can go down to -1 here, in which case the sum of the active entries does not exceed half the block size; the previous behaviour would nevertheless split in half when the sum only crossed the limit at the very last entry. With too many long-name files in a single block, this could lead to out-of-bounds access and the subsequent use-after-free.

Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Cc: stable@vger.kernel.org Fixes: 5872331b3d91 ("ext4: fix potential negative array index in do_split()") Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250404082804.2567-3-a.sadovnikov@ispras.ru Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-04-12ext4: make block validity check resistant to sb bh corruptionOjaswin Mujoo
Block validity checks need to be skipped in case they are called for journal blocks, since these are part of the system's protected zone. Currently, this is done by checking inode->i_ino against sbi->s_es->s_journal_inum, which is a direct read from the ext4 sb buffer head. If someone modifies this underneath us then the s_journal_inum field might get corrupted. To guard against this, change the check to directly compare the inode with journal->j_inode. **Slight change in behavior**: During the journal init path, check_block_validity etc might be called for the journal inode when sbi->s_journal is not set yet. In this case we now proceed with ext4_inode_block_valid() instead of returning early. Since system zones have not been set yet, it is okay to proceed so we can perform basic checks on the blocks. Suggested-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/0c06bc9ebfcd6ccfed84a36e79147bf45ff5adc1.1743142920.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
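A hedged sketch of the hardened check (simplified; the real patch sits in ext4's block validity path):

    static bool ext4_is_journal_inode(struct inode *inode)
    {
            journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;

            /* Compare against the in-memory journal inode instead of an
             * inode number re-read from the sb buffer head. During journal
             * init s_journal may still be NULL; callers then simply proceed
             * with the normal validity checks.
             */
            return journal && inode == journal->j_inode;
    }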
2025-04-12ext4: avoid -Wflex-array-member-not-at-end warningGustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it globally. Use the `DEFINE_RAW_FLEX()` helper for an on-stack definition of a flexible structure where the size of the flexible-array member is known at compile-time, and refactor the rest of the code accordingly. So, with these changes, fix the following warning: fs/ext4/mballoc.c:3041:40: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/Z-SF97N3AxcIMlSi@kspp Signed-off-by: Theodore Ts'o <tytso@mit.edu>
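Usage shape of the helper, with a made-up struct for illustration (the actual change is in fs/ext4/mballoc.c):

    #include <linux/overflow.h>
    #include <linux/types.h>

    struct demo_flex {
            int used;
            u64 vals[];             /* flexible array member */
    };

    static void demo(void)
    {
            /* On-stack storage with room for 4 trailing elements, sized at
             * compile time; 'd' is a pointer to the struct, so the flex
             * struct no longer has to be embedded at the end of another.
             */
            DEFINE_RAW_FLEX(struct demo_flex, d, vals, 4);

            d->used = 0;
    }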