summaryrefslogtreecommitdiff
path: root/drivers/pci
AgeCommit message (Collapse)Author
2025-03-25Merge tag 'pm-6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These are dominated by cpufreq updates which in turn are dominated by updates related to boost support in the core and drivers and amd-pstate driver optimizations. Apart from the above, there are some cpuidle updates including a rework of the most recent idle intervals handling in the venerable menu governor that leads to significant improvements in some performance benchmarks, as the governor is now more likely to predict a shorter idle duration in some cases, and there are updates of the core device power management code, mostly related to system suspend and resume, that should help to avoid potential issues arising when the drivers of devices depending on one another want to use different optimizations. There is also a usual collection of assorted fixes and cleanups, including removal of some unused code. Specifics: - Manage sysfs attributes and boost frequencies efficiently from cpufreq core to reduce boilerplate code in drivers (Viresh Kumar) - Minor cleanups to cpufreq drivers (Aaron Kling, Benjamin Schneider, Dhananjay Ugwekar, Imran Shaik, zuoqian) - Migrate some cpufreq drivers to using for_each_present_cpu() (Jacky Bai) - cpufreq-qcom-hw DT binding fixes (Krzysztof Kozlowski) - Use str_enable_disable() helper in cpufreq_online() (Lifeng Zheng) - Optimize the amd-pstate driver to avoid cases where call paths end up calling the same writes multiple times and needlessly caching variables through code reorganization, locking overhaul and tracing adjustments (Mario Limonciello, Dhananjay Ugwekar) - Make it possible to avoid enabling capacity-aware scheduling (CAS) in the intel_pstate driver and relocate a check for out-of-band (OOB) platform handling in it to make it detect OOB before checking HWP availability (Rafael Wysocki) - Fix dbs_update() to avoid inadvertent conversions of negative integer values to unsigned int which causes CPU frequency selection to be inaccurate in some cases when the "conservative" cpufreq governor is in use (Jie Zhan) - Update the handling of the most recent idle intervals in the menu cpuidle governor to prevent useful information from being discarded by it in some cases and improve the prediction accuracy (Rafael Wysocki) - Make it possible to tell the intel_idle driver to ignore its built-in table of idle states for the given processor, clean up the handling of auto-demotion disabling on Baytrail and Cherrytrail chips in it, and update its MAINTAINERS entry (David Arcari, Artem Bityutskiy, Rafael Wysocki) - Make some cpuidle drivers use for_each_present_cpu() instead of for_each_possible_cpu() during initialization to avoid issues occurring when nosmp or maxcpus=0 are used (Jacky Bai) - Clean up the Energy Model handling code somewhat (Rafael Wysocki) - Use kfree_rcu() to simplify the handling of runtime Energy Model updates (Li RongQing) - Add an entry for the Energy Model framework to MAINTAINERS as properly maintained (Lukasz Luba) - Address RCU-related sparse warnings in the Energy Model code (Rafael Wysocki) - Remove ENERGY_MODEL dependency on SMP and allow it to be selected when DEVFREQ is set without CPUFREQ so it can be used on a wider range of systems (Jeson Gao) - Unify error handling during runtime suspend and runtime resume in the core to help drivers to implement more consistent runtime PM error handling (Rafael Wysocki) - Drop a redundant check from pm_runtime_force_resume() and rearrange documentation related to __pm_runtime_disable() (Rafael Wysocki) - Rework the handling of the "smart suspend" driver flag in the PM core to avoid issues hat may occur when drivers using it depend on some other drivers and clean up the related PM core code (Rafael Wysocki, Colin Ian King) - Fix the handling of devices with the power.direct_complete flag set if device_suspend() returns an error for at least one device to avoid situations in which some of them may not be resumed (Rafael Wysocki) - Use mutex_trylock() in hibernate_compressor_param_set() to avoid a possible deadlock that may occur if the "compressor" hibernation module parameter is accessed during the registration of a new ieee80211 device (Lizhi Xu) - Suppress sleeping parent warning in device_pm_add() in the case when new children are added under a device with the power.direct_complete set after it has been processed by device_resume() (Xu Yang) - Remove needless return in three void functions related to system wakeup (Zijun Hu) - Replace deprecated kmap_atomic() with kmap_local_page() in the hibernation core code (David Reaver) - Remove unused helper functions related to system sleep (David Alan Gilbert) - Clean up s2idle_enter() so it does not lock and unlock CPU offline in vain and update comments in it (Ulf Hansson) - Clean up broken white space in dpm_wait_for_children() (Geert Uytterhoeven) - Update the cpupower utility to fix lib version-ing in it and memory leaks in error legs, remove hard-coded values, and implement CPU physical core querying (Thomas Renninger, John B. Wyatt IV, Shuah Khan, Yiwei Lin, Zhongqiu Han)" * tag 'pm-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (139 commits) PM: sleep: Fix bit masking operation dt-bindings: cpufreq: cpufreq-qcom-hw: Narrow properties on SDX75, SA8775p and SM8650 dt-bindings: cpufreq: cpufreq-qcom-hw: Drop redundant minItems:1 dt-bindings: cpufreq: cpufreq-qcom-hw: Add missing constraint for interrupt-names dt-bindings: cpufreq: cpufreq-qcom-hw: Add QCS8300 compatible cpufreq: Init cpufreq only for present CPUs PM: sleep: Fix handling devices with direct_complete set on errors cpuidle: Init cpuidle only for present CPUs PM: clk: Remove unused pm_clk_remove() PM: sleep: core: Fix indentation in dpm_wait_for_children() PM: s2idle: Extend comment in s2idle_enter() PM: s2idle: Drop redundant locks when entering s2idle PM: sleep: Remove unused pm_generic_ wrappers cpufreq: tegra186: Share policy per cluster cpupower: Make lib versioning scheme more obvious and fix version link PM: EM: Rework the depends on for CONFIG_ENERGY_MODEL PM: EM: Address RCU-related sparse warnings cpupower: Implement CPU physical core querying pm: cpupower: remove hard-coded topology depth values pm: cpupower: Fix cmd_monitor() error legs to free cpu_topology ...
2025-03-25Merge tag 'for-linus-6.15-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - cleanup: remove an used function - add support for a XenServer specific virtual PCI device - fix the handling of a sparse Xen hypervisor symbol table - avoid warnings when building the kernel with gcc 15 - fix use of devices behind a VMD bridge when running as a Xen PV dom0 * tag 'for-linus-6.15-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: PCI/MSI: Convert pci_msi_ignore_mask to per MSI domain flag PCI: vmd: Disable MSI remapping bypass under Xen xen/pci: Do not register devices with segments >= 0x10000 xen/pciback: Remove unused pcistub_get_pci_dev xenfs/xensyms: respect hypervisor's "next" indication xen/mcelog: Add __nonstring annotations for unterminated strings xen: Add support for XenServer 6.1 platform device
2025-03-25Merge tag 'irq-msi-2025-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull MSI irq updates from Thomas Gleixner: - Switch the MSI descriptor locking to guards - Replace the broken PCI/TPH implementation, which lacks any form of serialization against concurrent modifications with a properly serialized mechanism in the PCI/MSI core code - Replace the MSI descriptor abuse in the SCSI/UFS Qualcom driver with dedicated driver internal storage * tag 'irq-msi-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/msi: Rename msi_[un]lock_descs() scsi: ufs: qcom: Remove the MSI descriptor abuse PCI/TPH: Replace the broken MSI-X control word update PCI/MSI: Provide a sane mechanism for TPH PCI: hv: Switch MSI descriptor locking to guard() PCI/MSI: Switch to MSI descriptor locking to guard() NTB/msi: Switch MSI descriptor locking to lock guard() soc: ti: ti_sci_inta_msi: Switch MSI descriptor locking to guard() genirq/msi: Use lock guards for MSI descriptor locking cleanup: Provide retain_ptr() genirq/msi: Make a few functions static
2025-03-24Merge tag 'x86-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: "x86 CPU features support: - Generate the <asm/cpufeaturemasks.h> header based on build config (H. Peter Anvin, Xin Li) - x86 CPUID parsing updates and fixes (Ahmed S. Darwish) - Introduce the 'setcpuid=' boot parameter (Brendan Jackman) - Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan Jackman) - Utilize CPU-type for CPU matching (Pawan Gupta) - Warn about unmet CPU feature dependencies (Sohil Mehta) - Prepare for new Intel Family numbers (Sohil Mehta) Percpu code: - Standardize & reorganize the x86 percpu layout and related cleanups (Brian Gerst) - Convert the stackprotector canary to a regular percpu variable (Brian Gerst) - Add a percpu subsection for cache hot data (Brian Gerst) - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak) - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak) MM: - Add support for broadcast TLB invalidation using AMD's INVLPGB instruction (Rik van Riel) - Rework ROX cache to avoid writable copy (Mike Rapoport) - PAT: restore large ROX pages after fragmentation (Kirill A. Shutemov, Mike Rapoport) - Make memremap(MEMREMAP_WB) map memory as encrypted by default (Kirill A. Shutemov) - Robustify page table initialization (Kirill A. Shutemov) - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn) - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW (Matthew Wilcox) KASLR: - x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir Singh) CPU bugs: - Implement FineIBT-BHI mitigation (Peter Zijlstra) - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta) - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta) - RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta) System calls: - Break up entry/common.c (Brian Gerst) - Move sysctls into arch/x86 (Joel Granados) Intel LAM support updates: (Maciej Wieczor-Retman) - selftests/lam: Move cpu_has_la57() to use cpuinfo flag - selftests/lam: Skip test if LAM is disabled - selftests/lam: Test get_user() LAM pointer handling AMD SMN access updates: - Add SMN offsets to exclusive region access (Mario Limonciello) - Add support for debugfs access to SMN registers (Mario Limonciello) - Have HSMP use SMN through AMD_NODE (Yazen Ghannam) Power management updates: (Patryk Wlazlyn) - Allow calling mwait_play_dead with an arbitrary hint - ACPI/processor_idle: Add FFH state handling - intel_idle: Provide the default enter_dead() handler - Eliminate mwait_play_dead_cpuid_hint() Build system: - Raise the minimum GCC version to 8.1 (Brian Gerst) - Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor) Kconfig: (Arnd Bergmann) - Add cmpxchg8b support back to Geode CPUs - Drop 32-bit "bigsmp" machine support - Rework CONFIG_GENERIC_CPU compiler flags - Drop configuration options for early 64-bit CPUs - Remove CONFIG_HIGHMEM64G support - Drop CONFIG_SWIOTLB for PAE - Drop support for CONFIG_HIGHPTE - Document CONFIG_X86_INTEL_MID as 64-bit-only - Remove old STA2x11 support - Only allow CONFIG_EISA for 32-bit Headers: - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers (Thomas Huth) Assembly code & machine code patching: - x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf) - x86/alternatives: Simplify callthunk patching (Peter Zijlstra) - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf) - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf) - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra) - x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h> (Uros Bizjak) - Use named operands in inline asm (Uros Bizjak) - Improve performance by using asm_inline() for atomic locking instructions (Uros Bizjak) Earlyprintk: - Harden early_serial (Peter Zijlstra) NMI handler: - Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus() (Waiman Long) Miscellaneous fixes and cleanups: - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner, Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly Kuznetsov, Xin Li, liuye" * tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits) zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault x86/asm: Make asm export of __ref_stack_chk_guard unconditional x86/mm: Only do broadcast flush from reclaim if pages were unmapped perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones perf/x86/intel, x86/cpu: Simplify Intel PMU initialization x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions x86/asm: Use asm_inline() instead of asm() in clwb() x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h> x86/hweight: Use asm_inline() instead of asm() x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm() x86/hweight: Use named operands in inline asm() x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP x86/head/64: Avoid Clang < 17 stack protector in startup code x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h> x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro x86/cpu/intel: Limit the non-architectural constant_tsc model checks x86/mm/pat: Replace Intel x86_model checks with VFM ones x86/cpu/intel: Fix fast string initialization for extended Families ...
2025-03-24PCI: intel-gw: Remove intel_pcie_cpu_addr()Frank Li
Remove intel_pcie_cpu_addr(), the .cpu_addr_fixup() method, because the dwc core driver already handles address translation based on the devicetree description. [bhelgaas: this does require a minor dts change, but maintainer Lei Chuan Hua <lchuanhua@maxlinear.com> confirms that the driver is only used internally to Maxlinear and internal users will update dts: https://lore.kernel.org/r/BY3PR19MB507667CE7531D863E1E5F8AEBDD82@BY3PR19MB5076.namprd19.prod.outlook.com] Link: https://lore.kernel.org/r/20250305-intel-v1-1-40db3a685490@nxp.com Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-24PCI: imx6: Remove imx_pcie_cpu_addr_fixup()Frank Li
Remove imx_pcie_cpu_addr_fixup, the .cpu_addr_fixup() method, because the dwc core driver already handles address translation based on the devicetree description. Link: https://lore.kernel.org/r/20250315201548.858189-14-helgaas@kernel.org Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Acked-by: Richard Zhu <hongxing.zhu@nxp.com>
2025-03-24PCI: dwc: Use parent_bus_offset to remove need for .cpu_addr_fixup()Frank Li
We know the parent_bus_offset, either computed from a DT reg property (the offset is the CPU physical addr - the 'config'/'addr_space' address on the parent bus) or from a .cpu_addr_fixup() (which may have used a host bridge window offset). Apply that parent_bus_offset instead of calling .cpu_addr_fixup() when programming the ATU. This assumes all intermediate addresses are at the same offset from the CPU physical addresses. [bhelgaas: commit log] Link: https://lore.kernel.org/r/20250315201548.858189-13-helgaas@kernel.org Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-24PCI: dwc: ep: Ensure proper iteration over outbound map windowsFrank Li
Most systems' PCIe outbound map windows have non-zero physical addresses, but the possibility of encountering zero increased after following commit ("PCI: dwc: Use parent_bus_offset"). 'ep->outbound_addr[n]', representing 'parent_bus_address', might be 0 on some hardware, which trims high address bits through bus fabric before sending to the PCIe controller. Replace the iteration logic with 'for_each_set_bit()' to ensure only allocated map windows are iterated when determining the ATU index from a given address. Link: https://lore.kernel.org/r/20250315201548.858189-12-helgaas@kernel.org Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-24PCI: dwc: ep: Use devicetree 'reg[addr_space]' to derive CPU -> ATU addr offsetFrank Li
Endpoint ┌───────────────────────────────────────────────┐ │ pcie-ep@5f010000 │ │ ┌────────────────┐│ │ │ Endpoint ││ │ │ PCIe ││ │ │ Controller ││ │ bus@5f000000 │ ┌────────► │ ┌──────────┐ │ │ ││dynamically │ │ │ Outbound Transfer │ ││allocated │┌─────┐ │ Bus ┼─────►│ ATU ───────┘ ││PCI Addr ││ │ │ Fabric │Bus │ ││ ││ CPU ├───►│ │Addr │ ││ ││ │CPU │ │0x8000_0000 ││ │└─────┘Addr└──────────┘ │ ││ │ 0x7000_0000 └────────────────┘│ └───────────────────────────────────────────────┘ bus@5f000000 { compatible = "simple-bus"; ranges = <0x80000000 0x0 0x70000000 0x10000000>; pcie-ep@5f010000 { reg = <0x80000000 0x10000000>; reg-names ="addr_space"; ... }; ... }; In the diagram above, CPU writes data to outbound window address 0x7000_0000, and the bus fabric maps it to 0x8000_0000. The ATU uses bus address 0x8000_0000 as input address and maps to some PCI address dynamically allocated by a PCI device driver on the host side. The pcie-ep@5f010000 'reg[addr_space]' is the parent bus address, which is the input of PCIe controller, including the ATU. Set parent_bus_offset, the offset from the CPU address to the PCIe controller input address using dw_pcie_init_parent_bus_offset(). The parent_bus_offset is not used yet, so no functional change intended. [bhelgaas: commit log] Link: https://lore.kernel.org/r/20250315201548.858189-11-helgaas@kernel.org Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-24PCI: dwc: ep: Consolidate devicetree handling in dw_pcie_ep_get_resources()Bjorn Helgaas
Consolidate devicetree resource handling in dw_pcie_ep_get_resources(). No functional change intended. Link: https://lore.kernel.org/r/20250315201548.858189-10-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Frank Li <Frank.Li@nxp.com>
2025-03-24PCI: dwc: ep: Call epc_create() early in dw_pcie_ep_init()Bjorn Helgaas
Move devm_pci_epc_create() to the beginning of dw_pcie_ep_init(). devm_pci_epc_create() is generic code that doesn't depend on any DWC resource, so moving it earlier keeps all the subsequent devicetree-related code together. Link: https://lore.kernel.org/r/20250315201548.858189-9-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Frank Li <Frank.Li@nxp.com>
2025-03-24PCI: dwc: Use devicetree 'reg[config]' to derive CPU -> ATU addr offsetFrank Li
The 'ranges' property of a PCI controller's parent can indicate address translation information. Most system use 1:1 map between CPU physical and PCI controller input addresses. But some hardware, like i.MX8QXP, doesn't use 1:1 map. See below diagram: ┌─────────┐ ┌────────────┐ ┌─────┐ │ │ IA: 0x8ff8_0000 │ │ │ CPU ├───►│ ┌────►├─────────────────┐ │ PCI │ └─────┘ │ │ │ IA: 0x8ff0_0000 │ │ │ CPU Addr │ │ ┌─►├─────────────┐ │ │ Controller │ 0x7ff8_0000─┼───┘ │ │ │ │ │ │ │ │ │ │ │ │ │ PCI Addr 0x7ff0_0000─┼──────┘ │ │ └──► IOSpace ─┼────────────► │ │ │ │ │ 0 0x7000_0000─┼────────►├─────────┐ │ │ │ └─────────┘ │ └──────► CfgSpace ─┼────────────► Bus Fabric │ │ │ 0 │ │ │ └──────────► MemSpace ─┼────────────► IA: 0x8000_0000 │ │ 0x8000_0000 └────────────┘ bus@5f000000 { compatible = "simple-bus"; #address-cells = <1>; #size-cells = <1>; ranges = <0x80000000 0x0 0x70000000 0x10000000>; pcie@5f010000 { compatible = "fsl,imx8q-pcie"; reg = <0x5f010000 0x10000>, <0x8ff00000 0x80000>; reg-names = "dbi", "config"; ... }; }; Intermediate address (IA) here means the PCIe controller input address. The pcie@5f010000 'reg[config]' address is the parent bus (PCIe controller input) address of CfgSpace. The ATU in MemSpace is not explicitly described via devicetree, so we assume the offset from CPU address to intermediate MemSpace address is the same as that for CfgSpace. We could use bus@5f000000 'ranges' for the same purpose. Set parent_bus_offset using dw_pcie_init_parent_bus_offset(). The parent_bus_offset is not used yet, so no functional change intended. Link: https://lore.kernel.org/r/20250315201548.858189-8-helgaas@kernel.org Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-24PCI: dwc: Add dw_pcie_parent_bus_offset() checking and debugFrank Li
dw_pcie_parent_bus_offset() looks up the parent bus address of a PCI controller 'reg' property in devicetree. If implemented, .cpu_addr_fixup() is a hard-coded way to get the parent bus address corresponding to a CPU physical address. Add debug code to compare the address from .cpu_addr_fixup() with the address from devicetree. If they match, warn that .cpu_addr_fixup() is redundant and should be removed; if they differ, warn that something is wrong with the devicetree. If .cpu_addr_fixup() is not implemented, the parent bus address should be identical to the CPU physical address because we previously ignored the parent bus address from devicetree. If the devicetree has a different parent bus address, warn about it being broken. [bhelgaas: split debug to separate patch for easier future revert, commit log] Link: https://lore.kernel.org/r/20250315201548.858189-7-helgaas@kernel.org Signed-off-by: Frank Li <Frank.Li@nxp.com> [bhelgaas: squash Ioana Ciornei <ioana.ciornei@nxp.com> fix for NULL pointer deref when driver doesn't supply dw_pcie_ops, e.g., layerscape-pcie https://lore.kernel.org/r/20250319134339.3114817-1-ioana.ciornei@nxp.com] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-24PCI: dwc: Add dw_pcie_parent_bus_offset()Frank Li
Return the offset from CPU physical address to the parent bus address of the specified element of the devicetree 'reg' property. [bhelgaas: cpu_phy_addr -> cpu_phys_addr, return offset, split .cpu_addr_fixup() checking and debug to separate patch] Link: https://lore.kernel.org/r/20250315201548.858189-6-helgaas@kernel.org Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-23PCI/bwctrl: Fix NULL pointer dereference on bus number exhaustionLukas Wunner
When BIOS neglects to assign bus numbers to PCI bridges, the kernel attempts to correct that during PCI device enumeration. If it runs out of bus numbers, no pci_bus is allocated and the "subordinate" pointer in the bridge's pci_dev remains NULL. The PCIe bandwidth controller erroneously does not check for a NULL subordinate pointer and dereferences it on probe. Bandwidth control of unusable devices below the bridge is of questionable utility, so simply error out instead. This mirrors what PCIe hotplug does since commit 62e4492c3063 ("PCI: Prevent NULL dereference during pciehp probe"). The PCI core emits a message with KERN_INFO severity if it has run out of bus numbers. PCIe hotplug emits an additional message with KERN_ERR severity to inform the user that hotplug functionality is disabled at the bridge. A similar message for bandwidth control does not seem merited, given that its only purpose so far is to expose an up-to-date link speed in sysfs and throttle the link speed on certain laptops with limited Thermal Design Power. So error out silently. User-visible messages: pci 0000:16:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring [...] pci_bus 0000:45: busn_res: [bus 45-74] end is updated to 74 pci 0000:16:02.0: devices behind bridge are unusable because [bus 45-74] cannot be assigned for them [...] pcieport 0000:16:02.0: pciehp: Hotplug bridge without secondary bus, ignoring [...] BUG: kernel NULL pointer dereference RIP: pcie_update_link_speed pcie_bwnotif_enable pcie_bwnotif_probe pcie_port_probe_service really_probe Fixes: 665745f27487 ("PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller") Reported-by: Wouter Bijlsma <wouter@wouterbijlsma.nl> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219906 Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Tested-by: Wouter Bijlsma <wouter@wouterbijlsma.nl> Cc: stable@vger.kernel.org # v6.13+ Link: https://lore.kernel.org/r/3b6c8d973aedc48860640a9d75d20528336f1f3c.1742669372.git.lukas@wunner.de
2025-03-23PCI: xilinx-cpm: Add cpm_csr register mapping for CPM5_HOST1 variantThippeswamy Havalige
Update the CPM5 check to include CPM5_HOST1 variant. Previously, only CPM5 was considered when mapping the "cpm_csr" register. With this change, CPM5_HOST1 is also supported, ensuring proper resource mapping for this variant. Signed-off-by: Thippeswamy Havalige <thippeswamy.havalige@amd.com> [kwilczynski: commit log] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Link: https://lore.kernel.org/r/20250317124136.1317723-1-thippeswamy.havalige@amd.com
2025-03-23PCI: brcmstb: Make const read-only arrays staticColin Ian King
Don't populate the const read-only arrays "data" and "regs" on the stack at run time, instead make them static. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> [kwilczynski: commit log, wrap overly long line to 80 columns] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Link: https://lore.kernel.org/r/20250317143456.477901-1-colin.i.king@gmail.com
2025-03-23PCI: amd-mdb: Add AMD MDB Root Port driverThippeswamy Havalige
Add support for AMD MDB (Multimedia DMA Bridge) IP core as Root Port. The Versal2 devices include MDB Module. The integrated block for MDB along with the integrated bridge can function as PCIe Root Port controller at Gen5 32-GT/s operation per lane. Bridge supports error and INTx interrupts and are handled using platform specific interrupt line in Versal2. Signed-off-by: Thippeswamy Havalige <thippeswamy.havalige@amd.com> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Link: https://lore.kernel.org/r/20250228093351.923615-4-thippeswamy.havalige@amd.com [bhelgaas: only present on ARM64-based SoCs; squash Kconfig dependency on ARM64 from Geert Uytterhoeven <geert+renesas@glider.be>: https://lore.kernel.org/r/eaef1dea7edcf146aa377d5e5c5c85a76ff56bae.1742306383.git.geert+renesas@glider.be] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> [kwilczynski: commit log, code comments and error messages clean-up, drop redundant "depends on PCI" from Kconfig, expose the error code as part of error messages where appropriatie, change "depends on" expression to match existing style from other drivers] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
2025-03-21PCI/DOE: Allow enabling DOE without CXLAlistair Francis
PCIe devices (not CXL) can support DOE as well, so allow DOE to be enabled even if CXL isn't. Link: https://lore.kernel.org/r/20250306075211.1855177-4-alistair@alistair23.me Signed-off-by: Alistair Francis <alistair@alistair23.me> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2025-03-21PCI/DOE: Expose DOE features via sysfsAlistair Francis
PCIe r6.0 added support for Data Object Exchange (DOE). When DOE is supported, the DOE Discovery Feature must be implemented per PCIe r6.1, sec 6.30.1.1. DOE allows a requester to obtain information about the other DOE features supported by the device. The kernel already queries the DOE features supported and caches the values. Expose the values in sysfs to allow user space to determine which DOE features are supported by the PCIe device. By exposing the information to userspace, tools like lspci can relay the information to users. By listing all of the supported features we can allow userspace to parse the list, which might include vendor specific features as well as yet to be supported features. As the DOE Discovery feature must always be supported we treat it as a special named attribute case. This allows the usual PCI attribute_group handling to correctly create the doe_features directory when registering pci_doe_sysfs_group (otherwise it doesn't and sysfs_add_file_to_group() will seg fault). After this patch is supported you can see something like this when attaching a DOE device: $ ls /sys/devices/pci0000:00/0000:00:02.0//doe* 0001:01 0001:02 doe_discovery Link: https://lore.kernel.org/r/20250306075211.1855177-3-alistair@alistair23.me Signed-off-by: Alistair Francis <alistair@alistair23.me> [bhelgaas: drop pci_doe_sysfs_init() stub return, make DEVICE_ATTR_RO(doe_discovery) static] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2025-03-21s390/pci: Introduce pdev->non_mappable_bars and replace VFIO_PCI_MMAPNiklas Schnelle
The ability to map PCI resources to user-space is controlled by global defines. For vfio there is VFIO_PCI_MMAP which is only disabled on s390 and controls mapping of PCI resources using vfio-pci with a fallback option via the pread()/pwrite() interface. For the PCI core there is ARCH_GENERIC_PCI_MMAP_RESOURCE which enables a generic implementation for mapping PCI resources plus the newer sysfs interface. Then there is HAVE_PCI_MMAP which can be used with custom definitions of pci_mmap_resource_range() and the historical /proc/bus/pci interface. Both mechanisms are all or nothing. For s390 mapping PCI resources is possible and useful for testing and certain applications such as QEMU's vfio-pci based user-space NVMe driver. For certain devices, however access to PCI resources via mappings to user-space is not possible and these must be excluded from the general PCI resource mapping mechanisms. Introduce pdev->non_mappable_bars to indicate that a PCI device's BARs can not be accessed via mappings to user-space. In the future this enables per-device restrictions of PCI resource mapping. For now, set this flag for all PCI devices on s390 in line with the existing, general disable of PCI resource mapping. As s390 is the only user of the VFI_PCI_MMAP Kconfig options this can already be replaced with a check of this new flag. Also add similar checks in the other code protected by HAVE_PCI_MMAP respectively ARCH_GENERIC_PCI_MMAP in preparation for enabling these for supported devices. Link: https://lore.kernel.org/lkml/20250212132808.08dcf03c.alex.williamson@redhat.com/ Link: https://lore.kernel.org/r/20250226-vfio_pci_mmap-v7-2-c5c0f1d26efd@linux.ibm.com Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-21PCI: Fix NULL dereference in SR-IOV VF creation error pathShay Drory
Clean up when virtfn setup fails to prevent NULL pointer dereference during device removal. The kernel oops below occurred due to incorrect error handling flow when pci_setup_device() fails. Add pci_iov_scan_device(), which handles virtfn allocation and setup and cleans up if pci_setup_device() fails, so pci_iov_add_virtfn() doesn't need to call pci_stop_and_remove_bus_device(). This prevents accessing partially initialized virtfn devices during removal. BUG: kernel NULL pointer dereference, address: 00000000000000d0 RIP: 0010:device_del+0x3d/0x3d0 Call Trace: pci_remove_bus_device+0x7c/0x100 pci_iov_add_virtfn+0xfa/0x200 sriov_enable+0x208/0x420 mlx5_core_sriov_configure+0x6a/0x160 [mlx5_core] sriov_numvfs_store+0xae/0x1a0 Link: https://lore.kernel.org/r/20250310084524.599225-1-shayd@nvidia.com Fixes: e3f30d563a38 ("PCI: Make pci_destroy_dev() concurrent safe") Signed-off-by: Shay Drory <shayd@nvidia.com> [bhelgaas: commit log, return ERR_PTR(-ENOMEM) directly] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Keith Busch <kbusch@kernel.org>
2025-03-21PCI/bwctrl: Fix pcie_bwctrl_select_speed() return typeIlpo Järvinen
pcie_bwctrl_select_speed() should take __fls() of the speed bit, not return it as a raw value. Instead of directly returning 2.5GT/s speed bit, simply assign the fallback speed (2.5GT/s) into supported_speeds variable to share the normal return path that calls pcie_supported_speeds2target_speed() to calculate __fls(). This code path is not very likely to execute because pcie_get_supported_speeds() should provide valid ->supported_speeds but a spec violating device could fail to synthesize any speed in pcie_get_supported_speeds(). It could also happen in case the supported_speeds intersection is empty (also a violation of the current PCIe specs). Link: https://lore.kernel.org/r/20250321163103.5145-1-ilpo.jarvinen@linux.intel.com Fixes: de9a6c8d5dbf ("PCI/bwctrl: Add pcie_set_target_speed() to set PCIe Link Speed") Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-21PCI: pciehp: Don't enable HPIE when resuming in poll modeIlpo Järvinen
PCIe hotplug can operate in poll mode without interrupt handlers using a polling kthread only. eb34da60edee ("PCI: pciehp: Disable hotplug interrupt during suspend") failed to consider that and enables HPIE (Hot-Plug Interrupt Enable) unconditionally when resuming the Port. Only set HPIE if non-poll mode is in use. This makes pcie_enable_interrupt() match how pcie_enable_notification() already handles HPIE. Link: https://lore.kernel.org/r/20250321162114.3939-1-ilpo.jarvinen@linux.intel.com Fixes: eb34da60edee ("PCI: pciehp: Disable hotplug interrupt during suspend") Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Lukas Wunner <lukas@wunner.de>
2025-03-21PCI/MSI: Convert pci_msi_ignore_mask to per MSI domain flagRoger Pau Monne
Setting pci_msi_ignore_mask inhibits the toggling of the mask bit for both MSI and MSI-X entries globally, regardless of the IRQ chip they are using. Only Xen sets the pci_msi_ignore_mask when routing physical interrupts over event channels, to prevent PCI code from attempting to toggle the maskbit, as it's Xen that controls the bit. However, the pci_msi_ignore_mask being global will affect devices that use MSI interrupts but are not routing those interrupts over event channels (not using the Xen pIRQ chip). One example is devices behind a VMD PCI bridge. In that scenario the VMD bridge configures MSI(-X) using the normal IRQ chip (the pIRQ one in the Xen case), and devices behind the bridge configure the MSI entries using indexes into the VMD bridge MSI table. The VMD bridge then demultiplexes such interrupts and delivers to the destination device(s). Having pci_msi_ignore_mask set in that scenario prevents (un)masking of MSI entries for devices behind the VMD bridge. Move the signaling of no entry masking into the MSI domain flags, as that allows setting it on a per-domain basis. Set it for the Xen MSI domain that uses the pIRQ chip, while leaving it unset for the rest of the cases. Remove pci_msi_ignore_mask at once, since it was only used by Xen code, and with Xen dropping usage the variable is unneeded. This fixes using devices behind a VMD bridge on Xen PV hardware domains. Albeit Devices behind a VMD bridge are not known to Xen, that doesn't mean Linux cannot use them. By inhibiting the usage of VMD_FEAT_CAN_BYPASS_MSI_REMAP and the removal of the pci_msi_ignore_mask bodge devices behind a VMD bridge do work fine when use from a Linux Xen hardware domain. That's the whole point of the series. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Juergen Gross <jgross@suse.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Message-ID: <20250219092059.90850-4-roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-03-21PCI: vmd: Disable MSI remapping bypass under XenRoger Pau Monne
MSI remapping bypass (directly configuring MSI entries for devices on the VMD bus) won't work under Xen, as Xen is not aware of devices in such bus, and hence cannot configure the entries using the pIRQ interface in the PV case, and in the PVH case traps won't be setup for MSI entries for such devices. Until Xen is aware of devices in the VMD bus prevent the VMD_FEAT_CAN_BYPASS_MSI_REMAP capability from being used when running as any kind of Xen guest. The MSI remapping bypass is an optional feature of VMD bridges, and hence when running under Xen it will be masked and devices will be forced to redirect its interrupts from the VMD bridge. That mode of operation must always be supported by VMD bridges and works when Xen is not aware of devices behind the VMD bridge. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Message-ID: <20250219092059.90850-3-roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2025-03-20PCI: Move cardbus IO size declarations into pci/pci.hIlpo Järvinen
For some reason, cardbus related io/mem size declarations are in linux/pci.h, whereas non-cardbus sizes are already in pci/pci.h. Move all them into one place in pci/pci.h. Link: https://lore.kernel.org/r/20250311174701.3586-4-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-20PCI: Make pci_setup_bridge() staticIlpo Järvinen
pci_setup_bridge() is only used within setup-bus.c. Therefore, make it a static function. Link: https://lore.kernel.org/r/20250311174701.3586-3-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-20PCI: Move resource reassignment func declarations into pci/pci.hIlpo Järvinen
Neither pci_reassign_bridge_resources() nor pci_reassign_resource() is used outside of the PCI subsystem. They seem to be naturally static functions but since resource fitting/assignment is split between setup-bus.c and setup-res.c, they fall into different sides of the divide and need to be declared. Move the declarations of pci_reassign_bridge_resources() and pci_reassign_resource() into pci/pci.h to keep them internal to PCI subsystem. Link: https://lore.kernel.org/r/20250311174701.3586-2-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-20PCI: Move pci_rescan_bus_bridge_resize() declaration to pci/pci.hIlpo Järvinen
pci_rescan_bus_bridge_resize() is only used by code inside PCI subsystem. The comment also falsely advertises it to be for hotplug drivers, yet the only caller is from sysfs store function. Move the function declaration into pci/pci.h. Link: https://lore.kernel.org/r/20250311174701.3586-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-20PCI: Fix BAR resizing when VF BARs are assignedIlpo Järvinen
__resource_resize_store() attempts to release all resources of the device before attempting the resize. The loop, however, only covers standard BARs (< PCI_STD_NUM_BARS). If a device has VF BARs that are assigned, pci_reassign_bridge_resources() finds the bridge window still has some assigned child resources and returns -NOENT which makes pci_resize_resource() to detect an error and abort the resize. Change the release loop to cover all resources up to VF BARs which allows the resize operation to release the bridge windows and attempt to assigned them again with the different size. If SR-IOV is enabled, disallow resize as it requires releasing also IOV resources. Link: https://lore.kernel.org/r/20250320142837.8027-1-ilpo.jarvinen@linux.intel.com Fixes: 91fa127794ac ("PCI: Expose PCIe Resizable BAR support via sysfs") Reported-by: Michał Winiarski <michal.winiarski@intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
2025-03-20PCI: Allow PCI bridges to go to D3Hot on all non-x86Manivannan Sadhasivam
Currently, pci_bridge_d3_possible() encodes a variety of decision factors when deciding whether a given bridge can be put into D3. A particular one of note is for "recent enough PCIe ports." Per Rafael [0]: "There were hardware issues related to PM on x86 platforms predating the introduction of Connected Standby in Windows. For instance, programming a port into D3hot by writing to its PMCSR might cause the PCIe link behind it to go down and the only way to revive it was to power cycle the Root Complex. And similar." Thus, this function contains a DMI-based check for post-2015 BIOS. The above factors (Windows, x86) don't really apply to non-x86 systems, and also, many such systems don't have BIOS or DMI. However, we'd like to be able to suspend bridges on non-x86 systems too. Restrict the "recent enough" check to x86. If we find further incompatibilities, it probably makes sense to expand on the deny-list approach (i.e., bridge_d3_blacklist or similar). Link: https://lore.kernel.org/r/20250320110604.v6.1.Id0a0e78ab0421b6bce51c4b0b87e6aebdfc69ec7@changeid Link: https://lore.kernel.org/linux-pci/CAJZ5v0j_6jeMAQ7eFkZBe5Yi+USGzysxAgfemYh=-zq4h5W+Qg@mail.gmail.com/ [0] Link: https://lore.kernel.org/linux-pci/20240227225442.GA249898@bhelgaas/ [1] Link: https://lore.kernel.org/linux-pci/20240828210705.GA37859@bhelgaas/ [2] [Brian: rewrite to !X86 based on Rafael's suggestions] Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-18PCI: vmd: Make vmd_dev::cfg_lock a raw_spinlock_t typeRyo Takakura
The access to the PCI config space via pci_ops::read and pci_ops::write is a low-level hardware access. The functions can be accessed with disabled interrupts even on PREEMPT_RT. The pci_lock is a raw_spinlock_t for this purpose. A spinlock_t becomes a sleeping lock on PREEMPT_RT, so it cannot be acquired with disabled interrupts. The vmd_dev::cfg_lock is accessed in the same context as the pci_lock. Make vmd_dev::cfg_lock a raw_spinlock_t type so it can be used with interrupts disabled. This was reported as: BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 Call Trace: rt_spin_lock+0x4e/0x130 vmd_pci_read+0x8d/0x100 [vmd] pci_user_read_config_byte+0x6f/0xe0 pci_read_config+0xfe/0x290 sysfs_kf_bin_read+0x68/0x90 Signed-off-by: Ryo Takakura <ryotkkr98@gmail.com> Tested-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Acked-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> [bigeasy: reword commit message] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Link: https://lore.kernel.org/r/20250218080830.ufw3IgyX@linutronix.de [kwilczynski: commit log] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> [bhelgaas: add back report info from https://lore.kernel.org/lkml/20241218115951.83062-1-ryotkkr98@gmail.com/] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-17mm: allow compound zone device pagesAlistair Popple
Zone device pages are used to represent various type of device memory managed by device drivers. Currently compound zone device pages are not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only user of higher order zone device pages and have their own page reference counting. A future change will unify FS DAX reference counting with normal page reference counting rules and remove the special FS DAX reference counting. Supporting that requires compound zone device pages. Supporting compound zone device pages requires compound_head() to distinguish between head and tail pages whilst still preserving the special struct page fields that are specific to zone device pages. A tail page is distinguished by having bit zero being set in page->compound_head, with the remaining bits pointing to the head page. For zone device pages page->compound_head is shared with page->pgmap. The page->pgmap field must be common to all pages within a folio, even if the folio spans memory sections. Therefore pgmap is the same for both head and tail pages and can be moved into the folio and we can use the standard scheme to find compound_head from a tail page. Link: https://lkml.kernel.org/r/67055d772e6102accf85161d0b57b0b3944292bf.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/mm_init: move p2pdma page refcount initialisation to p2pdmaAlistair Popple
Currently ZONE_DEVICE page reference counts are initialised by core memory management code in __init_zone_device_page() as part of the memremap() call which driver modules make to obtain ZONE_DEVICE pages. This initialises page refcounts to 1 before returning them to the driver. This was presumably done because it drivers had a reference of sorts on the page. It also ensured the page could always be mapped with vm_insert_page() for example and would never get freed (ie. have a zero refcount), freeing drivers of manipulating page reference counts. However it complicates figuring out whether or not a page is free from the mm perspective because it is no longer possible to just look at the refcount. Instead the page type must be known and if GUP is used a secondary pgmap reference is also sometimes needed. To simplify this it is desirable to remove the page reference count for the driver, so core mm can just use the refcount without having to account for page type or do other types of tracking. This is possible because drivers can always assume the page is valid as core kernel will never offline or remove the struct page. This means it is now up to drivers to initialise the page refcount as required. P2PDMA uses vm_insert_page() to map the page, and that requires a non-zero reference count when initialising the page so set that when the page is first mapped. Link: https://lkml.kernel.org/r/6aedb0ac2886dcc4503cb705273db5b3863a0b66.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17PCI: dwc: Consolidate devicetree handling in dw_pcie_host_get_resources()Bjorn Helgaas
Consolidate devicetree resource handling in dw_pcie_host_get_resources(). No functional change intended. Link: https://lore.kernel.org/r/20250315201548.858189-5-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Frank Li <Frank.Li@nxp.com>
2025-03-17PCI: dwc: Call devm_pci_alloc_host_bridge() early in dw_pcie_host_init()Frank Li
Move devm_pci_alloc_host_bridge() to the beginning of dw_pcie_host_init(). devm_pci_alloc_host_bridge() is generic code that doesn't depend on any DWC resource, so moving it earlier keeps all the subsequent devicetree-related code together. [bhelgaas: reorder earlier in series] Link: https://lore.kernel.org/r/20250315201548.858189-4-helgaas@kernel.org Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-17PCI: dwc: Rename cpu_addr to parent_bus_addr for ATU configurationFrank Li
Rename 'cpu_addr' to 'parent_bus_addr' in the DesignWare ATU configuration. The ATU translates parent bus addresses to PCI addresses, which are often the same as CPU addresses but can differ in systems where the bus fabric translates addresses before passing them to the PCIe controller. This renaming clarifies the purpose and avoids confusion. Link: https://lore.kernel.org/r/20250315201548.858189-3-helgaas@kernel.org Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-17PCI: dwc: Use resource start as ioremap() input in dw_pcie_pme_turn_off()Frank Li
The msg_res region translates writes into PCIe Message TLPs. Previously we mapped this region using atu.cpu_addr, the input address programmed into the ATU. "cpu_addr" is a misnomer because when a bus fabric translates addresses between the CPU and the ATU, the ATU input address is different from the CPU address. A future patch will rename "cpu_addr" and correct the value to be the ATU input address instead of the CPU physical address. Map the msg_res region before writing to it using the msg_res resource start, a CPU physical address. Link: https://lore.kernel.org/r/20250315201548.858189-2-helgaas@kernel.org Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-16PCI: histb: Fix an error handling path in histb_pcie_probe()Christophe JAILLET
If an error occurs after a successful phy_init() call, then phy_exit() should be called. Add the missing call, as already done in the remove function. Fixes: bbd11bddb398 ("PCI: hisi: Add HiSilicon STB SoC PCIe controller driver") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> [kwilczynski: remove unnecessary hipcie->phy NULL check from histb_pcie_probe() and squash a patch that removes similar NULL check for hipcie-phy from histb_pcie_remove() from https://lore.kernel.org/linux-pci/c369b5d25e17a44984ae5a889ccc28a59a0737f7.1742058005.git.christophe.jaillet@wanadoo.fr] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Link: https://lore.kernel.org/r/8301fc15cdea5d2dac21f57613e8e6922fb1ad95.1740854531.git.christophe.jaillet@wanadoo.fr
2025-03-15PCI: imx6: Use devm_clk_bulk_get_all() to fetch clocksRichard Zhu
Use devm_clk_bulk_get_all() helper to simplify clock handle code. No functional changes intended. Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com> [kwilczynski: commit log, refactor to use dev_err_probe()] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/r/20250226025628.1681206-1-hongxing.zhu@nxp.com
2025-03-15PCI: imx6: Identify controller via 'linux,pci-domain', not addressRichard Zhu
Instead of testing the controller register address to distinguish controller 1 from controller 0 on i.MX8MQ platforms, use the PCI domain number, which comes from the devicetree 'linux,pci-domain' property. All relevant devicetrees should already supply 'linux,pci-domain', which was added by c0b70f05c87f ("arm64: dts: imx8mq: use_dt_domains for pci node"). Instead of being set directly in imx_pcie_probe(), pci->dbi_base will be set by the DWC core in dw_pcie_get_resources(). No functional changes intended. Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> [bhelgaas: commit log] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://lore.kernel.org/r/20250226024256.1678103-3-hongxing.zhu@nxp.com
2025-03-14PCI: dw-rockchip: Hide broken ATS capability for RK3588 running in EP modeNiklas Cassel
When running the RK3588 in Endpoint mode, with an Intel host with IOMMU enabled, the host side prints: DMAR: VT-d detected Invalidation Time-out Error: SID 0 When running the RK3588 in Endpoint mode, with an AMD host with IOMMU enabled, the host side prints: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=63:00.0 address=0x42b5b01a0] Rockchip has confirmed that the ATS support for RK3588 only works when running the PCIe controller in Root Complex (RC) mode, see: https://lore.kernel.org/linux-pci/93cdce39-1ae6-4939-a3fc-db10be7564e5@rock-chips.com Usually, to handle these issues, we add a quirk for the PCI vendor and device ID in drivers/pci/quirks.c with quirk_no_ats(). That is because we cannot usually modify the capabilities on the EP side. In this case, we can modify the capabilities on the EP side. Thus, hide the broken ATS capability on RK3588 when running in EP mode. That way, we don't need any quirk on the host side, and we see no errors on the host side, and we can run pci_endpoint_test successfully, with the IOMMU enabled on the host side. Acked-by: Shawn Lin <shawn.lin@rock-chips.com> Signed-off-by: Niklas Cassel <cassel@kernel.org> [kwilczynski: commit log, tidy up code comments and error message] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Link: https://lore.kernel.org/r/20250310094826.842681-6-cassel@kernel.org
2025-03-14PCI: dwc: ep: Add dw_pcie_ep_hide_ext_capability()Niklas Cassel
Add dw_pcie_ep_hide_ext_capability() which can be used by an endpoint controller driver to hide a capability. This can be useful to hide a capability that is buggy, such that the host side does not try to enable the buggy capability. Suggested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Link: https://lore.kernel.org/r/20250310094826.842681-5-cassel@kernel.org
2025-03-14PCI: dwc: ep: Return -ENOMEM for allocation failuresDan Carpenter
If the bitmap or memory allocations fail, then dw_pcie_ep_init_registers() will incorrectly return a success. Return -ENOMEM instead. Fixes: 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure for drivers requiring refclk from host") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> [kwilczynski: commit log] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Krzysztof Wilczyński <kw@linux.com> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Link: https://lore.kernel.org/r/36dcb6fc-f292-4dd5-bd45-a8c6f9dc3df7@stanley.mountain
2025-03-14PCI: Check BAR index for validityPhilipp Stanner
Many functions in PCI use accessor macros such as pci_resource_len(), which take a BAR index. That index, however, is never checked for validity, potentially resulting in undefined behavior by overflowing the array pci_dev.resource in the macro pci_resource_n(). Since many users of those macros directly assign the accessed value to an unsigned integer, the macros cannot be changed easily anymore to return -EINVAL for invalid indexes. Consequently, the problem has to be mitigated in higher layers. Add pci_bar_index_valid(). Use it where appropriate. Link: https://lore.kernel.org/r/20250312080634.13731-4-phasta@kernel.org Closes: https://lore.kernel.org/all/adb53b1f-29e1-3d14-0e61-351fd2d3ff0d@linux.intel.com/ Reported-by: Bingbu Cao <bingbu.cao@linux.intel.com> Signed-off-by: Philipp Stanner <phasta@kernel.org> [kwilczynski: correct if-statement condition the pci_bar_index_is_valid() helper function uses, tidy up code comments] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> [bhelgaas: fix typo] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2025-03-14PCI: pciehp: Avoid unnecessary device replacement checkLukas Wunner
Hot-removal of nested PCI hotplug ports suffers from a long-standing race condition which can lead to a deadlock: A parent hotplug port acquires pci_lock_rescan_remove(), then waits for pciehp to unbind from a child hotplug port. Meanwhile that child hotplug port tries to acquire pci_lock_rescan_remove() as well in order to remove its own children. The deadlock only occurs if the parent acquires pci_lock_rescan_remove() first, not if the child happens to acquire it first. Several workarounds to avoid the issue have been proposed and discarded over the years, e.g.: https://lore.kernel.org/r/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/ A proper fix is being worked on, but needs more time as it is nontrivial and necessarily intrusive. Recent commit 9d573d19547b ("PCI: pciehp: Detect device replacement during system sleep") provokes more frequent occurrence of the deadlock when removing more than one Thunderbolt device during system sleep. The commit sought to detect device replacement, but also triggered on device removal. Differentiating reliably between replacement and removal is impossible because pci_get_dsn() returns 0 both if the device was removed, as well as if it was replaced with one lacking a Device Serial Number. Avoid the more frequent occurrence of the deadlock by checking whether the hotplug port itself was hot-removed. If so, there's no sense in checking whether its child device was replaced. This works because the ->resume_noirq() callback is invoked in top-down order for the entire hierarchy: A parent hotplug port detecting device replacement (or removal) marks all children as removed using pci_dev_set_disconnected() and a child hotplug port can then reliably detect being removed. Link: https://lore.kernel.org/r/02f166e24c87d6cde4085865cce9adfdfd969688.1741674172.git.lukas@wunner.de Fixes: 9d573d19547b ("PCI: pciehp: Detect device replacement during system sleep") Reported-by: Kenneth Crudup <kenny@panix.com> Closes: https://lore.kernel.org/r/83d9302a-f743-43e4-9de2-2dd66d91ab5b@panix.com/ Reported-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com> Closes: https://lore.kernel.org/r/20240926125909.2362244-1-acelan.kao@canonical.com/ Tested-by: Kenneth Crudup <kenny@panix.com> Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Cc: stable@vger.kernel.org # v6.11+
2025-03-14PCI: Fix wrong length of devres arrayPhilipp Stanner
The array for the iomapping cookie addresses has a length of PCI_STD_NUM_BARS. This constant, however, only describes standard BARs; while PCI can allow for additional, special BARs. The total number of PCI resources is described by constant PCI_NUM_RESOURCES, which is also used in, e.g., pci_select_bars(). Thus, the devres array has so far been too small. Change the length of the devres array to PCI_NUM_RESOURCES. Link: https://lore.kernel.org/r/20250312080634.13731-3-phasta@kernel.org Fixes: bbaff68bf4a4 ("PCI: Add managed partial-BAR request and map infrastructure") Signed-off-by: Philipp Stanner <phasta@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Cc: stable@vger.kernel.org # v6.11+
2025-03-13PCI/TPH: Replace the broken MSI-X control word updateThomas Gleixner
The driver walks the MSI descriptors to test whether a descriptor exists for a given index. That's just abuse of the MSI internals. The same test can be done with a single function call by looking up whether there is a Linux interrupt number assigned at the index. What's worse is that the function is completely unserialized against modifications of the MSI-X control by operations issued from the interrupt core. It also brings the PCI/MSI-X internal cached control word out of sync. Remove the trainwreck and invoke the function provided by the PCI/MSI core to update it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/all/20250313130321.898592817@linutronix.de
2025-03-13PCI/MSI: Provide a sane mechanism for TPHThomas Gleixner
The PCI/TPH driver fiddles with the MSI-X control word of an active interrupt completely unserialized against concurrent operations issued from the interrupt core. It also brings the PCI/MSI-X internal cached control word out of sync. Provide a function, which has the required serialization and keeps the control word cache in sync. Unfortunately this requires to look up and lock the interrupt descriptor, which should be only done in the interrupt core code. But confining this particular oddity in the PCI/MSI core is the lesser of all evil. A interrupt core implementation would require a larger pile of infrastructure and indirections for dubious value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://lore.kernel.org/all/20250313130321.822790423@linutronix.de