summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2025-04-10x86/sev: Add missing RIP_REL_REF() invocations during sme_enable()Kevin Loughlin
[ Upstream commit 72dafb567760320f2de7447cd6e979bf9d4e5d17 ] The following commit: 1c811d403afd ("x86/sev: Fix position dependent variable references in startup code") introduced RIP_REL_REF() to force RIP-relative accesses to global variables, as needed to prevent crashes during early SEV/SME startup code. For completeness, RIP_REL_REF() should be used with additional variables during sme_enable(): https://lore.kernel.org/all/CAMj1kXHnA0fJu6zh634=fbJswp59kSRAbhW+ubDGj1+NYwZJ-Q@mail.gmail.com/ Access these vars with RIP_REL_REF() to prevent problem reoccurence. Fixes: 1c811d403afd ("x86/sev: Fix position dependent variable references in startup code") Signed-off-by: Kevin Loughlin <kevinloughlin@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20241122202322.977678-1-kevinloughlin@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10x86/platform: Only allow CONFIG_EISA for 32-bitArnd Bergmann
[ Upstream commit 976ba8da2f3c2f1e997f4f620da83ae65c0e3728 ] The CONFIG_EISA menu was cleaned up in 2018, but this inadvertently brought the option back on 64-bit machines: ISA remains guarded by a CONFIG_X86_32 check, but EISA no longer depends on ISA. The last Intel machines ith EISA support used a 82375EB PCI/EISA bridge from 1993 that could be paired with the 440FX chipset on early Pentium-II CPUs, long before the first x86-64 products. Fixes: 6630a8e50105 ("eisa: consolidate EISA Kconfig entry in drivers/eisa") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250226213714.4040853-11-arnd@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10x86/fpu: Avoid copying dynamic FP state from init_task in arch_dup_task_struct()Benjamin Berg
[ Upstream commit 5d3b81d4d8520efe888536b6906dc10fd1a228a8 ] The init_task instance of struct task_struct is statically allocated and may not contain the full FP state for userspace. As such, limit the copy to the valid area of both init_task and 'dst' and ensure all memory is initialized. Note that the FP state is only needed for userspace, and as such it is entirely reasonable for init_task to not contain parts of it. Fixes: 5aaeb5c01c5b ("x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86") Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250226133136.816901-1-benjamin@sipsolutions.net ---- v2: - Fix code if arch_task_struct_size < sizeof(init_task) by using memcpy_and_pad. Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10x86/fpu: Fix guest FPU state buffer allocation sizeStanislav Spassov
[ Upstream commit 1937e18cc3cf27e2b3ef70e8c161437051ab7608 ] Ongoing work on an optimization to batch-preallocate vCPU state buffers for KVM revealed a mismatch between the allocation sizes used in fpu_alloc_guest_fpstate() and fpstate_realloc(). While the former allocates a buffer sized to fit the default set of XSAVE features in UABI form (as per fpu_user_cfg), the latter uses its ksize argument derived (for the requested set of features) in the same way as the sizes found in fpu_kernel_cfg, i.e. using the compacted in-kernel representation. The correct size to use for guest FPU state should indeed be the kernel one as seen in fpstate_realloc(). The original issue likely went unnoticed through a combination of UABI size typically being larger than or equal to kernel size, and/or both amounting to the same number of allocated 4K pages. Fixes: 69f6ed1d14c6 ("x86/fpu: Provide infrastructure for KVM FPU cleanup") Signed-off-by: Stanislav Spassov <stanspas@amazon.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250218141045.85201-1-stanspas@amazon.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10x86/mm/pat: cpa-test: fix length for CPA_ARRAY testMike Rapoport (Microsoft)
[ Upstream commit 33ea120582a638b2f2e380a50686c2b1d7cce795 ] The CPA_ARRAY test always uses len[1] as numpages argument to change_page_attr_set() although the addresses array is different each iteration of the test loop. Replace len[1] with len[i] to have numpages matching the addresses array. Fixes: ecc729f1f471 ("x86/mm/cpa: Add ARRAY and PAGES_ARRAY selftests") Signed-off-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250126074733.1384926-2-rppt@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22x86/vmware: Parse MP tables for SEV-SNP enabled guests under VMware hypervisorsAjay Kaher
[ Upstream commit a2ab25529bbcea51b5e01dded79f45aeb94f644a ] Under VMware hypervisors, SEV-SNP enabled VMs are fundamentally able to boot without UEFI, but this regressed a year ago due to: 0f4a1e80989a ("x86/sev: Skip ROM range scans and validation for SEV-SNP guests") In this case, mpparse_find_mptable() has to be called to parse MP tables which contains the necessary boot information. [ mingo: Updated the changelog. ] Fixes: 0f4a1e80989a ("x86/sev: Skip ROM range scans and validation for SEV-SNP guests") Co-developed-by: Ye Li <ye.li@broadcom.com> Signed-off-by: Ye Li <ye.li@broadcom.com> Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ye Li <ye.li@broadcom.com> Reviewed-by: Kevin Loughlin <kevinloughlin@google.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250313173111.10918-1-ajay.kaher@broadcom.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22x86/microcode/AMD: Fix out-of-bounds on systems with CPU-less NUMA nodesFlorent Revest
commit e3e89178a9f4a80092578af3ff3c8478f9187d59 upstream. Currently, load_microcode_amd() iterates over all NUMA nodes, retrieves their CPU masks and unconditionally accesses per-CPU data for the first CPU of each mask. According to Documentation/admin-guide/mm/numaperf.rst: "Some memory may share the same node as a CPU, and others are provided as memory only nodes." Therefore, some node CPU masks may be empty and wouldn't have a "first CPU". On a machine with far memory (and therefore CPU-less NUMA nodes): - cpumask_of_node(nid) is 0 - cpumask_first(0) is CONFIG_NR_CPUS - cpu_data(CONFIG_NR_CPUS) accesses the cpu_info per-CPU array at an index that is 1 out of bounds This does not have any security implications since flashing microcode is a privileged operation but I believe this has reliability implications by potentially corrupting memory while flashing a microcode update. When booting with CONFIG_UBSAN_BOUNDS=y on an AMD machine that flashes a microcode update. I get the following splat: UBSAN: array-index-out-of-bounds in arch/x86/kernel/cpu/microcode/amd.c:X:Y index 512 is out of range for type 'unsigned long[512]' [...] Call Trace: dump_stack __ubsan_handle_out_of_bounds load_microcode_amd request_microcode_amd reload_store kernfs_fop_write_iter vfs_write ksys_write do_syscall_64 entry_SYSCALL_64_after_hwframe Change the loop to go over only NUMA nodes which have CPUs before determining whether the first CPU on the respective node needs microcode update. [ bp: Massage commit message, fix typo. ] Fixes: 7ff6edf4fef3 ("x86/microcode/AMD: Fix mixed steppings support") Signed-off-by: Florent Revest <revest@chromium.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250310144243.861978-1-revest@chromium.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-22vhost: return task creation error instead of NULLKeith Busch
[ Upstream commit cb380909ae3b1ebf14d6a455a4f92d7916d790cb ] Lets callers distinguish why the vhost task creation failed. No one currently cares why it failed, so no real runtime change from this patch, but that will not be the case for long. Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-2-kbusch@meta.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22perf/x86/rapl: Add support for Intel Arrow Lake UAaron Ma
[ Upstream commit 68a9b0e313302451468c0b0eda53c383fa51a8f4 ] Add Arrow Lake U model for RAPL: $ ls -1 /sys/devices/power/events/ energy-cores energy-cores.scale energy-cores.unit energy-gpu energy-gpu.scale energy-gpu.unit energy-pkg energy-pkg.scale energy-pkg.unit energy-psys energy-psys.scale energy-psys.unit The same output as ArrowLake: $ perf stat -a -I 1000 --per-socket -e power/energy-pkg/ Signed-off-by: Aaron Ma <aaron.ma@canonical.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20241224145516.349028-1-aaron.ma@canonical.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22x86/irq: Define trace events conditionallyArnd Bergmann
[ Upstream commit 9de7695925d5d2d2085681ba935857246eb2817d ] When both of X86_LOCAL_APIC and X86_THERMAL_VECTOR are disabled, the irq tracing produces a W=1 build warning for the tracing definitions: In file included from include/trace/trace_events.h:27, from include/trace/define_trace.h:113, from arch/x86/include/asm/trace/irq_vectors.h:383, from arch/x86/kernel/irq.c:29: include/trace/stages/init.h:2:23: error: 'str__irq_vectors__trace_system_name' defined but not used [-Werror=unused-const-variable=] Make the tracepoints conditional on the same symbosl that guard their usage. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250225213236.3141752-1-arnd@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22x86/of: Don't use DTB for SMP setup if ACPI is enabledDmytro Maluka
[ Upstream commit 96f41f644c4885761b0d117fc36dc5dcf92e15ec ] There are cases when it is useful to use both ACPI and DTB provided by the bootloader, however in such cases we should make sure to prevent conflicts between the two. Namely, don't try to use DTB for SMP setup if ACPI is enabled. Precisely, this prevents at least: - incorrectly calling register_lapic_address(APIC_DEFAULT_PHYS_BASE) after the LAPIC was already successfully enumerated via ACPI, causing noisy kernel warnings and probably potential real issues as well - failed IOAPIC setup in the case when IOAPIC is enumerated via mptable instead of ACPI (e.g. with acpi=noirq), due to mpparse_parse_smp_config() overridden by x86_dtb_parse_smp_config() Signed-off-by: Dmytro Maluka <dmaluka@chromium.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250105172741.3476758-2-dmaluka@chromium.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-22perf/x86/intel: Use better start period for frequency modeKan Liang
[ Upstream commit a26b24b2e21f6222635a95426b9ef9eec63d69b1 ] Freqency mode is the current default mode of Linux perf. A period of 1 is used as a starting period. The period is auto-adjusted on each tick or an overflow, to meet the frequency target. The start period of 1 is too low and may trigger some issues: - Many HWs do not support period 1 well. https://lore.kernel.org/lkml/875xs2oh69.ffs@tglx/ - For an event that occurs frequently, period 1 is too far away from the real period. Lots of samples are generated at the beginning. The distribution of samples may not be even. - A low starting period for frequently occurring events also challenges virtualization, which has a longer path to handle a PMI. The limit_period value only checks the minimum acceptable value for HW. It cannot be used to set the start period, because some events may need a very low period. The limit_period cannot be set too high. It doesn't help with the events that occur frequently. It's hard to find a universal starting period for all events. The idea implemented by this patch is to only give an estimate for the popular HW and HW cache events. For the rest of the events, start from the lowest possible recommended value. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250117151913.3043942-3-kan.liang@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13x86/mm: Don't disable PCID when INVLPG has been fixed by microcodeXi Ruoyao
commit f24f669d03f884a6ef95cca84317d0f329e93961 upstream. Per the "Processor Specification Update" documentations referred by the intel-microcode-20240312 release note, this microcode release has fixed the issue for all affected models. So don't disable PCID if the microcode is new enough. The precise minimum microcode revision fixing the issue was provided by Pawan Intel. [ dhansen: comment and changelog tweaks ] Signed-off-by: Xi Ruoyao <xry111@xry111.site> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://lore.kernel.org/all/168436059559.404.13934972543631851306.tip-bot2@tip-bot2/ Link: https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/releases/tag/microcode-20240312 Link: https://cdrdv2.intel.com/v1/dl/getContent/740518 # RPL042, rev. 13 Link: https://cdrdv2.intel.com/v1/dl/getContent/682436 # ADL063, rev. 24 Link: https://lore.kernel.org/all/20240325231300.qrltbzf6twm43ftb@desk/ Link: https://lore.kernel.org/all/20240522020625.69418-1-xry111%40xry111.site Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13KVM: x86: Explicitly zero EAX and EBX when PERFMON_V2 isn't supported by KVMXiaoyao Li
commit f9dc8fb3afc968042bdaf4b6e445a9272071c9f3 upstream. Fix a goof where KVM sets CPUID.0x80000022.EAX to CPUID.0x80000022.EBX instead of zeroing both when PERFMON_V2 isn't supported by KVM. In practice, barring a buggy CPU (or vCPU model when running nested) only the !enable_pmu case is affected, as KVM always supports PERFMON_V2 if it's available in hardware, i.e. CPUID.0x80000022.EBX will be '0' if PERFMON_V2 is unsupported. For the !enable_pmu case, the bug is relatively benign as KVM will refuse to enable PMU capabilities, but a VMM that reflects KVM's supported CPUID into the guest could inadvertently induce #GPs in the guest due to advertising support for MSRs that KVM refuses to emulate. Fixes: 94cdeebd8211 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250304082314.472202-3-xiaoyao.li@intel.com [sean: massage shortlog and changelog, tag for stable] Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQsSean Christopherson
commit 189ecdb3e112da703ac0699f4ec76aa78122f911 upstream. Snapshot the host's DEBUGCTL after disabling IRQs, as perf can toggle debugctl bits from IRQ context, e.g. when enabling/disabling events via smp_call_function_single(). Taking the snapshot (long) before IRQs are disabled could result in KVM effectively clobbering DEBUGCTL due to using a stale snapshot. Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13KVM: SVM: Manually context switch DEBUGCTL if LBR virtualization is disabledSean Christopherson
commit 433265870ab3455b418885bff48fa5fd02f7e448 upstream. Manually load the guest's DEBUGCTL prior to VMRUN (and restore the host's value on #VMEXIT) if it diverges from the host's value and LBR virtualization is disabled, as hardware only context switches DEBUGCTL if LBR virtualization is fully enabled. Running the guest with the host's value has likely been mildly problematic for quite some time, e.g. it will result in undesirable behavior if BTF diverges (with the caveat that KVM now suppresses guest BTF due to lack of support). But the bug became fatal with the introduction of Bus Lock Trap ("Detect" in kernel paralance) support for AMD (commit 408eb7417a92 ("x86/bus_lock: Add support for AMD")), as a bus lock in the guest will trigger an unexpected #DB. Note, suppressing the bus lock #DB, i.e. simply resuming the guest without injecting a #DB, is not an option. It wouldn't address the general issue with DEBUGCTL, e.g. for things like BTF, and there are other guest-visible side effects if BusLockTrap is left enabled. If BusLockTrap is disabled, then DR6.BLD is reserved-to-1; any attempts to clear it by software are ignored. But if BusLockTrap is enabled, software can clear DR6.BLD: Software enables bus lock trap by setting DebugCtl MSR[BLCKDB] (bit 2) to 1. When bus lock trap is enabled, ... The processor indicates that this #DB was caused by a bus lock by clearing DR6[BLD] (bit 11). DR6[11] previously had been defined to be always 1. and clearing DR6.BLD is "sticky" in that it's not set (i.e. lowered) by other #DBs: All other #DB exceptions leave DR6[BLD] unmodified E.g. leaving BusLockTrap enable can confuse a legacy guest that writes '0' to reset DR6. Reported-by: rangemachine@gmail.com Reported-by: whanos@sergal.fun Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219787 Closes: https://lore.kernel.org/all/bug-219787-28872@https.bugzilla.kernel.org%2F Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13KVM: x86: Snapshot the host's DEBUGCTL in common x86Sean Christopherson
commit fb71c795935652fa20eaf9517ca9547f5af99a76 upstream. Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in common x86, so that SVM can also use the snapshot. Opportunistically change the field to a u64. While bits 63:32 are reserved on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13KVM: SVM: Suppress DEBUGCTL.BTF on AMDSean Christopherson
commit d0eac42f5cecce009d315655bee341304fbe075e upstream. Mark BTF as reserved in DEBUGCTL on AMD, as KVM doesn't actually support BTF, and fully enabling BTF virtualization is non-trivial due to interactions with the emulator, guest_debug, #DB interception, nested SVM, etc. Don't inject #GP if the guest attempts to set BTF, as there's no way to communicate lack of support to the guest, and instead suppress the flag and treat the WRMSR as (partially) unsupported. In short, make KVM behave the same on AMD and Intel (VMX already squashes BTF). Note, due to other bugs in KVM's handling of DEBUGCTL, the only way BTF has "worked" in any capacity is if the guest simultaneously enables LBRs. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13KVM: SVM: Drop DEBUGCTL[5:2] from guest's effective valueSean Christopherson
commit ee89e8013383d50a27ea9bf3c8a69eed6799856f upstream. Drop bits 5:2 from the guest's effective DEBUGCTL value, as AMD changed the architectural behavior of the bits and broke backwards compatibility. On CPUs without BusLockTrap (or at least, in APMs from before ~2023), bits 5:2 controlled the behavior of external pins: Performance-Monitoring/Breakpoint Pin-Control (PBi)—Bits 5:2, read/write. Software uses thesebits to control the type of information reported by the four external performance-monitoring/breakpoint pins on the processor. When a PBi bit is cleared to 0, the corresponding external pin (BPi) reports performance-monitor information. When a PBi bit is set to 1, the corresponding external pin (BPi) reports breakpoint information. With the introduction of BusLockTrap, presumably to be compatible with Intel CPUs, AMD redefined bit 2 to be BLCKDB: Bus Lock #DB Trap (BLCKDB)—Bit 2, read/write. Software sets this bit to enable generation of a #DB trap following successful execution of a bus lock when CPL is > 0. and redefined bits 5:3 (and bit 6) as "6:3 Reserved MBZ". Ideally, KVM would treat bits 5:2 as reserved. Defer that change to a feature cleanup to avoid breaking existing guest in LTS kernels. For now, drop the bits to retain backwards compatibility (of a sort). Note, dropping bits 5:2 is still a guest-visible change, e.g. if the guest is enabling LBRs *and* the legacy PBi bits, then the state of the PBi bits is visible to the guest, whereas now the guest will always see '0'. Reported-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: stable@vger.kernel.org Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250227222411.3490595-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13KVM: SVM: Save host DR masks on CPUs with DebugSwapSean Christopherson
commit b2653cd3b75f62f29b72df4070e20357acb52bc4 upstream. When running SEV-SNP guests on a CPU that supports DebugSwap, always save the host's DR0..DR3 mask MSR values irrespective of whether or not DebugSwap is enabled, to ensure the host values aren't clobbered by the CPU. And for now, also save DR0..DR3, even though doing so isn't necessary (see below). SVM_VMGEXIT_AP_CREATE is deeply flawed in that it allows the *guest* to create a VMSA with guest-controlled SEV_FEATURES. A well behaved guest can inform the hypervisor, i.e. KVM, of its "requested" features, but on CPUs without ALLOWED_SEV_FEATURES support, nothing prevents the guest from lying about which SEV features are being enabled (or not!). If a misbehaving guest enables DebugSwap in a secondary vCPU's VMSA, the CPU will load the DR0..DR3 mask MSRs on #VMEXIT, i.e. will clobber the MSRs with '0' if KVM doesn't save its desired value. Note, DR0..DR3 themselves are "ok", as DR7 is reset on #VMEXIT, and KVM restores all DRs in common x86 code as needed via hw_breakpoint_restore(). I.e. there is no risk of host DR0..DR3 being clobbered (when it matters). However, there is a flaw in the opposite direction; because the guest can lie about enabling DebugSwap, i.e. can *disable* DebugSwap without KVM's knowledge, KVM must not rely on the CPU to restore DRs. Defer fixing that wart, as it's more of a documentation issue than a bug in the code. Note, KVM added support for DebugSwap on commit d1f85fbe836e ("KVM: SEV: Enable data breakpoints in SEV-ES"), but that is not an appropriate Fixes, as the underlying flaw exists in hardware, not in KVM. I.e. all kernels that support SEV-SNP need to be patched, not just kernels with KVM's full support for DebugSwap (ignoring that DebugSwap support landed first). Opportunistically fix an incorrect statement in the comment; on CPUs without DebugSwap, the CPU does NOT save or load debug registers, i.e. Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Cc: stable@vger.kernel.org Cc: Naveen N Rao <naveen@kernel.org> Cc: Kim Phillips <kim.phillips@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Alexey Kardashevskiy <aik@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250227012541.3234589-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13KVM: SVM: Set RFLAGS.IF=1 in C code, to get VMRUN out of the STI shadowSean Christopherson
commit be45bc4eff33d9a7dae84a2150f242a91a617402 upstream. Enable/disable local IRQs, i.e. set/clear RFLAGS.IF, in the common svm_vcpu_enter_exit() just after/before guest_state_{enter,exit}_irqoff() so that VMRUN is not executed in an STI shadow. AMD CPUs have a quirk (some would say "bug"), where the STI shadow bleeds into the guest's intr_state field if a #VMEXIT occurs during injection of an event, i.e. if the VMRUN doesn't complete before the subsequent #VMEXIT. The spurious "interrupts masked" state is relatively benign, as it only occurs during event injection and is transient. Because KVM is already injecting an event, the guest can't be in HLT, and if KVM is querying IRQ blocking for injection, then KVM would need to force an immediate exit anyways since injecting multiple events is impossible. However, because KVM copies int_state verbatim from vmcb02 to vmcb12, the spurious STI shadow is visible to L1 when running a nested VM, which can trip sanity checks, e.g. in VMware's VMM. Hoist the STI+CLI all the way to C code, as the aforementioned calls to guest_state_{enter,exit}_irqoff() already inform lockdep that IRQs are enabled/disabled, and taking a fault on VMRUN with RFLAGS.IF=1 is already possible. I.e. if there's kernel code that is confused by running with RFLAGS.IF=1, then it's already a problem. In practice, since GIF=0 also blocks NMIs, the only change in exposure to non-KVM code (relative to surrounding VMRUN with STI+CLI) is exception handling code, and except for the kvm_rebooting=1 case, all exception in the core VM-Enter/VM-Exit path are fatal. Use the "raw" variants to enable/disable IRQs to avoid tracing in the "no instrumentation" code; the guest state helpers also take care of tracing IRQ state. Oppurtunstically document why KVM needs to do STI in the first place. Reported-by: Doug Covelli <doug.covelli@broadcom.com> Closes: https://lore.kernel.org/all/CADH9ctBs1YPmE4aCfGPNBwA10cA8RuAk2gO7542DjMZgs4uzJQ@mail.gmail.com Fixes: f14eec0a3203 ("KVM: SVM: move more vmentry code to assembly") Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250224165442.2338294-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13x86/sgx: Fix size overflows in sgx_encl_create()Jarkko Sakkinen
[ Upstream commit 0d3e0dfd68fb9e6b0ec865be9f3377cc3ff55733 ] The total size calculated for EPC can overflow u64 given the added up page for SECS. Further, the total size calculated for shmem can overflow even when the EPC size stays within limits of u64, given that it adds the extra space for 128 byte PCMD structures (one for each page). Address this by pre-evaluating the micro-architectural requirement of SGX: the address space size must be power of two. This is eventually checked up by ECREATE but the pre-check has the additional benefit of making sure that there is some space for additional data. Fixes: 888d24911787 ("x86/sgx: Add SGX_IOC_ENCLAVE_CREATE") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20250305050006.43896-1-jarkko@kernel.org Closes: https://lore.kernel.org/linux-sgx/c87e01a0-e7dd-4749-a348-0980d3444f04@stanley.mountain/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13x86/cpu: Properly parse CPUID leaf 0x2 TLB descriptor 0x63Ahmed S. Darwish
commit f6bdaab79ee4228a143ee1b4cb80416d6ffc0c63 upstream. CPUID leaf 0x2's one-byte TLB descriptors report the number of entries for specific TLB types, among other properties. Typically, each emitted descriptor implies the same number of entries for its respective TLB type(s). An emitted 0x63 descriptor is an exception: it implies 4 data TLB entries for 1GB pages and 32 data TLB entries for 2MB or 4MB pages. For the TLB descriptors parsing code, the entry count for 1GB pages is encoded at the intel_tlb_table[] mapping, but the 2MB/4MB entry count is totally ignored. Update leaf 0x2's parsing logic 0x2 to account for 32 data TLB entries for 2MB/4MB pages implied by the 0x63 descriptor. Fixes: e0ba94f14f74 ("x86/tlb_info: get last level TLB entry number of CPU") Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250304085152.51092-4-darwi@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13x86/cpu: Validate CPUID leaf 0x2 EDX outputAhmed S. Darwish
commit 1881148215c67151b146450fb89ec22fd92337a7 upstream. CPUID leaf 0x2 emits one-byte descriptors in its four output registers EAX, EBX, ECX, and EDX. For these descriptors to be valid, the most significant bit (MSB) of each register must be clear. Leaf 0x2 parsing at intel.c only validated the MSBs of EAX, EBX, and ECX, but left EDX unchecked. Validate EDX's most-significant bit as well. Fixes: e0ba94f14f74 ("x86/tlb_info: get last level TLB entry number of CPU") Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250304085152.51092-3-darwi@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13x86/cacheinfo: Validate CPUID leaf 0x2 EDX outputAhmed S. Darwish
commit 8177c6bedb7013cf736137da586cf783922309dd upstream. CPUID leaf 0x2 emits one-byte descriptors in its four output registers EAX, EBX, ECX, and EDX. For these descriptors to be valid, the most significant bit (MSB) of each register must be clear. The historical Git commit: 019361a20f016 ("- pre6: Intel: start to add Pentium IV specific stuff (128-byte cacheline etc)...") introduced leaf 0x2 output parsing. It only validated the MSBs of EAX, EBX, and ECX, but left EDX unchecked. Validate EDX's most-significant bit. Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250304085152.51092-2-darwi@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13x86/boot: Sanitize boot params before parsing command lineArd Biesheuvel
commit c00b413a96261faef4ce22329153c6abd4acef25 upstream. The 5-level paging code parses the command line to look for the 'no5lvl' string, and does so very early, before sanitize_boot_params() has been called and has been given the opportunity to wipe bogus data from the fields in boot_params that are not covered by struct setup_header, and are therefore supposed to be initialized to zero by the bootloader. This triggers an early boot crash when using syslinux-efi to boot a recent kernel built with CONFIG_X86_5LEVEL=y and CONFIG_EFI_STUB=n, as the 0xff padding that now fills the unused PE/COFF header is copied into boot_params by the bootloader, and interpreted as the top half of the command line pointer. Fix this by sanitizing the boot_params before use. Note that there is no harm in calling this more than once; subsequent invocations are able to spot that the boot_params have already been cleaned up. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> # v6.1+ Link: https://lore.kernel.org/r/20250306155915.342465-2-ardb+git@google.com Closes: https://lore.kernel.org/all/202503041549.35913.ulrich.gemkow@ikr.uni-stuttgart.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13x86/microcode/AMD: Add some forgotten models to the SHA checkBorislav Petkov (AMD)
commit 058a6bec37c6c3b826158f6d26b75de43816a880 upstream. Add some more forgotten models to the SHA check. Fixes: 50cef76d5cb0 ("x86/microcode/AMD: Load only SHA256-checksummed patches") Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Toralf Förster <toralf.foerster@gmx.de> Link: https://lore.kernel.org/r/20250307220256.11816-1-bp@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-13x86/amd_nb: Use rdmsr_safe() in amd_get_mmconfig_range()Andrew Cooper
commit 14cb5d83068ecf15d2da6f7d0e9ea9edbcbc0457 upstream. Xen doesn't offer MSR_FAM10H_MMIO_CONF_BASE to all guests. This results in the following warning: unchecked MSR access error: RDMSR from 0xc0010058 at rIP: 0xffffffff8101d19f (xen_do_read_msr+0x7f/0xa0) Call Trace: xen_read_msr+0x1e/0x30 amd_get_mmconfig_range+0x2b/0x80 quirk_amd_mmconfig_area+0x28/0x100 pnp_fixup_device+0x39/0x50 __pnp_add_device+0xf/0x150 pnp_add_device+0x3d/0x100 pnpacpi_add_device_handler+0x1f9/0x280 acpi_ns_get_device_callback+0x104/0x1c0 acpi_ns_walk_namespace+0x1d0/0x260 acpi_get_devices+0x8a/0xb0 pnpacpi_init+0x50/0x80 do_one_initcall+0x46/0x2e0 kernel_init_freeable+0x1da/0x2f0 kernel_init+0x16/0x1b0 ret_from_fork+0x30/0x50 ret_from_fork_asm+0x1b/0x30 based on quirks for a "PNP0c01" device. Treating MMCFG as disabled is the right course of action, so no change is needed there. This was most likely exposed by fixing the Xen MSR accessors to not be silently-safe. Fixes: 3fac3734c43a ("xen/pv: support selecting safe/unsafe msr accesses") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250307002846.3026685-1-andrew.cooper3@citrix.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07x86/microcode/AMD: Load only SHA256-checksummed patchesBorislav Petkov (AMD)
commit 50cef76d5cb0e199cda19f026842560f6eedc4f7 upstream. Load patches for which the driver carries a SHA256 checksum of the patch blob. This can be disabled by adding "microcode.amd_sha_check=off" on the kernel cmdline. But it is highly NOT recommended. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07x86/microcode/AMD: Add get_patch_level()Borislav Petkov (AMD)
commit 037e81fb9d2dfe7b31fd97e5f578854e38f09887 upstream. Put the MSR_AMD64_PATCH_LEVEL reading of the current microcode revision the hw has, into a separate function. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20250211163648.30531-6-bp@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07x86/microcode/AMD: Get rid of the _load_microcode_amd() forward declarationBorislav Petkov (AMD)
commit b39c387164879eef71886fc93cee5ca7dd7bf500 upstream. Simply move save_microcode_in_initrd() down. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20250211163648.30531-5-bp@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07x86/microcode/AMD: Merge early_apply_microcode() into its single callsiteBorislav Petkov (AMD)
commit dc15675074dcfd79a2f10a6e39f96b0244961a01 upstream. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20250211163648.30531-4-bp@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07x86/microcode/AMD: Remove unused save_microcode_in_initrd_amd() declarationsBorislav Petkov (AMD)
commit 3ef0740d10b005a45e8ae5b4b7b5d37bfddf63c0 upstream. Commit a7939f016720 ("x86/microcode/amd: Cache builtin/initrd microcode early") renamed it to save_microcode_in_initrd() and made it static. Zap the forgotten declarations. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20250211163648.30531-3-bp@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07x86/microcode/AMD: Remove ugly linebreak in __verify_patch_section() signatureBorislav Petkov (AMD)
commit 7103f0589ac220eac3d2b1e8411494b31b883d06 upstream. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20250211163648.30531-2-bp@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07x86/microcode/AMD: Have __apply_microcode_amd() return boolBorislav Petkov (AMD)
commit 78e0aadbd4c6807a06a9d25bc190fe515d3f3c42 upstream This is the natural thing to do anyway. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07x86/microcode/AMD: Return bool from find_blobs_in_containers()Nikolay Borisov
commit a85c08aaa665b5436d325f6d7138732a0e1315ce upstream. Instead of open-coding the check for size/data move it inside the function and make it return a boolean indicating whether data was found or not. No functional changes. [ bp: Write @ret in find_blobs_in_containers() only on success. ] Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241018155151.702350-2-nik.borisov@suse.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07perf/x86: Fix low freqency setting issueKan Liang
commit 88ec7eedbbd21cad38707620ad6c48a4e9a87c18 upstream. Perf doesn't work at low frequencies: $ perf record -e cpu_core/instructions/ppp -F 120 Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu_core/instructions/ppp). "dmesg | grep -i perf" may provide additional information. The limit_period() check avoids a low sampling period on a counter. It doesn't intend to limit the frequency. The check in the x86_pmu_hw_config() should be limited to non-freq mode. The attr.sample_period and attr.sample_freq are union. The attr.sample_period should not be used to indicate the frequency mode. Fixes: c46e665f0377 ("perf/x86: Add INST_RETIRED.ALL workarounds") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250117151913.3043942-1-kan.liang@linux.intel.com Closes: https://lore.kernel.org/lkml/20250115154949.3147-1-ravi.bangoria@amd.com/ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-07x86/CPU: Fix warm boot hang regression on AMD SC1100 SoC systemsRussell Senior
[ Upstream commit bebe35bb738b573c32a5033499cd59f20293f2a3 ] I still have some Soekris net4826 in a Community Wireless Network I volunteer with. These devices use an AMD SC1100 SoC. I am running OpenWrt on them, which uses a patched kernel, that naturally has evolved over time. I haven't updated the ones in the field in a number of years (circa 2017), but have one in a test bed, where I have intermittently tried out test builds. A few years ago, I noticed some trouble, particularly when "warm booting", that is, doing a reboot without removing power, and noticed the device was hanging after the kernel message: [ 0.081615] Working around Cyrix MediaGX virtual DMA bugs. If I removed power and then restarted, it would boot fine, continuing through the message above, thusly: [ 0.081615] Working around Cyrix MediaGX virtual DMA bugs. [ 0.090076] Enable Memory-Write-back mode on Cyrix/NSC processor. [ 0.100000] Enable Memory access reorder on Cyrix/NSC processor. [ 0.100070] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0 [ 0.110058] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0 [ 0.120037] CPU: NSC Geode(TM) Integrated Processor by National Semi (family: 0x5, model: 0x9, stepping: 0x1) [...] In order to continue using modern tools, like ssh, to interact with the software on these old devices, I need modern builds of the OpenWrt firmware on the devices. I confirmed that the warm boot hang was still an issue in modern OpenWrt builds (currently using a patched linux v6.6.65). Last night, I decided it was time to get to the bottom of the warm boot hang, and began bisecting. From preserved builds, I narrowed down the bisection window from late February to late May 2019. During this period, the OpenWrt builds were using 4.14.x. I was able to build using period-correct Ubuntu 18.04.6. After a number of bisection iterations, I identified a kernel bump from 4.14.112 to 4.14.113 as the commit that introduced the warm boot hang. https://github.com/openwrt/openwrt/commit/07aaa7e3d62ad32767d7067107db64b6ade81537 Looking at the upstream changes in the stable kernel between 4.14.112 and 4.14.113 (tig v4.14.112..v4.14.113), I spotted a likely suspect: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=20afb90f730982882e65b01fb8bdfe83914339c5 So, I tried reverting just that kernel change on top of the breaking OpenWrt commit, and my warm boot hang went away. Presumably, the warm boot hang is due to some register not getting cleared in the same way that a loss of power does. That is approximately as much as I understand about the problem. More poking/prodding and coaching from Jonas Gorski, it looks like this test patch fixes the problem on my board: Tested against v6.6.67 and v4.14.113. Fixes: 18fb053f9b82 ("x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors") Debugged-by: Jonas Gorski <jonas.gorski@gmail.com> Signed-off-by: Russell Senior <russell@personaltelco.net> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/CAHP3WfOgs3Ms4Z+L9i0-iBOE21sdMk5erAiJurPjnrL9LSsgRA@mail.gmail.com Cc: Matthew Whitehead <tedheadster@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-27perf/x86/intel: Fix event constraints for LNCKan Liang
commit 782cffeec9ad96daa64ffb2d527b2a052fb02552 upstream. According to the latest event list, update the event constraint tables for Lion Cove core. The general rule (the event codes < 0x90 are restricted to counters 0-3.) has been removed. There is no restriction for most of the performance monitoring events. Fixes: a932aa0e868f ("perf/x86: Add Lunar Lake and Arrow Lake support") Reported-by: Amiri Khalil <amiri.khalil@intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20250219141005.2446823-1-kan.liang@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-27KVM: nVMX: Defer SVI update to vmcs01 on EOI when L2 is active w/o VIDChao Gao
[ Upstream commit 04bc93cf49d16d01753b95ddb5d4f230b809a991 ] If KVM emulates an EOI for L1's virtual APIC while L2 is active, defer updating GUEST_INTERUPT_STATUS.SVI, i.e. the VMCS's cache of the highest in-service IRQ, until L1 is active, as vmcs01, not vmcs02, needs to track vISR. The missed SVI update for vmcs01 can result in L1 interrupts being incorrectly blocked, e.g. if there is a pending interrupt with lower priority than the interrupt that was EOI'd. This bug only affects use cases where L1's vAPIC is effectively passed through to L2, e.g. in a pKVM scenario where L2 is L1's depriveleged host, as KVM will only emulate an EOI for L1's vAPIC if Virtual Interrupt Delivery (VID) is disabled in vmc12, and L1 isn't intercepting L2 accesses to its (virtual) APIC page (or if x2APIC is enabled, the EOI MSR). WARN() if KVM updates L1's ISR while L2 is active with VID enabled, as an EOI from L2 is supposed to affect L2's vAPIC, but still defer the update, to try to keep L1 alive. Specifically, KVM forwards all APICv-related VM-Exits to L1 via nested_vmx_l1_wants_exit(): case EXIT_REASON_APIC_ACCESS: case EXIT_REASON_APIC_WRITE: case EXIT_REASON_EOI_INDUCED: /* * The controls for "virtualize APIC accesses," "APIC- * register virtualization," and "virtual-interrupt * delivery" only come from vmcs12. */ return true; Fixes: c7c9c56ca26f ("x86, apicv: add virtual interrupt delivery support") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/kvm/20230312180048.1778187-1-jason.cj.chen@intel.com Reported-by: Markku Ahvenjärvi <mankku@gmail.com> Closes: https://lore.kernel.org/all/20240920080012.74405-1-mankku@gmail.com Cc: Janne Karhunen <janne.karhunen@gmail.com> Signed-off-by: Chao Gao <chao.gao@intel.com> [sean: drop request, handle in VMX, write changelog] Tested-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20241128000010.4051275-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-27KVM: x86: Inline kvm_get_apic_mode() in lapic.hSean Christopherson
[ Upstream commit adfec1f4591cf8c69664104eaf41e06b2e7b767e ] Inline kvm_get_apic_mode() in lapic.h to avoid a CALL+RET as well as an export. The underlying kvm_apic_mode() helper is public information, i.e. there is no state/information that needs to be hidden from vendor modules. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-5-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Stable-dep-of: 04bc93cf49d1 ("KVM: nVMX: Defer SVI update to vmcs01 on EOI when L2 is active w/o VID") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-27KVM: x86: Get vcpu->arch.apic_base directly and drop kvm_get_apic_base()Sean Christopherson
[ Upstream commit d91060e342a66b52d9bd64f0b123b9c306293b76 ] Access KVM's emulated APIC base MSR value directly instead of bouncing through a helper, as there is no reason to add a layer of indirection, and there are other MSRs with a "set" but no "get", e.g. EFER. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-4-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Stable-dep-of: 04bc93cf49d1 ("KVM: nVMX: Defer SVI update to vmcs01 on EOI when L2 is active w/o VID") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21x86/static-call: Remove early_boot_irqs_disabled check to fix Xen PVH dom0Andrew Cooper
commit 5cc2db37124bb33914996d6fdbb2ddb3811f2945 upstream. __static_call_update_early() has a check for early_boot_irqs_disabled, but is used before early_boot_irqs_disabled is set up in start_kernel(). Xen PV has always special cased early_boot_irqs_disabled, but Xen PVH does not and falls over the BUG when booting as dom0. It is very suspect that early_boot_irqs_disabled starts as 0, becomes 1 for a time, then becomes 0 again, but as this needs backporting to fix a breakage in a security fix, dropping the BUG_ON() is the far safer option. Fixes: 0ef8047b737d ("x86/static-call: provide a way to do very early static-call updates") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219620 Reported-by: Alex Zenla <alex@edera.dev> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Alex Zenla <alex@edera.dev> Link: https://lore.kernel.org/r/20241221211046.6475-1-andrew.cooper3@citrix.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21perf/x86/intel: Ensure LBRs are disabled when a CPU is startingSean Christopherson
commit c631a2de7ae48d50434bdc205d901423f8577c65 upstream. Explicitly clear DEBUGCTL.LBR when a CPU is starting, prior to purging the LBR MSRs themselves, as at least one system has been found to transfer control to the kernel with LBRs enabled (it's unclear whether it's a BIOS flaw or a CPU goof). Because the kernel preserves the original DEBUGCTL, even when toggling LBRs, leaving DEBUGCTL.LBR as is results in running with LBRs enabled at all times. Closes: https://lore.kernel.org/all/c9d8269bff69f6359731d758e3b1135dedd7cc61.camel@redhat.com Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20250131010721.470503-1-seanjc@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAFKan Liang
commit 47a973fd75639fe80d59f9e1860113bb2a0b112b upstream. The EAX of the CPUID Leaf 023H enumerates the mask of valid sub-leaves. To tell the availability of the sub-leaf 1 (enumerate the counter mask), perf should check the bit 1 (0x2) of EAS, rather than bit 0 (0x1). The error is not user-visible on bare metal. Because the sub-leaf 0 and the sub-leaf 1 are always available. However, it may bring issues in a virtualization environment when a VMM only enumerates the sub-leaf 0. Introduce the cpuid35_e?x to replace the macros, which makes the implementation style consistent. Fixes: eb467aaac21e ("perf/x86/intel: Support Architectural PerfMon Extension leaf") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20250129154820.3755948-3-kan.liang@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21KVM: nSVM: Enter guest mode before initializing nested NPT MMUSean Christopherson
commit 46d6c6f3ef0eaff71c2db6d77d4e2ebb7adac34f upstream. When preparing vmcb02 for nested VMRUN (or state restore), "enter" guest mode prior to initializing the MMU for nested NPT so that guest_mode is set in the MMU's role. KVM's model is that all L2 MMUs are tagged with guest_mode, as the behavior of hypervisor MMUs tends to be significantly different than kernel MMUs. Practically speaking, the bug is relatively benign, as KVM only directly queries role.guest_mode in kvm_mmu_free_guest_mode_roots() and kvm_mmu_page_ad_need_write_protect(), which SVM doesn't use, and in paths that are optimizations (mmu_page_zap_pte() and shadow_mmu_try_split_huge_pages()). And while the role is incorprated into shadow page usage, because nested NPT requires KVM to be using NPT for L1, reusing shadow pages across L1 and L2 is impossible as L1 MMUs will always have direct=1, while L2 MMUs will have direct=0. Hoist the TLB processing and setting of HF_GUEST_MASK to the beginning of the flow instead of forcing guest_mode in the MMU, as nothing in nested_vmcb02_prepare_control() between the old and new locations touches TLB flush requests or HF_GUEST_MASK, i.e. there's no reason to present inconsistent vCPU state to the MMU. Fixes: 69cb877487de ("KVM: nSVM: move MMU setup to nested_prepare_vmcb_control") Cc: stable@vger.kernel.org Reported-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://lore.kernel.org/r/20250130010825.220346-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loopSean Christopherson
commit c2fee09fc167c74a64adb08656cb993ea475197e upstream. Move the conditional loading of hardware DR6 with the guest's DR6 value out of the core .vcpu_run() loop to fix a bug where KVM can load hardware with a stale vcpu->arch.dr6. When the guest accesses a DR and host userspace isn't debugging the guest, KVM disables DR interception and loads the guest's values into hardware on VM-Enter and saves them on VM-Exit. This allows the guest to access DRs at will, e.g. so that a sequence of DR accesses to configure a breakpoint only generates one VM-Exit. For DR0-DR3, the logic/behavior is identical between VMX and SVM, and also identical between KVM_DEBUGREG_BP_ENABLED (userspace debugging the guest) and KVM_DEBUGREG_WONT_EXIT (guest using DRs), and so KVM handles loading DR0-DR3 in common code, _outside_ of the core kvm_x86_ops.vcpu_run() loop. But for DR6, the guest's value doesn't need to be loaded into hardware for KVM_DEBUGREG_BP_ENABLED, and SVM provides a dedicated VMCB field whereas VMX requires software to manually load the guest value, and so loading the guest's value into DR6 is handled by {svm,vmx}_vcpu_run(), i.e. is done _inside_ the core run loop. Unfortunately, saving the guest values on VM-Exit is initiated by common x86, again outside of the core run loop. If the guest modifies DR6 (in hardware, when DR interception is disabled), and then the next VM-Exit is a fastpath VM-Exit, KVM will reload hardware DR6 with vcpu->arch.dr6 and clobber the guest's actual value. The bug shows up primarily with nested VMX because KVM handles the VMX preemption timer in the fastpath, and the window between hardware DR6 being modified (in guest context) and DR6 being read by guest software is orders of magnitude larger in a nested setup. E.g. in non-nested, the VMX preemption timer would need to fire precisely between #DB injection and the #DB handler's read of DR6, whereas with a KVM-on-KVM setup, the window where hardware DR6 is "dirty" extends all the way from L1 writing DR6 to VMRESUME (in L1). L1's view: ========== <L1 disables DR interception> CPU 0/KVM-7289 [023] d.... 2925.640961: kvm_entry: vcpu 0 A: L1 Writes DR6 CPU 0/KVM-7289 [023] d.... 2925.640963: <hack>: Set DRs, DR6 = 0xffff0ff1 B: CPU 0/KVM-7289 [023] d.... 2925.640967: kvm_exit: vcpu 0 reason EXTERNAL_INTERRUPT intr_info 0x800000ec D: L1 reads DR6, arch.dr6 = 0 CPU 0/KVM-7289 [023] d.... 2925.640969: <hack>: Sync DRs, DR6 = 0xffff0ff0 CPU 0/KVM-7289 [023] d.... 2925.640976: kvm_entry: vcpu 0 L2 reads DR6, L1 disables DR interception CPU 0/KVM-7289 [023] d.... 2925.640980: kvm_exit: vcpu 0 reason DR_ACCESS info1 0x0000000000000216 CPU 0/KVM-7289 [023] d.... 2925.640983: kvm_entry: vcpu 0 CPU 0/KVM-7289 [023] d.... 2925.640983: <hack>: Set DRs, DR6 = 0xffff0ff0 L2 detects failure CPU 0/KVM-7289 [023] d.... 2925.640987: kvm_exit: vcpu 0 reason HLT L1 reads DR6 (confirms failure) CPU 0/KVM-7289 [023] d.... 2925.640990: <hack>: Sync DRs, DR6 = 0xffff0ff0 L0's view: ========== L2 reads DR6, arch.dr6 = 0 CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216 CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit: vcpu 23 reason DR_ACCESS info1 0x0000000000000216 L2 => L1 nested VM-Exit CPU 23/KVM-5046 [001] ..... 3410.005610: kvm_nested_vmexit_inject: reason: DR_ACCESS ext_inf1: 0x0000000000000216 CPU 23/KVM-5046 [001] d.... 3410.005610: kvm_entry: vcpu 23 CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_exit: vcpu 23 reason VMREAD CPU 23/KVM-5046 [001] d.... 3410.005611: kvm_entry: vcpu 23 CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason VMREAD CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_entry: vcpu 23 L1 writes DR7, L0 disables DR interception CPU 23/KVM-5046 [001] d.... 3410.005612: kvm_exit: vcpu 23 reason DR_ACCESS info1 0x0000000000000007 CPU 23/KVM-5046 [001] d.... 3410.005613: kvm_entry: vcpu 23 L0 writes DR6 = 0 (arch.dr6) CPU 23/KVM-5046 [001] d.... 3410.005613: <hack>: Set DRs, DR6 = 0xffff0ff0 A: <L1 writes DR6 = 1, no interception, arch.dr6 is still '0'> B: CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_exit: vcpu 23 reason PREEMPTION_TIMER CPU 23/KVM-5046 [001] d.... 3410.005614: kvm_entry: vcpu 23 C: L0 writes DR6 = 0 (arch.dr6) CPU 23/KVM-5046 [001] d.... 3410.005614: <hack>: Set DRs, DR6 = 0xffff0ff0 L1 => L2 nested VM-Enter CPU 23/KVM-5046 [001] d.... 3410.005616: kvm_exit: vcpu 23 reason VMRESUME L0 reads DR6, arch.dr6 = 0 Reported-by: John Stultz <jstultz@google.com> Closes: https://lkml.kernel.org/r/CANDhNCq5_F3HfFYABqFGCA1bPd_%2BxgNj-iDQhH4tDk%2Bwi8iZZg%40mail.gmail.com Fixes: 375e28ffc0cf ("KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT") Fixes: d67668e9dd76 ("KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6") Cc: stable@vger.kernel.org Cc: Jim Mattson <jmattson@google.com> Tested-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20250125011833.3644371-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21KVM: x86: Reject Hyper-V's SEND_IPI hypercalls if local APIC isn't in-kernelSean Christopherson
commit a8de7f100bb5989d9c3627d3a223ee1c863f3b69 upstream. Advertise support for Hyper-V's SEND_IPI and SEND_IPI_EX hypercalls if and only if the local API is emulated/virtualized by KVM, and explicitly reject said hypercalls if the local APIC is emulated in userspace, i.e. don't rely on userspace to opt-in to KVM_CAP_HYPERV_ENFORCE_CPUID. Rejecting SEND_IPI and SEND_IPI_EX fixes a NULL-pointer dereference if Hyper-V enlightenments are exposed to the guest without an in-kernel local APIC: dump_stack+0xbe/0xfd __kasan_report.cold+0x34/0x84 kasan_report+0x3a/0x50 __apic_accept_irq+0x3a/0x5c0 kvm_hv_send_ipi.isra.0+0x34e/0x820 kvm_hv_hypercall+0x8d9/0x9d0 kvm_emulate_hypercall+0x506/0x7e0 __vmx_handle_exit+0x283/0xb60 vmx_handle_exit+0x1d/0xd0 vcpu_enter_guest+0x16b0/0x24c0 vcpu_run+0xc0/0x550 kvm_arch_vcpu_ioctl_run+0x170/0x6d0 kvm_vcpu_ioctl+0x413/0xb20 __se_sys_ioctl+0x111/0x160 do_syscal1_64+0x30/0x40 entry_SYSCALL_64_after_hwframe+0x67/0xd1 Note, checking the sending vCPU is sufficient, as the per-VM irqchip_mode can't be modified after vCPUs are created, i.e. if one vCPU has an in-kernel local APIC, then all vCPUs have an in-kernel local APIC. Reported-by: Dongjie Zou <zoudongjie@huawei.com> Fixes: 214ff83d4473 ("KVM: x86: hyperv: implement PV IPI send hypercalls") Fixes: 2bc39970e932 ("x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID") Cc: stable@vger.kernel.org Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20250118003454.2619573-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-21x86/mm/tlb: Only trim the mm_cpumask once a secondRik van Riel
[ Upstream commit 6db2526c1d694c91c6e05e2f186c085e9460f202 ] Setting and clearing CPU bits in the mm_cpumask is only ever done by the CPU itself, from the context switch code or the TLB flush code. Synchronization is handled by switch_mm_irqs_off() blocking interrupts. Sending TLB flush IPIs to CPUs that are in the mm_cpumask, but no longer running the program causes a regression in the will-it-scale tlbflush2 test. This test is contrived, but a large regression here might cause a small regression in some real world workload. Instead of always sending IPIs to CPUs that are in the mm_cpumask, but no longer running the program, send these IPIs only once a second. The rest of the time we can skip over CPUs where the loaded_mm is different from the target mm. Reported-by: kernel test roboto <oliver.sang@intel.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20241204210316.612ee573@fangorn Closes: https://lore.kernel.org/oe-lkp/202411282207.6bd28eae-lkp@intel.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-21Grab mm lock before grabbing pt lockMaksym Planeta
[ Upstream commit 6d002348789bc16e9203e9818b7a3688787e3b29 ] Function xen_pin_page calls xen_pte_lock, which in turn grab page table lock (ptlock). When locking, xen_pte_lock expect mm->page_table_lock to be held before grabbing ptlock, but this does not happen when pinning is caused by xen_mm_pin_all. This commit addresses lockdep warning below, which shows up when suspending a Xen VM. [ 3680.658422] Freezing user space processes [ 3680.660156] Freezing user space processes completed (elapsed 0.001 seconds) [ 3680.660182] OOM killer disabled. [ 3680.660192] Freezing remaining freezable tasks [ 3680.661485] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 3680.685254] [ 3680.685265] ================================== [ 3680.685269] WARNING: Nested lock was not taken [ 3680.685274] 6.12.0+ #16 Tainted: G W [ 3680.685279] ---------------------------------- [ 3680.685283] migration/0/19 is trying to lock: [ 3680.685288] ffff88800bac33c0 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: xen_pin_page+0x175/0x1d0 [ 3680.685303] [ 3680.685303] but this task is not holding: [ 3680.685308] init_mm.page_table_lock [ 3680.685311] [ 3680.685311] stack backtrace: [ 3680.685316] CPU: 0 UID: 0 PID: 19 Comm: migration/0 Tainted: G W 6.12.0+ #16 [ 3680.685324] Tainted: [W]=WARN [ 3680.685328] Stopper: multi_cpu_stop+0x0/0x120 <- __stop_cpus.constprop.0+0x8c/0xd0 [ 3680.685339] Call Trace: [ 3680.685344] <TASK> [ 3680.685347] dump_stack_lvl+0x77/0xb0 [ 3680.685356] __lock_acquire+0x917/0x2310 [ 3680.685364] lock_acquire+0xce/0x2c0 [ 3680.685369] ? xen_pin_page+0x175/0x1d0 [ 3680.685373] _raw_spin_lock_nest_lock+0x2f/0x70 [ 3680.685381] ? xen_pin_page+0x175/0x1d0 [ 3680.685386] xen_pin_page+0x175/0x1d0 [ 3680.685390] ? __pfx_xen_pin_page+0x10/0x10 [ 3680.685394] __xen_pgd_walk+0x233/0x2c0 [ 3680.685401] ? stop_one_cpu+0x91/0x100 [ 3680.685405] __xen_pgd_pin+0x5d/0x250 [ 3680.685410] xen_mm_pin_all+0x70/0xa0 [ 3680.685415] xen_pv_pre_suspend+0xf/0x280 [ 3680.685420] xen_suspend+0x57/0x1a0 [ 3680.685428] multi_cpu_stop+0x6b/0x120 [ 3680.685432] ? update_cpumasks_hier+0x7c/0xa60 [ 3680.685439] ? __pfx_multi_cpu_stop+0x10/0x10 [ 3680.685443] cpu_stopper_thread+0x8c/0x140 [ 3680.685448] ? smpboot_thread_fn+0x20/0x1f0 [ 3680.685454] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 3680.685458] smpboot_thread_fn+0xed/0x1f0 [ 3680.685462] kthread+0xde/0x110 [ 3680.685467] ? __pfx_kthread+0x10/0x10 [ 3680.685471] ret_from_fork+0x2f/0x50 [ 3680.685478] ? __pfx_kthread+0x10/0x10 [ 3680.685482] ret_from_fork_asm+0x1a/0x30 [ 3680.685489] </TASK> [ 3680.685491] [ 3680.685491] other info that might help us debug this: [ 3680.685497] 1 lock held by migration/0/19: [ 3680.685500] #0: ffffffff8284df38 (pgd_lock){+.+.}-{3:3}, at: xen_mm_pin_all+0x14/0xa0 [ 3680.685512] [ 3680.685512] stack backtrace: [ 3680.685518] CPU: 0 UID: 0 PID: 19 Comm: migration/0 Tainted: G W 6.12.0+ #16 [ 3680.685528] Tainted: [W]=WARN [ 3680.685531] Stopper: multi_cpu_stop+0x0/0x120 <- __stop_cpus.constprop.0+0x8c/0xd0 [ 3680.685538] Call Trace: [ 3680.685541] <TASK> [ 3680.685544] dump_stack_lvl+0x77/0xb0 [ 3680.685549] __lock_acquire+0x93c/0x2310 [ 3680.685554] lock_acquire+0xce/0x2c0 [ 3680.685558] ? xen_pin_page+0x175/0x1d0 [ 3680.685562] _raw_spin_lock_nest_lock+0x2f/0x70 [ 3680.685568] ? xen_pin_page+0x175/0x1d0 [ 3680.685572] xen_pin_page+0x175/0x1d0 [ 3680.685578] ? __pfx_xen_pin_page+0x10/0x10 [ 3680.685582] __xen_pgd_walk+0x233/0x2c0 [ 3680.685588] ? stop_one_cpu+0x91/0x100 [ 3680.685592] __xen_pgd_pin+0x5d/0x250 [ 3680.685596] xen_mm_pin_all+0x70/0xa0 [ 3680.685600] xen_pv_pre_suspend+0xf/0x280 [ 3680.685607] xen_suspend+0x57/0x1a0 [ 3680.685611] multi_cpu_stop+0x6b/0x120 [ 3680.685615] ? update_cpumasks_hier+0x7c/0xa60 [ 3680.685620] ? __pfx_multi_cpu_stop+0x10/0x10 [ 3680.685625] cpu_stopper_thread+0x8c/0x140 [ 3680.685629] ? smpboot_thread_fn+0x20/0x1f0 [ 3680.685634] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 3680.685638] smpboot_thread_fn+0xed/0x1f0 [ 3680.685642] kthread+0xde/0x110 [ 3680.685645] ? __pfx_kthread+0x10/0x10 [ 3680.685649] ret_from_fork+0x2f/0x50 [ 3680.685654] ? __pfx_kthread+0x10/0x10 [ 3680.685657] ret_from_fork_asm+0x1a/0x30 [ 3680.685662] </TASK> [ 3680.685267] xen:grant_table: Grant tables using version 1 layout [ 3680.685921] OOM killer enabled. [ 3680.685934] Restarting tasks ... done. Signed-off-by: Maksym Planeta <maksym@exostellar.io> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20241204103516.3309112-1-maksym@exostellar.io> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>