Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Significant patch series in this pull request:
- "squashfs: Remove page->mapping references" (Matthew Wilcox) gets
us closer to being able to remove page->mapping
- "relayfs: misc changes" (Jason Xing) does some maintenance and
minor feature addition work in relayfs
- "kdump: crashkernel reservation from CMA" (Jiri Bohac) switches
us from static preallocation of the kdump crashkernel's working
memory over to dynamic allocation. So the difficulty of a-priori
estimation of the second kernel's needs is removed and the first
kernel obtains extra memory
- "generalize panic_print's dump function to be used by other
kernel parts" (Feng Tang) implements some consolidation and
rationalization of the various ways in which a failing kernel
splats information at the operator
* tag 'mm-nonmm-stable-2025-08-03-12-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (80 commits)
tools/getdelays: add backward compatibility for taskstats version
kho: add test for kexec handover
delaytop: enhance error logging and add PSI feature description
samples: Kconfig: fix spelling mistake "instancess" -> "instances"
fat: fix too many log in fat_chain_add()
scripts/spelling.txt: add notifer||notifier to spelling.txt
xen/xenbus: fix typo "notifer"
net: mvneta: fix typo "notifer"
drm/xe: fix typo "notifer"
cxl: mce: fix typo "notifer"
KVM: x86: fix typo "notifer"
MAINTAINERS: add maintainers for delaytop
ucount: use atomic_long_try_cmpxchg() in atomic_long_inc_below()
ucount: fix atomic_long_inc_below() argument type
kexec: enable CMA based contiguous allocation
stackdepot: make max number of pools boot-time configurable
lib/xxhash: remove unused functions
init/Kconfig: restore CONFIG_BROKEN help text
lib/raid6: update recov_rvv.c zero page usage
docs: update docs after introducing delaytop
...
|
|
Patch series "treewide: Fix typo "notifer"", v3.
There are some spelling mistakes of 'notifer' in comments which
should be 'notifier'.
Fix them and add it to scripts/spelling.txt.
This patch (of 8):
There are some spelling mistakes of 'notifer' which should be 'notifier'.
Link: https://lkml.kernel.org/r/576F0D85F6853074+20250722072734.19367-1-wangyuli@uniontech.com
Link: https://lkml.kernel.org/r/7F05778C3A1A9F8B+20250722073431.21983-1-wangyuli@uniontech.com
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
KVM SEV cache maintenance changes for 6.17
- Drop a superfluous WBINVD (on all CPUs!) when destroying a VM.
- Use WBNOINVD instead of WBINVD when possible, for SEV cache maintenance,
e.g. to minimize collateral damage when reclaiming memory from an SEV guest.
- When reclaiming memory from an SEV guest, only do cache flushes on CPUs that
have ever run a vCPU for the guest, i.e. don't flush the caches for CPUs
that can't possibly have cache lines with dirty, encrypted data.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD
Immutable branch for KVM tree to put the KVM patches from
https://lore.kernel.org/r/20250522233733.3176144-1-seanjc@google.com
ontop.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
KVM SVM changes for 6.17
Drop KVM's rejection of SNP's SMT and single-socket policy restrictions, and
instead rely on firmware to verify that the policy can actually be supported.
Don't bother checking that requested policy(s) can actually be satisfied, as
an incompatible policy doesn't put the kernel at risk in any way, and providing
guarantees with respect to the physical topology is outside of KVM's purview.
|
|
KVM local APIC changes for 6.17
Extract many of KVM's helpers for accessing architectural local APIC state
to common x86 so that they can be shared by guest-side code for Secure AVIC.
|
|
KVM x86 MMU changes for 6.17
- Exempt nested EPT from the the !USER + CR0.WP logic, as EPT doesn't interact
with CR0.WP.
- Move the TDX hardware setup code to tdx.c to better co-locate TDX code
and eliminate a few global symbols.
- Dynamically allocation the shadow MMU's hashed page list, and defer
allocating the hashed list until it's actually needed (the TDP MMU doesn't
use the list).
|
|
KVM x86 misc changes for 6.17
- Prevert the host's DEBUGCTL.FREEZE_IN_SMM (Intel only) when running the
guest. Failure to honor FREEZE_IN_SMM can bleed host state into the guest.
- Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter (Intel only) to
prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF.
- Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the
vCPU's CPUID model.
- Rework the MSR interception code so that the SVM and VMX APIs are more or
less identical.
- Recalculate all MSR intercepts from the "source" on MSR filter changes, and
drop the dedicated "shadow" bitmaps (and their awful "max" size defines).
- WARN and reject loading kvm-amd.ko instead of panicking the kernel if the
nested SVM MSRPM offsets tracker can't handle an MSR.
- Advertise support for LKGS (Load Kernel GS base), a new instruction that's
loosely related to FRED, but is supported and enumerated independently.
- Fix a user-triggerable WARN that syzkaller found by stuffing INIT_RECEIVED,
a.k.a. WFS, and then putting the vCPU into VMX Root Mode (post-VMXON). Use
the same approach KVM uses for dealing with "impossible" emulation when
running a !URG guest, and simply wait until KVM_RUN to detect that the vCPU
has architecturally impossible state.
- Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of
APERF/MPERF reads, so that a "properly" configured VM can "virtualize"
APERF/MPERF (with many caveats).
- Reject KVM_SET_TSC_KHZ if vCPUs have been created, as changing the "default"
frequency is unsupported for VMs with a "secure" TSC, and there's no known
use case for changing the default frequency for other VM types.
|
|
into HEAD
KVM VFIO device assignment cleanups for 6.17
Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking now
that KVM no longer uses assigned_device_count as a bad heuristic for "VM has
an irqbypass producer" or for "VM has access to host MMIO".
|
|
KVM MMIO Stale Data mitigation cleanup for 6.17
Rework KVM's mitigation for the MMIO State Data vulnerability to track
whether or not a vCPU has access to (host) MMIO based on the MMU that will be
used when running in the guest. The current approach doesn't actually detect
whether or not a guest has access to MMIO, and is prone to false negatives (and
to a lesser extent, false positives), as KVM_DEV_VFIO_FILE_ADD is optional, and
obviously only covers VFIO devices.
|
|
KVM IRQ changes for 6.17
- Rework irqbypass to track/match producers and consumers via an xarray
instead of a linked list. Using a linked list leads to O(n^2) insertion
times, which is hugely problematic for use cases that create large numbers
of VMs. Such use cases typically don't actually use irqbypass, but
eliminating the pointless registration is a future problem to solve as it
likely requires new uAPI.
- Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *",
to avoid making a simple concept unnecessarily difficult to understand.
- Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC,
and PIT emulation at compile time.
- Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c.
- Fix a variety of flaws and bugs in the AVIC device posted IRQ code.
- Inhibited AVIC if a vCPU's ID is too big (relative to what hardware
supports) instead of rejecting vCPU creation.
- Extend enable_ipiv module param support to SVM, by simply leaving IsRunning
clear in the vCPU's physical ID table entry.
- Disable IPI virtualization, via enable_ipiv, if the CPU is affected by
erratum #1235, to allow (safely) enabling AVIC on such CPUs.
- Dedup x86's device posted IRQ code, as the vast majority of functionality
can be shared verbatime between SVM and VMX.
- Harden the device posted IRQ code against bugs and runtime errors.
- Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1)
instead of O(n).
- Generate GA Log interrupts if and only if the target vCPU is blocking, i.e.
only if KVM needs a notification in order to wake the vCPU.
- Decouple device posted IRQs from VFIO device assignment, as binding a VM to
a VFIO group is not a requirement for enabling device posted IRQs.
- Clean up and document/comment the irqfd assignment code.
- Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e.
ensure an eventfd is bound to at most one irqfd through the entire host,
and add a selftest to verify eventfd:irqfd bindings are globally unique.
|
|
kvm_xen_schedop_poll does a kmalloc_array() when a VM polls the host
for more than one event channel potr (nr_ports > 1).
After the kmalloc_array(), the error paths need to go through the
"out" label, but the call to kvm_read_guest_virt() does not.
Fixes: 92c58965e965 ("KVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly")
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Manuel Andreas <manuel.andreas@tum.de>
[Adjusted commit message. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM TDX fixes for 6.16
- Fix a formatting goof in the TDX documentation.
- Reject KVM_SET_TSC_KHZ for guests with a protected TSC (currently only TDX).
- Ensure struct kvm_tdx_capabilities fields that are not explicitly set by KVM
are zeroed.
|
|
Remove TDVMCALLINFO_GET_QUOTE from user_tdvmcallinfo_1_r11 reported to
userspace to align with the direction of the GHCI spec.
Recently, concern was raised about a gap in the GHCI spec that left
ambiguity in how to expose to the guest that only a subset of GHCI
TDVMCalls were supported. During the back and forth on the spec details[0],
<GetQuote> was moved from an individually enumerable TDVMCall, to one that
is part of the 'base spec', meaning it doesn't have a specific bit in the
<GetTDVMCallInfo> return values. Although the spec[1] is still in draft
form, the GetQoute part has been agreed by the major TDX VMMs.
Unfortunately the commits that were upstreamed still treat <GetQuote> as
individually enumerable. They set bit 0 in the user_tdvmcallinfo_1_r11
which is reported to userspace to tell supported optional TDVMCalls,
intending to say that <GetQuote> is supported.
So stop reporting <GetQute> in user_tdvmcallinfo_1_r11 to align with
the direction of the spec, and allow some future TDVMCall to use that bit.
[0] https://lore.kernel.org/all/aEmuKII8FGU4eQZz@google.com/
[1] https://cdrdv2.intel.com/v1/dl/getContent/858626
Fixes: 28224ef02b56 ("KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities")
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20250717022010.677645-1-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Zero-allocate the kernel's kvm_tdx_capabilities structure and copy only
the number of CPUID entries from the userspace structure. As is, KVM
doesn't ensure kernel_tdvmcallinfo_1_{r11,r12} and user_tdvmcallinfo_1_r12
are zeroed, i.e. KVM will reflect whatever happens to be in the userspace
structure back at userspace, and thus may report garbage to userspace.
Zeroing the entire kernel structure also provides better semantics for the
reserved field. E.g. if KVM extends kvm_tdx_capabilities to enumerate new
information by repurposing bytes from the reserved field, userspace would
be required to zero the new field in order to get useful information back
(because older KVMs without support for the repurposed field would report
garbage, a la the aforementioned tdvmcallinfo bugs).
Fixes: 61bb28279623 ("KVM: TDX: Get system-wide info about TDX module on initialization")
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Closes: https://lore.kernel.org/all/3ef581f1-1ff1-4b99-b216-b316f6415318@intel.com
Tested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250714221928.1788095-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Reject KVM_SET_TSC_KHZ vCPU ioctl if guest's TSC is protected and not
changeable by KVM, and update the documentation to reflect it.
For such TSC protected guests, e.g. TDX guests, typically the TSC is
configured once at VM level before any vCPU are created and remains
unchanged during VM's lifetime. KVM provides the KVM_SET_TSC_KHZ VM
scope ioctl to allow the userspace VMM to configure the TSC of such VM.
After that the userspace VMM is not supposed to call the KVM_SET_TSC_KHZ
vCPU scope ioctl anymore when creating the vCPU.
The de facto userspace VMM Qemu does this for TDX guests. The upcoming
SEV-SNP guests with Secure TSC should follow.
Note, TDX support hasn't been fully released as of the "buggy" commit,
i.e. there is no established ABI to break.
Fixes: adafea110600 ("KVM: x86: Add infrastructure for secure TSC")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/71bbdf87fdd423e3ba3a45b57642c119ee2dd98c.1752444335.git.kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Reject the KVM_SET_TSC_KHZ VM ioctl when vCPUs have been created and
update the documentation to reflect it.
The VM scope KVM_SET_TSC_KHZ ioctl is used to set up the default TSC
frequency that all subsequently created vCPUs can use. It is only
intended to be called before any vCPU is created. Allowing it to be
called after that only results in confusion but nothing good.
Note this is an ABI change. But currently in Qemu (the de facto
userspace VMM) only TDX uses this VM ioctl, and it is only called once
before creating any vCPU, therefore the risk of breaking userspace is
pretty low.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/135a35223ce8d01cea06b6cef30bfe494ec85827.1752444335.git.kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
On AMD CPUs without ensuring cache consistency, each memory page
reclamation in an SEV guest triggers a call to do WBNOINVD/WBINVD on all
CPUs, thereby affecting the performance of other programs on the host.
Typically, an AMD server may have 128 cores or more, while the SEV guest
might only utilize 8 of these cores. Meanwhile, host can use qemu-affinity
to bind these 8 vCPUs to specific physical CPUs.
Therefore, keeping a record of the physical core numbers each time a vCPU
runs can help avoid flushing the cache for all CPUs every time.
Take care to allocate the cpumask used to track which CPUs have run a
vCPU when copying or moving an "encryption context", as nothing guarantees
memory in a mirror VM is a strict subset of the ASID owner, and the
destination VM for intrahost migration needs to maintain it's own set of
CPUs. E.g. for intrahost migration, if a CPU was used for the source VM
but not the destination VM, then it can only have cached memory that was
accessible to the source VM. And a CPU that was run in the source is also
used by the destination is no different than a CPU that was run in the
destination only.
Note, KVM is guaranteed to do flush caches prior to sev_vm_destroy(),
thanks to kvm_arch_guest_memory_reclaimed for SEV and SEV-ES, and
kvm_arch_gmem_invalidate() for SEV-SNP. I.e. it's safe to free the
cpumask prior to unregistering encrypted regions and freeing the ASID.
Opportunistically clean up sev_vm_destroy()'s comment regarding what is
(implicitly, what isn't) skipped for mirror VMs.
Cc: Srikanth Aithal <sraithal@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
Link: https://lore.kernel.org/r/20250522233733.3176144-9-seanjc@google.com
Link: https://lore.kernel.org/all/935a82e3-f7ad-47d7-aaaf-f3d2b62ed768@amd.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move apic_test_vector() to apic.h in order to reuse it in the Secure AVIC
guest APIC driver in later patches to test vector state in the APIC
backing page.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-14-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move apic_clear_vector() and apic_set_vector() helper functions to
apic.h in order to reuse them in the Secure AVIC guest APIC driver
in later patches to atomically set/clear vectors in the APIC backing
page.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-13-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the apic_get_reg(), apic_set_reg(), apic_get_reg64() and
apic_set_reg64() helper functions to apic.h in order to reuse them in the
Secure AVIC guest APIC driver in later patches to read/write registers
from/to the APIC backing page.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-12-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In preparation for using apic_find_highest_vector() in Secure AVIC
guest APIC driver, move it and associated macros to apic.h.
No functional change intended.
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-11-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In preparation for moving kvm-internal kvm_lapic_set_vector(),
kvm_lapic_clear_vector() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-10-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In preparation for moving kvm-internal __kvm_lapic_set_reg64(),
__kvm_lapic_get_reg64() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-9-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In preparation for moving kvm-internal __kvm_lapic_set_reg(),
__kvm_lapic_get_reg() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-8-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In preparation for moving kvm-internal find_highest_vector() to
apic.h for use in Secure AVIC APIC driver, rename find_highest_vector()
to apic_find_highest_vector() as part of the APIC API.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-7-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Change APIC base address from "char *" to "void *" in KVM
lapic's set/get helper functions. Pointer arithmetic for "void *"
and "char *" operate identically. With "void *" there is less
of a chance of doing the wrong thing, e.g. neglecting to cast and
reading a byte instead of the desired APIC register size.
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-6-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In preparation for moving most of the KVM's lapic helpers which
use VEC_POS/REG_POS macros to common APIC header for use in Secure
AVIC APIC driver, rename all VEC_POS/REG_POS macro usages to
APIC_VECTOR_TO_BIT_NUMBER/APIC_VECTOR_TO_REG_OFFSET and remove
VEC_POS/REG_POS.
While at it, clean up line wrap in find_highest_vector().
No functional change intended.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-5-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Consolidate KVM's {REG,VEC}_POS() macros and lapic_vector_set_in_irr()'s
open coded equivalent logic in anticipation of the kernel gaining more
usage of vector => reg+bit lookups.
Use lapic_vector_set_in_irr()'s math as using divides for both the bit
number and register offset makes it easier to connect the dots, and for at
least one user, fixup_irqs(), "/ 32 * 0x10" generates ever so slightly
better code with gcc-14 (shaves a whole 3 bytes from the code stream):
((v) >> 5) << 4:
c1 ef 05 shr $0x5,%edi
c1 e7 04 shl $0x4,%edi
81 c7 00 02 00 00 add $0x200,%edi
(v) / 32 * 0x10:
c1 ef 05 shr $0x5,%edi
83 c7 20 add $0x20,%edi
c1 e7 04 shl $0x4,%edi
Keep KVM's tersely named macros as "wrappers" to avoid unnecessary churn
in KVM, and because the shorter names yield more readable code overall in
KVM.
The new macros type cast the vector parameter to "unsigned int". This is
required from better code generation for cases where an "int" is passed
to these macros in KVM code.
int v;
((v) >> 5) << 4:
c1 f8 05 sar $0x5,%eax
c1 e0 04 shl $0x4,%eax
((v) / 32 * 0x10):
85 ff test %edi,%edi
8d 47 1f lea 0x1f(%rdi),%eax
0f 49 c7 cmovns %edi,%eax
c1 f8 05 sar $0x5,%eax
c1 e0 04 shl $0x4,%eax
((unsigned int)(v) / 32 * 0x10):
c1 f8 05 sar $0x5,%eax
c1 e0 04 shl $0x4,%eax
(v) & (32 - 1):
89 f8 mov %edi,%eax
83 e0 1f and $0x1f,%eax
(v) % 32
89 fa mov %edi,%edx
c1 fa 1f sar $0x1f,%edx
c1 ea 1b shr $0x1b,%edx
8d 04 17 lea (%rdi,%rdx,1),%eax
83 e0 1f and $0x1f,%eax
29 d0 sub %edx,%eax
(unsigned int)(v) % 32:
89 f8 mov %edi,%eax
83 e0 1f and $0x1f,%eax
Overall kvm.ko text size is impacted if "unsigned int" is not used.
Bin Orig New (w/o unsigned int) New (w/ unsigned int)
lapic.o 28580 28772 28580
kvm.o 670810 671002 670810
kvm.ko 708079 708271 708079
No functional change intended.
[Neeraj: Type cast vec macro param to "unsigned int", provide data
in commit log on "unsigned int" requirement]
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-4-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When doing pointer arithmetic in apic_test_vector() and
kvm_lapic_{set|clear}_vector(), remove the unnecessary
parentheses surrounding the 'bitmap' parameter.
No functional change intended.
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-3-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Remove __apic_test_and_set_vector() and __apic_test_and_clear_vector(),
because the _only_ register that's safe to modify with a non-atomic
operation is ISR, because KVM isn't running the vCPU, i.e. hardware can't
service an IRQ or process an EOI for the relevant (virtual) APIC.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-2-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
AMD CPUs currently execute WBINVD in the host when unregistering SEV
guest memory or when deactivating SEV guests. Such cache maintenance is
performed to prevent data corruption, wherein the encrypted (C=1)
version of a dirty cache line might otherwise only be written back
after the memory is written in a different context (ex: C=0), yielding
corruption. However, WBINVD is performance-costly, especially because
it invalidates processor caches.
Strictly-speaking, unless the SEV ASID is being recycled (meaning the
SNP firmware requires the use of WBINVD prior to DF_FLUSH), the cache
invalidation triggered by WBINVD is unnecessary; only the writeback is
needed to prevent data corruption in remaining scenarios.
To improve performance in these scenarios, use WBNOINVD when available
instead of WBINVD. WBNOINVD still writes back all dirty lines
(preventing host data corruption by SEV guests) but does *not*
invalidate processor caches. Note that the implementation of wbnoinvd()
ensures fall back to WBINVD if WBNOINVD is unavailable.
In anticipation of forthcoming optimizations to limit the WBNOINVD only
to physical CPUs that have executed SEV guests, place the call to
wbnoinvd_on_all_cpus() in a wrapper function sev_writeback_caches().
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250201000259.3289143-3-kevinloughlin@google.com
[sean: tweak comment regarding CLFUSH]
Cc: Francesco Lavra <francescolavra.fl@gmail.com>
Link: https://lore.kernel.org/r/20250522233733.3176144-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Before sev_vm_destroy() is called, kvm_arch_guest_memory_reclaimed()
has been called for SEV and SEV-ES and kvm_arch_gmem_invalidate()
has been called for SEV-SNP. These functions have already handled
flushing the memory. Therefore, this wbinvd_on_all_cpus() can
simply be dropped.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250522233733.3176144-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use wbinvd_on_cpu() to target a single CPU instead of open-coding an
equivalent, and drop KVM's wbinvd_ipi() now that all users have switched
to x86 library versions.
No functional change intended.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250522233733.3176144-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pull KVM fixes from Paolo Bonzini:
"Many patches, pretty much all of them small, that accumulated while I
was on vacation.
ARM:
- Remove the last leftovers of the ill-fated FPSIMD host state
mapping at EL2 stage-1
- Fix unexpected advertisement to the guest of unimplemented S2 base
granule sizes
- Gracefully fail initialising pKVM if the interrupt controller isn't
GICv3
- Also gracefully fail initialising pKVM if the carveout allocation
fails
- Fix the computing of the minimum MMIO range required for the host
on stage-2 fault
- Fix the generation of the GICv3 Maintenance Interrupt in nested
mode
x86:
- Reject SEV{-ES} intra-host migration if one or more vCPUs are
actively being created, so as not to create a non-SEV{-ES} vCPU in
an SEV{-ES} VM
- Use a pre-allocated, per-vCPU buffer for handling de-sparsification
of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too
large" issue
- Allow out-of-range/invalid Xen event channel ports when configuring
IRQ routing, to avoid dictating a specific ioctl() ordering to
userspace
- Conditionally reschedule when setting memory attributes to avoid
soft lockups when userspace converts huge swaths of memory to/from
private
- Add back MWAIT as a required feature for the MONITOR/MWAIT selftest
- Add a missing field in struct sev_data_snp_launch_start that
resulted in the guest-visible workarounds field being filled at the
wrong offset
- Skip non-canonical address when processing Hyper-V PV TLB flushes
to avoid VM-Fail on INVVPID
- Advertise supported TDX TDVMCALLs to userspace
- Pass SetupEventNotifyInterrupt arguments to userspace
- Fix TSC frequency underflow"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: avoid underflow when scaling TSC frequency
KVM: arm64: Remove kvm_arch_vcpu_run_map_fp()
KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizes
KVM: arm64: Don't free hyp pages with pKVM on GICv2
KVM: arm64: Fix error path in init_hyp_mode()
KVM: arm64: Adjust range correctly during host stage-2 faults
KVM: arm64: nv: Fix MI line level calculation in vgic_v3_nested_update_mi()
KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush
KVM: SVM: Add missing member in SNP_LAUNCH_START command structure
Documentation: KVM: Fix unexpected unindent warnings
KVM: selftests: Add back the missing check of MONITOR/MWAIT availability
KVM: Allow CPU to reschedule while setting per-page memory attributes
KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.
KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks
KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL
KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight
KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
|
|
Extract KVM's open-coded calls to do writeback caches on multiple CPUs to
common library helpers for both WBINVD and WBNOINVD (KVM will use both).
Put the onus on the caller to check for a non-empty mask to simplify the
SMP=n implementation, e.g. so that it doesn't need to check that the one
and only CPU in the system is present in the mask.
[sean: move to lib, add SMP=n helpers, clarify usage]
Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250128015345.7929-2-szy0127@sjtu.edu.cn
Link: https://lore.kernel.org/20250522233733.3176144-5-seanjc@google.com
|
|
In function kvm_guest_time_update(), __scale_tsc() is used to calculate
a TSC *frequency* rather than a TSC value. With low-enough ratios,
a TSC value that is less than 1 would underflow to 0 and to an infinite
while loop in kvm_get_time_scale():
kvm_guest_time_update(struct kvm_vcpu *v)
if (kvm_caps.has_tsc_control)
tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz,
v->arch.l1_tsc_scaling_ratio);
__scale_tsc(u64 ratio, u64 tsc)
ratio=122380531, tsc=2299998, N=48
ratio*tsc >> N = 0.999... -> 0
Later in the function:
Call Trace:
<TASK>
kvm_get_time_scale arch/x86/kvm/x86.c:2458 [inline]
kvm_guest_time_update+0x926/0xb00 arch/x86/kvm/x86.c:3268
vcpu_enter_guest.constprop.0+0x1e70/0x3cf0 arch/x86/kvm/x86.c:10678
vcpu_run+0x129/0x8d0 arch/x86/kvm/x86.c:11126
kvm_arch_vcpu_ioctl_run+0x37a/0x13d0 arch/x86/kvm/x86.c:11352
kvm_vcpu_ioctl+0x56b/0xe60 virt/kvm/kvm_main.c:4188
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0x12d/0x190 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x59/0x110 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x78/0xe2
This can really happen only when fuzzing, since the TSC frequency
would have to be nonsensically low.
Fixes: 35181e86df97 ("KVM: x86: Add a common TSC scaling function")
Reported-by: Yuntao Liu <liuyuntao12@huawei.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs
without interception.
The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not
handled at all. The MSR values are not zeroed on vCPU creation, saved
on suspend, or restored on resume. No accommodation is made for
processor migration or for sharing a logical processor with other
tasks. No adjustments are made for non-unit TSC multipliers. The MSRs
do not account for time the same way as the comparable PMU events,
whether the PMU is virtualized by the traditional emulation method or
the new mediated pass-through approach.
Nonetheless, in a properly constrained environment, this capability
can be combined with a guest CPUID table that advertises support for
CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the
effective physical CPU frequency in /proc/cpuinfo. Moreover, there is
no performance cost for this capability.
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-3-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Store each "disabled exit" boolean in a single bit rather than a byte.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-2-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Advertise support for LKGS (load into IA32_KERNEL_GS_BASE) to userspace
if the instruction is supported by the underlying CPU.
LKGS is introduced with FRED to completely eliminate the need to swapgs
explicilty. It behaves like the MOV to GS instruction except that it
loads the base address into the IA32_KERNEL_GS_BASE MSR instead of the
GS segment’s descriptor cache, which is exactly what Linux kernel does
to load a user level GS base. Thus there is no need to SWAPGS away
from the kernel GS base.
LKGS is an independent CPU feature that works correctly in a KVM guest
without requiring explicit enablement.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250626173521.2301088-1-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add VMX_HOST_OWNED_DEBUGCTL_BITS to track which bits are host-owned, i.e.
need to be preserved when running the guest, to dedup the logic without
having to incur a memory load to get at kvm_x86_ops.HOST_OWNED_DEBUGCTL.
No functional change intended.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/all/aF1yni8U6XNkyfRf@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU speculation fixes from Borislav Petkov:
"Add the mitigation logic for Transient Scheduler Attacks (TSA)
TSA are new aspeculative side channel attacks related to the execution
timing of instructions under specific microarchitectural conditions.
In some cases, an attacker may be able to use this timing information
to infer data from other contexts, resulting in information leakage.
Add the usual controls of the mitigation and integrate it into the
existing speculation bugs infrastructure in the kernel"
* tag 'tsa_x86_bugs_for_6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/process: Move the buffer clearing before MONITOR
x86/microcode/AMD: Add TSA microcode SHAs
KVM: SVM: Advertise TSA CPUID bits to guests
x86/bugs: Add a Transient Scheduler Attacks mitigation
x86/bugs: Rename MDS machinery to something more generic
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Make sure DR6 and DR7 are initialized to their architectural values
and not accidentally cleared, leading to misconfigurations
* tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/traps: Initialize DR7 by writing its architectural reset value
x86/traps: Initialize DR6 by writing its architectural reset value
|
|
Drop kvm_arch_{start,end}_assignment() and all associated code now that
KVM x86 no longer consumes assigned_device_count. Tracking whether or not
a VFIO-assigned device is formally associated with a VM is fundamentally
flawed, as such an association is optional for general usage, i.e. is prone
to false negatives. E.g. prior to commit 2edd9cb79fb3 ("kvm: detect
assigned device via irqbypass manager"), device passthrough via VFIO would
fail to enable IRQ bypass if userspace omitted the formal VFIO<=>KVM
binding.
And device drivers that *need* the VFIO<=>KVM connection, e.g. KVM-GT,
shouldn't be relying on generic x86 tracking infrastructure.
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that KVM explicitly tracks the number of possible bypass IRQs, and
doesn't conflate IRQ bypass with host MMIO access, stop bumping the
assigned device count when adding an IRQ bypass producer.
This reverts commit 2edd9cb79fb31b0907c6e0cdce2824780cf9b153.
Link: https://lore.kernel.org/r/20250523011756.3243624-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Merge the MMIO stale data branch with the device posted IRQs branch to
provide a common base for removing KVM's tracking of "assigned" devices.
Link: https://lore.kernel.org/all/20250523011756.3243624-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use svm_set_intercept_for_msr() directly to configure IA32_XSS MSR
interception, ensuring consistency with other cases where MSRs are
intercepted depending on guest caps and CPUIDs.
No functional change intended.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250612081947.94081-3-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In KVM guests with Hyper-V hypercalls enabled, the hypercalls
HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX
allow a guest to request invalidation of portions of a virtual TLB.
For this, the hypercall parameter includes a list of GVAs that are supposed
to be invalidated.
However, when non-canonical GVAs are passed, there is currently no
filtering in place and they are eventually passed to checked invocations of
INVVPID on Intel / INVLPGA on AMD. While AMD's INVLPGA silently ignores
non-canonical addresses (effectively a no-op), Intel's INVVPID explicitly
signals VM-Fail and ultimately triggers the WARN_ONCE in invvpid_error():
invvpid failed: ext=0x0 vpid=1 gva=0xaaaaaaaaaaaaa000
WARNING: CPU: 6 PID: 326 at arch/x86/kvm/vmx/vmx.c:482
invvpid_error+0x91/0xa0 [kvm_intel]
Modules linked in: kvm_intel kvm 9pnet_virtio irqbypass fuse
CPU: 6 UID: 0 PID: 326 Comm: kvm-vm Not tainted 6.15.0 #14 PREEMPT(voluntary)
RIP: 0010:invvpid_error+0x91/0xa0 [kvm_intel]
Call Trace:
vmx_flush_tlb_gva+0x320/0x490 [kvm_intel]
kvm_hv_vcpu_flush_tlb+0x24f/0x4f0 [kvm]
kvm_arch_vcpu_ioctl_run+0x3013/0x5810 [kvm]
Hyper-V documents that invalid GVAs (those that are beyond a partition's
GVA space) are to be ignored. While not completely clear whether this
ruling also applies to non-canonical GVAs, it is likely fine to make that
assumption, and manual testing on Azure confirms "real" Hyper-V interprets
the specification in the same way.
Skip non-canonical GVAs when processing the list of address to avoid
tripping the INVVPID failure. Alternatively, KVM could filter out "bad"
GVAs before inserting into the FIFO, but practically speaking the only
downside of pushing validation to the final processing is that doing so
is suboptimal for the guest, and no well-behaved guest will request TLB
flushes for non-canonical addresses.
Fixes: 260970862c88 ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently")
Cc: stable@vger.kernel.org
Signed-off-by: Manuel Andreas <manuel.andreas@tum.de>
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/c090efb3-ef82-499f-a5e0-360fc8420fb7@tum.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Enforce the MMIO State Data mitigation if KVM has ever mapped host MMIO
into the VM, not if the VM has an assigned device. VFIO is but one of
many ways to map host MMIO into a KVM guest, and even within VFIO,
formally attaching a device to a VM via KVM_DEV_VFIO_FILE_ADD is entirely
optional.
Track whether or not the guest can access host MMIO on a per-MMU basis,
i.e. based on whether or not the vCPU has a mapping to host MMIO. For
simplicity, track MMIO mappings in "special" rools (those without a
kvm_mmu_page) at the VM level, as only Intel CPUs are vulnerable, and so
only legacy 32-bit shadow paging is affected, i.e. lack of precise
tracking is a complete non-issue.
Make the per-MMU and per-VM flags sticky. Detecting when *all* MMIO
mappings have been removed would be absurdly complex. And in practice,
removing MMIO from a guest will be done by deleting the associated memslot,
which by default will force KVM to re-allocate all roots. Special roots
will forever be mitigated, but as above, the affected scenarios are not
expected to be performance sensitive.
Use a VMX_RUN flag to communicate the need for a buffers flush to
vmx_vcpu_enter_exit() so that kvm_vcpu_can_access_host_mmio() and all its
dependencies don't need to be marked __always_inline, e.g. so that KASAN
doesn't trigger a noinstr violation.
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Fixes: 8cb861e9e3c9 ("x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data")
Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Extract a common function from MSR interception disabling logic and create
disabling and enabling functions based on it. This removes most of the
duplicated code for MSR interception disabling/enabling.
No functional change intended.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com
[sean: s/enable/set, inline the wrappers]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|