Age | Commit message (Collapse) | Author |
|
The kernel's initcall infrastructure lacks the ability to express
dependencies between initcalls, whereas the modules infrastructure
automatically handles dependencies via symbol loading. Ensure the
PSP SEV driver is initialized before proceeding in sev_hardware_setup()
if KVM is built-in as the dependency isn't handled by the initcall
infrastructure.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/f78ddb64087df27e7bcb1ae0ab53f55aa0804fab.1739226950.git.ashish.kalra@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Simplify code by replacing &to_kvm_svm(kvm)->sev_info with
to_kvm_sev_info() helper function. Wherever possible, drop the local
variable declaration and directly use the helper instead.
No functional changes.
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250123055140.144378-1-nikunj@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pull kvm updates from Paolo Bonzini:
"Loongarch:
- Clear LLBCTL if secondary mmu mapping changes
- Add hypercall service support for usermode VMM
x86:
- Add a comment to kvm_mmu_do_page_fault() to explain why KVM
performs a direct call to kvm_tdp_page_fault() when RETPOLINE is
enabled
- Ensure that all SEV code is compiled out when disabled in Kconfig,
even if building with less brilliant compilers
- Remove a redundant TLB flush on AMD processors when guest CR4.PGE
changes
- Use str_enabled_disabled() to replace open coded strings
- Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's
APICv cache prior to every VM-Enter
- Overhaul KVM's CPUID feature infrastructure to track all vCPU
capabilities instead of just those where KVM needs to manage state
and/or explicitly enable the feature in hardware. Along the way,
refactor the code to make it easier to add features, and to make it
more self-documenting how KVM is handling each feature
- Rework KVM's handling of VM-Exits during event vectoring; this
plugs holes where KVM unintentionally puts the vCPU into infinite
loops in some scenarios (e.g. if emulation is triggered by the
exit), and brings parity between VMX and SVM
- Add pending request and interrupt injection information to the
kvm_exit and kvm_entry tracepoints respectively
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU
when loading guest/host PKRU, due to a refactoring of the kernel
helpers that didn't account for KVM's pre-checking of the need to
do WRPKRU
- Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not.
Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically
went to userspace, and all the others did not; the new code need
not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not
- As part of enabling TDX virtual machines, support support
separation of private/shared EPT into separate roots.
When TDX will be enabled, operations on private pages will need to
go through the privileged TDX Module via SEAMCALLs; as a result,
they are limited and relatively slow compared to reading a PTE.
The patches included in 6.14 allow KVM to keep a mirror of the
private EPT in host memory, and define entries in kvm_x86_ops to
operate on external page tables such as the TDX private EPT
- The recently introduced conversion of the NX-page reclamation
kthread to vhost_task moved the task under the main process. The
task is created as soon as KVM_CREATE_VM was invoked and this, of
course, broke userspace that didn't expect to see any child task of
the VM process until it started creating its own userspace threads.
In particular crosvm refuses to fork() if procfs shows any child
task, so unbreak it by creating the task lazily. This is arguably a
userspace bug, as there can be other kinds of legitimate worker
tasks and they wouldn't impede fork(); but it's not like userspace
has a way to distinguish kernel worker tasks right now. Should they
show as "Kthread: 1" in proc/.../status?
x86 - Intel:
- Fix a bug where KVM updates hardware's APICv cache of the highest
ISR bit while L2 is active, while ultimately results in a
hardware-accelerated L1 EOI effectively being lost
- Honor event priority when emulating Posted Interrupt delivery
during nested VM-Enter by queueing KVM_REQ_EVENT instead of
immediately handling the interrupt
- Rework KVM's processing of the Page-Modification Logging buffer to
reap entries in the same order they were created, i.e. to mark gfns
dirty in the same order that hardware marked the page/PTE dirty
- Misc cleanups
Generic:
- Cleanup and harden kvm_set_memory_region(); add proper lockdep
assertions when setting memory regions and add a dedicated API for
setting KVM-internal memory regions. The API can then explicitly
disallow all flags for KVM-internal memory regions
- Explicitly verify the target vCPU is online in kvm_get_vcpu() to
fix a bug where KVM would return a pointer to a vCPU prior to it
being fully online, and give kvm_for_each_vcpu() similar treatment
to fix a similar flaw
- Wait for a vCPU to come online prior to executing a vCPU ioctl, to
fix a bug where userspace could coerce KVM into handling the ioctl
on a vCPU that isn't yet onlined
- Gracefully handle xarray insertion failures; even though such
failures are impossible in practice after xa_reserve(), reserving
an entry is always followed by xa_store() which does not know (or
differentiate) whether there was an xa_reserve() before or not
RISC-V:
- Zabha, Svvptc, and Ziccrse extension support for guests. None of
them require anything in KVM except for detecting them and marking
them as supported; Zabha adds byte and halfword atomic operations,
while the others are markers for specific operation of the TLB and
of LL/SC instructions respectively
- Virtualize SBI system suspend extension for Guest/VM
- Support firmware counters which can be used by the guests to
collect statistics about traps that occur in the host
Selftests:
- Rework vcpu_get_reg() to return a value instead of using an
out-param, and update all affected arch code accordingly
- Convert the max_guest_memory_test into a more generic
mmu_stress_test. The basic gist of the "conversion" is to have the
test do mprotect() on guest memory while vCPUs are accessing said
memory, e.g. to verify KVM and mmu_notifiers are working as
intended
- Play nice with treewrite builds of unsupported architectures, e.g.
arm (32-bit), as KVM selftests' Makefile doesn't do anything to
ensure the target architecture is actually one KVM selftests
supports
- Use the kernel's $(ARCH) definition instead of the target triple
for arch specific directories, e.g. arm64 instead of aarch64,
mainly so as not to be different from the rest of the kernel
- Ensure that format strings for logging statements are checked by
the compiler even when the logging statement itself is disabled
- Attempt to whack the last LLC references/misses mole in the Intel
PMU counters test by adding a data load and doing CLFLUSH{OPT} on
the data instead of the code being executed. It seems that modern
Intel CPUs have learned new code prefetching tricks that bypass the
PMU counters
- Fix a flaw in the Intel PMU counters test where it asserts that
events are counting correctly without actually knowing what the
events count given the underlying hardware; this can happen if
Intel reuses a formerly microarchitecture-specific event encoding
as an architectural event, as was the case for Top-Down Slots"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits)
kvm: defer huge page recovery vhost task to later
KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
KVM: Disallow all flags for KVM-internal memslots
KVM: x86: Drop double-underscores from __kvm_set_memory_region()
KVM: Add a dedicated API for setting KVM-internal memslots
KVM: Assert slots_lock is held when setting memory regions
KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
LoongArch: KVM: Add hypercall service support for usermode VMM
LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed
KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
KVM: VMX: read the PML log in the same order as it was written
KVM: VMX: refactor PML terminology
KVM: VMX: Fix comment of handle_vmx_instruction()
KVM: VMX: Reinstate __exit attribute for vmx_exit()
KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
KVM: x86: Use LVT_TIMER instead of an open coded literal
RISC-V: KVM: Add new exit statstics for redirected traps
RISC-V: KVM: Update firmware counters for various events
RISC-V: KVM: Redirect instruction access fault trap to guest
...
|
|
Make the completion of hypercalls go through the complete_hypercall
function pointer argument, no matter if the hypercall exits to
userspace or not. Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
specifically went to userspace, and all the others did not; the new code
need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
all whether there was an exit to userspace or not.
|
|
KVM x86 misc changes for 6.14:
- Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
instead of just those where KVM needs to manage state and/or explicitly
enable the feature in hardware. Along the way, refactor the code to make
it easier to add features, and to make it more self-documenting how KVM
is handling each feature.
- Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
where KVM unintentionally puts the vCPU into infinite loops in some scenarios
(e.g. if emulation is triggered by the exit), and brings parity between VMX
and SVM.
- Add pending request and interrupt injection information to the kvm_exit and
kvm_entry tracepoints respectively.
- Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
loading guest/host PKRU, due to a refactoring of the kernel helpers that
didn't account for KVM's pre-checking of the need to do WRPKRU.
|
|
Remove hard-coded strings by using the str_enabled_disabled() helper
function.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20241227094450.674104-2-thorsten.blum@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add and use user_exit_on_hypercall() to check if userspace wants to handle
a KVM hypercall instead of open-coding the logic everywhere.
No functional change intended.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
[sean: squash into one patch, keep explicit KVM_HC_MAP_GPA_RANGE check]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-ID: <20241128004344.4072099-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
QEMU up to 9.2.0 is assuming that vcpu->run->hypercall.ret is 0 on exit and
it never modifies it when processing KVM_EXIT_HYPERCALL. Make this explicit
in the code, to avoid breakage when KVM starts modifying that field.
This in principle is not a good idea... It would have been much better if
KVM had set the field to -KVM_ENOSYS from the beginning, so that a dumb
userspace that does nothing on KVM_EXIT_HYPERCALL would tell the guest it
does not support KVM_HC_MAP_GPA_RANGE. However, breaking userspace is
a Very Bad Thing, as everybody should know.
Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Switch all queries (except XSAVES) of guest features from guest CPUID to
guest capabilities, i.e. replace all calls to guest_cpuid_has() with calls
to guest_cpu_cap_has().
Keep guest_cpuid_has() around for XSAVES, but subsume its helper
guest_cpuid_get_register() and add a compile-time assertion to prevent
using guest_cpuid_has() for any other feature. Add yet another comment
for XSAVE to explain why KVM is allowed to query its raw guest CPUID.
Opportunistically drop the unused guest_cpuid_clear(), as there should be
no circumstance in which KVM needs to _clear_ a guest CPUID feature now
that everything is tracked via cpu_caps. E.g. KVM may need to _change_
a feature to emulate dynamic CPUID flags, but KVM should never need to
clear a feature in guest CPUID to prevent it from being used by the guest.
Delete the last remnants of the governed features framework, as the lone
holdout was vmx_adjust_secondary_exec_control()'s divergent behavior for
governed vs. ungoverned features.
Note, replacing guest_cpuid_has() checks with guest_cpu_cap_has() when
computing reserved CR4 bits is a nop when viewed as a whole, as KVM's
capabilities are already incorporated into the calculation, i.e. if a
feature is present in guest CPUID but unsupported by KVM, its CR4 bit
was already being marked as reserved, checking guest_cpu_cap_has() simply
double-stamps that it's a reserved bit.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-51-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
As the first step toward replacing KVM's so-called "governed features"
framework with a more comprehensive, less poorly named implementation,
replace the "kvm_governed_feature" function prefix with "guest_cpu_cap"
and rename guest_can_use() to guest_cpu_cap_has().
The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and
provides a more clear distinction between guest capabilities, which are
KVM controlled (heh, or one might say "governed"), and guest CPUID, which
with few exceptions is fully userspace controlled.
Opportunistically rewrite the comment about XSS passthrough for SEV-ES
guests to avoid referencing so many functions, as such comments are prone
to becoming stale (case in point...).
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Avoid unreachable() as it can (and will in the absence of UBSAN)
generate fallthrough code. Use BUG() so we get a UD2 trap (with
unreachable annotation).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20241128094312.028316261@infradead.org
|
|
Pull kvm updates from Paolo Bonzini:
"The biggest change here is eliminating the awful idea that KVM had of
essentially guessing which pfns are refcounted pages.
The reason to do so was that KVM needs to map both non-refcounted
pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP
VMAs that contain refcounted pages.
However, the result was security issues in the past, and more recently
the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by
struct page but is not refcounted. In particular this broke virtio-gpu
blob resources (which directly map host graphics buffers into the
guest as "vram" for the virtio-gpu device) with the amdgpu driver,
because amdgpu allocates non-compound higher order pages and the tail
pages could not be mapped into KVM.
This requires adjusting all uses of struct page in the
per-architecture code, to always work on the pfn whenever possible.
The large series that did this, from David Stevens and Sean
Christopherson, also cleaned up substantially the set of functions
that provided arch code with the pfn for a host virtual addresses.
The previous maze of twisty little passages, all different, is
replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the
non-__ versions of these two, and kvm_prefetch_pages) saving almost
200 lines of code.
ARM:
- Support for stage-1 permission indirection (FEAT_S1PIE) and
permission overlays (FEAT_S1POE), including nested virt + the
emulated page table walker
- Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This
call was introduced in PSCIv1.3 as a mechanism to request
hibernation, similar to the S4 state in ACPI
- Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
part of it, introduce trivial initialization of the host's MPAM
context so KVM can use the corresponding traps
- PMU support under nested virtualization, honoring the guest
hypervisor's trap configuration and event filtering when running a
nested guest
- Fixes to vgic ITS serialization where stale device/interrupt table
entries are not zeroed when the mapping is invalidated by the VM
- Avoid emulated MMIO completion if userspace has requested
synchronous external abort injection
- Various fixes and cleanups affecting pKVM, vCPU initialization, and
selftests
LoongArch:
- Add iocsr and mmio bus simulation in kernel.
- Add in-kernel interrupt controller emulation.
- Add support for virtualization extensions to the eiointc irqchip.
PPC:
- Drop lingering and utterly obsolete references to PPC970 KVM, which
was removed 10 years ago.
- Fix incorrect documentation references to non-existing ioctls
RISC-V:
- Accelerate KVM RISC-V when running as a guest
- Perf support to collect KVM guest statistics from host side
s390:
- New selftests: more ucontrol selftests and CPU model sanity checks
- Support for the gen17 CPU model
- List registers supported by KVM_GET/SET_ONE_REG in the
documentation
x86:
- Cleanup KVM's handling of Accessed and Dirty bits to dedup code,
improve documentation, harden against unexpected changes.
Even if the hardware A/D tracking is disabled, it is possible to
use the hardware-defined A/D bits to track if a PFN is Accessed
and/or Dirty, and that removes a lot of special cases.
- Elide TLB flushes when aging secondary PTEs, as has been done in
x86's primary MMU for over 10 years.
- Recover huge pages in-place in the TDP MMU when dirty page logging
is toggled off, instead of zapping them and waiting until the page
is re-accessed to create a huge mapping. This reduces vCPU jitter.
- Batch TLB flushes when dirty page logging is toggled off. This
reduces the time it takes to disable dirty logging by ~3x.
- Remove the shrinker that was (poorly) attempting to reclaim shadow
page tables in low-memory situations.
- Clean up and optimize KVM's handling of writes to
MSR_IA32_APICBASE.
- Advertise CPUIDs for new instructions in Clearwater Forest
- Quirk KVM's misguided behavior of initialized certain feature MSRs
to their maximum supported feature set, which can result in KVM
creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to
a non-zero value results in the vCPU having invalid state if
userspace hides PDCM from the guest, which in turn can lead to
save/restore failures.
- Fix KVM's handling of non-canonical checks for vCPUs that support
LA57 to better follow the "architecture", in quotes because the
actual behavior is poorly documented. E.g. most MSR writes and
descriptor table loads ignore CR4.LA57 and operate purely on
whether the CPU supports LA57.
- Bypass the register cache when querying CPL from kvm_sched_out(),
as filling the cache from IRQ context is generally unsafe; harden
the cache accessors to try to prevent similar issues from occuring
in the future. The issue that triggered this change was already
fixed in 6.12, but was still kinda latent.
- Advertise AMD_IBPB_RET to userspace, and fix a related bug where
KVM over-advertises SPEC_CTRL when trying to support cross-vendor
VMs.
- Minor cleanups
- Switch hugepage recovery thread to use vhost_task.
These kthreads can consume significant amounts of CPU time on
behalf of a VM or in response to how the VM behaves (for example
how it accesses its memory); therefore KVM tried to place the
thread in the VM's cgroups and charge the CPU time consumed by that
work to the VM's container.
However the kthreads did not process SIGSTOP/SIGCONT, and therefore
cgroups which had KVM instances inside could not complete freezing.
Fix this by replacing the kthread with a PF_USER_WORKER thread, via
the vhost_task abstraction. Another 100+ lines removed, with
generally better behavior too like having these threads properly
parented in the process tree.
- Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
didn't really work; there was really nothing to work around anyway:
the broken patch was meant to fix nested virtualization, but the
PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
erratum.
- Fix 6.12 regression where CONFIG_KVM will be built as a module even
if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is
'y'.
x86 selftests:
- x86 selftests can now use AVX.
Documentation:
- Use rST internal links
- Reorganize the introduction to the API document
Generic:
- Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock
instead of RCU, so that running a vCPU on a different task doesn't
encounter long due to having to wait for all CPUs become quiescent.
In general both reads and writes are rare, but userspace that
supports confidential computing is introducing the use of "helper"
vCPUs that may jump from one host processor to another. Those will
be very happy to trigger a synchronize_rcu(), and the effect on
performance is quite the disaster"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
KVM: x86: add back X86_LOCAL_APIC dependency
Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
KVM: x86: switch hugepage recovery thread to vhost_task
KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
Documentation: KVM: fix malformed table
irqchip/loongson-eiointc: Add virt extension support
LoongArch: KVM: Add irqfd support
LoongArch: KVM: Add PCHPIC user mode read and write functions
LoongArch: KVM: Add PCHPIC read and write functions
LoongArch: KVM: Add PCHPIC device support
LoongArch: KVM: Add EIOINTC user mode read and write functions
LoongArch: KVM: Add EIOINTC read and write functions
LoongArch: KVM: Add EIOINTC device support
LoongArch: KVM: Add IPI user mode read and write function
LoongArch: KVM: Add IPI read and write function
LoongArch: KVM: Add IPI device support
LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
KVM: arm64: Pass on SVE mapping failures
...
|
|
Pull 'struct fd' class updates from Al Viro:
"The bulk of struct fd memory safety stuff
Making sure that struct fd instances are destroyed in the same scope
where they'd been created, getting rid of reassignments and passing
them by reference, converting to CLASS(fd{,_pos,_raw}).
We are getting very close to having the memory safety of that stuff
trivial to verify"
* tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
deal with the last remaing boolean uses of fd_file()
css_set_fork(): switch to CLASS(fd_raw, ...)
memcg_write_event_control(): switch to CLASS(fd)
assorted variants of irqfd setup: convert to CLASS(fd)
do_pollfd(): convert to CLASS(fd)
convert do_select()
convert vfs_dedupe_file_range().
convert cifs_ioctl_copychunk()
convert media_request_get_by_fd()
convert spu_run(2)
switch spufs_calls_{get,put}() to CLASS() use
convert cachestat(2)
convert do_preadv()/do_pwritev()
fdget(), more trivial conversions
fdget(), trivial conversions
privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget()
o2hb_region_dev_store(): avoid goto around fdget()/fdput()
introduce "fd_pos" class, convert fdget_pos() users to it.
fdget_raw() users: switch to CLASS(fd_raw)
convert vmsplice() to CLASS(fd)
...
|
|
- Drop obsolete references to PPC970 KVM, which was removed 10 years ago.
- Fix incorrect references to non-existing ioctls
- List registers supported by KVM_GET/SET_ONE_REG on s390
- Use rST internal links
- Reorganize the introduction to the API document
|
|
Ensure that snp gctx page allocation is adequately deallocated on
failure during snp_launch_start.
Fixes: 136d8bc931c8 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command")
CC: Sean Christopherson <seanjc@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Ashish Kalra <ashish.kalra@amd.com>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: John Allen <john.allen@amd.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: "David S. Miller" <davem@davemloft.net>
CC: Michael Roth <michael.roth@amd.com>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Russ Weight <russ.weight@linux.dev>
CC: Danilo Krummrich <dakr@redhat.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: "Rafael J. Wysocki" <rafael@kernel.org>
CC: Tianfei zhang <tianfei.zhang@intel.com>
CC: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Dionna Glaze <dionnaglaze@google.com>
Message-ID: <20241105010558.1266699-2-dionnaglaze@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
If snp_guest_req_init() fails, return the provided error code up the
stack to userspace, e.g. so that userspace can log that KVM_SEV_INIT2
failed, as opposed to some random operation later in VM setup failing
because SNP wasn't actually enabled for the VM.
Note, KVM itself doesn't consult the return value from __sev_guest_init(),
i.e. the fallout is purely that userspace may be confused.
Fixes: 88caf544c930 ("KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202410192220.MeTyHPxI-lkp@intel.com
Link: https://lore.kernel.org/r/20241031203214.1585751-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
fdget() is the first thing done in scope, all matching fdput() are
immediately followed by leaving the scope.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Provide the "struct page" associated with a guest_memfd pfn as an output
from __kvm_gmem_get_pfn() so that KVM guest page fault handlers can
directly put the page instead of having to rely on
kvm_pfn_to_refcounted_page().
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-47-seanjc@google.com>
|
|
Now that all kvm_vcpu_{,un}map() users pass "true" for @dirty, have them
pass "true" as a @writable param to kvm_vcpu_map(), and thus create a
read-only mapping when possible.
Note, creating read-only mappings can be theoretically slower, as they
don't play nice with fast GUP due to the need to break CoW before mapping
the underlying PFN. But practically speaking, creating a mapping isn't
a super hot path, and getting a writable mapping for reading is weird and
confusing.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-34-seanjc@google.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' updates from Al Viro:
"Just the 'struct fd' layout change, with conversion to accessor
helpers"
* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
add struct fd constructors, get rid of __to_fd()
struct fd: representation change
introduce fd_file(), convert all accessors to it.
|
|
The copy_from_user() function returns the number of bytes which it
was not able to copy. Return -EFAULT instead.
Fixes: dee5a47cc7a4 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Message-ID: <20240612115040.2423290-4-dan.carpenter@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
If snp_lookup_rmpentry() fails then "assigned" is printed in the error
message but it was never initialized. Initialize it to false.
Fixes: dee5a47cc7a4 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Message-ID: <20240612115040.2423290-3-dan.carpenter@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
This check is currently performed by sev_gmem_post_populate(), but it
applies to all callers of kvm_gmem_populate(): the point of the function
is that the memory is being encrypted and some work has to be done
on all the gfns in order to encrypt them.
Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior
to invoking the callback, and stop the operation if a shared page
is encountered. Because CONFIG_KVM_PRIVATE_MEM in principle does
not require attributes, this makes kvm_gmem_populate() depend on
CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them).
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Do not allow populating the same page twice with startup data. In the
case of SEV-SNP, for example, the firmware does not allow it anyway,
since the launch-update operation is only possible on pages that are
still shared in the RMP.
Even if it worked, kvm_gmem_populate()'s callback is meant to have side
effects such as updating launch measurements, and updating the same
page twice is unlikely to have the desired results.
Races between calls to the ioctl are not possible because
kvm_gmem_populate() holds slots_lock and the VM should not be running.
But again, even if this worked on other confidential computing technology,
it doesn't matter to guest_memfd.c whether this is something fishy
such as missing synchronization in userspace, or rather something
intentional. One of the racers wins, and the page is initialized by
either kvm_gmem_prepare_folio() or kvm_gmem_populate().
Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use
the same errno that kvm_gmem_populate() is using.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM_PRE_FAULT_MEMORY for an SNP guest can race with
sev_gmem_post_populate() in bad ways. The following sequence for
instance can potentially trigger an RMP fault:
thread A, sev_gmem_post_populate: called
thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP
thread A, sev_gmem_post_populate: *vaddr = kmap_local_pfn(pfn + i);
thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE);
RMP #PF
Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's
initial private memory contents have been finalized via
KVM_SEV_SNP_LAUNCH_FINISH.
Beyond fixing this issue, it just sort of makes sense to enforce this,
since the KVM_PRE_FAULT_MEMORY documentation states:
"KVM maps memory as if the vCPU generated a stage-2 read page fault"
which sort of implies we should be acting on the same guest state that a
vCPU would see post-launch after the initial guest memory is all set up.
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The GHCB 2.0 specification defines 2 GHCB request types to allow SNP guests
to send encrypted messages/requests to firmware: SNP Guest Requests and SNP
Extended Guest Requests. These encrypted messages are used for things like
servicing attestation requests issued by the guest. Implementing support for
these is required to be fully GHCB-compliant.
For the most part, KVM only needs to handle forwarding these requests to
firmware (to be issued via the SNP_GUEST_REQUEST firmware command defined
in the SEV-SNP Firmware ABI), and then forwarding the encrypted response to
the guest.
However, in the case of SNP Extended Guest Requests, the host is also
able to provide the certificate data corresponding to the endorsement key
used by firmware to sign attestation report requests. This certificate data
is provided by userspace because:
1) It allows for different keys/key types to be used for each particular
guest with requiring any sort of KVM API to configure the certificate
table in advance on a per-guest basis.
2) It provides additional flexibility with how attestation requests might
be handled during live migration where the certificate data for
source/dest might be different.
3) It allows all synchronization between certificates and firmware/signing
key updates to be handled purely by userspace rather than requiring
some in-kernel mechanism to facilitate it. [1]
To support fetching certificate data from userspace, a new KVM exit type will
be needed to handle fetching the certificate from userspace. An attempt to
define a new KVM_EXIT_COCO/KVM_EXIT_COCO_REQ_CERTS exit type to handle this
was introduced in v1 of this patchset, but is still being discussed by
community, so for now this patchset only implements a stub version of SNP
Extended Guest Requests that does not provide certificate data, but is still
enough to provide compliance with the GHCB 2.0 spec.
|
|
Version 2 of GHCB specification added support for the SNP Extended Guest
Request Message NAE event. This event serves a nearly identical purpose
to the previously-added SNP_GUEST_REQUEST event, but for certain message
types it allows the guest to supply a buffer to be used for additional
information in some cases.
Currently the GHCB spec only defines extended handling of this sort in
the case of attestation requests, where the additional buffer is used to
supply a table of certificate data corresponding to the attestion
report's signing key. Support for this extended handling will require
additional KVM APIs to handle coordinating with userspace.
Whether or not the hypervisor opts to provide this certificate data is
optional. However, support for processing SNP_EXTENDED_GUEST_REQUEST
GHCB requests is required by the GHCB 2.0 specification for SNP guests,
so for now implement a stub implementation that provides an empty
certificate table to the guest if it supplies an additional buffer, but
otherwise behaves identically to SNP_GUEST_REQUEST.
Reviewed-by: Carlos Bilbao <carlos.bilbao.osdev@gmail.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240701223148.3798365-4-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Version 2 of GHCB specification added support for the SNP Guest Request
Message NAE event. The event allows for an SEV-SNP guest to make
requests to the SEV-SNP firmware through the hypervisor using the
SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification.
This is used by guests primarily to request attestation reports from
firmware. There are other request types are available as well, but the
specifics of what guest requests are being made generally does not
affect how they are handled by the hypervisor, which only serves as a
proxy for the guest requests and firmware responses.
Implement handling for these events.
When an SNP Guest Request is issued, the guest will provide its own
request/response pages, which could in theory be passed along directly
to firmware. However, these pages would need special care:
- Both pages are from shared guest memory, so they need to be
protected from migration/etc. occurring while firmware reads/writes
to them. At a minimum, this requires elevating the ref counts and
potentially needing an explicit pinning of the memory. This places
additional restrictions on what type of memory backends userspace
can use for shared guest memory since there would be some reliance
on using refcounted pages.
- The response page needs to be switched to Firmware-owned state
before the firmware can write to it, which can lead to potential
host RMP #PFs if the guest is misbehaved and hands the host a
guest page that KVM is writing to for other reasons (e.g. virtio
buffers).
Both of these issues can be avoided completely by using
separately-allocated bounce pages for both the request/response pages
and passing those to firmware instead. So that's the approach taken
here.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Co-developed-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
[mdr: ensure FW command failures are indicated to guest, drop extended
request handling to be re-written as separate patch, massage commit]
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240701223148.3798365-2-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM SVM changes for 6.11
- Make per-CPU save_area allocations NUMA-aware.
- Force sev_es_host_save_area() to be inlined to avoid calling into an
instrumentable function from noinstr code.
|
|
KVM x86 misc changes for 6.11
- Add a global struct to consolidate tracking of host values, e.g. EFER, and
move "shadow_phys_bits" into the structure as "maxphyaddr".
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
bus frequency, because TDX.
- Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
- Clean up KVM's handling of vendor specific emulation to consistently act on
"compatible with Intel/AMD", versus checking for a specific vendor.
- Misc cleanups
|
|
SEV-ES and thus SNP guest mandates LBR Virtualization to be _always_ ON.
Although commit b7e4be0a224f ("KVM: SEV-ES: Delegate LBR virtualization
to the processor") did the correct change for SEV-ES guests, it missed
the SNP. Fix it.
Reported-by: Srikanth Aithal <sraithal@amd.com>
Fixes: b7e4be0a224f ("KVM: SEV-ES: Delegate LBR virtualization to the processor")
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Message-ID: <20240605114810.1304-1-ravi.bangoria@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
save_area of per-CPU svm_data are dominantly accessed from their
own local CPUs, so allocate them node-local for performance reason
so rename __snp_safe_alloc_page as snp_safe_alloc_page_node which
accepts numa node id as input parameter, svm_cpu_init call it with
node id switched from cpu id
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240520120858.13117-4-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The allocation for the per-CPU save area in svm_cpu_init shouldn't
be accounted, So introduce __snp_safe_alloc_page helper, which has
gfp flag as input, svm_cpu_init calls __snp_safe_alloc_page with
GFP_KERNEL, snp_safe_alloc_page calls __snp_safe_alloc_page with
GFP_KERNEL_ACCOUNT as input
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240520120858.13117-3-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The input parameter 'vcpu' in snp_safe_alloc_page is not used.
Therefore, remove it.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240520120858.13117-2-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pull base x86 KVM support for running SEV-SNP guests from Michael Roth:
* add some basic infrastructure and introduces a new KVM_X86_SNP_VM
vm_type to handle differences versus the existing KVM_X86_SEV_VM and
KVM_X86_SEV_ES_VM types.
* implement the KVM API to handle the creation of a cryptographic
launch context, encrypt/measure the initial image into guest memory,
and finalize it before launching it.
* implement handling for various guest-generated events such as page
state changes, onlining of additional vCPUs, etc.
* implement the gmem/mmu hooks needed to prepare gmem-allocated pages
before mapping them into guest private memory ranges as well as
cleaning them up prior to returning them to the host for use as
normal memory. Because those cleanup hooks supplant certain
activities like issuing WBINVDs during KVM MMU invalidations, avoid
duplicating that work to avoid unecessary overhead.
This merge leaves out support support for attestation guest requests
and for loading the signing keys to be used for attestation requests.
|
|
As documented in APM[1], LBR Virtualization must be enabled for SEV-ES
guests. Although KVM currently enforces LBRV for SEV-ES guests, there
are multiple issues with it:
o MSR_IA32_DEBUGCTLMSR is still intercepted. Since MSR_IA32_DEBUGCTLMSR
interception is used to dynamically toggle LBRV for performance reasons,
this can be fatal for SEV-ES guests. For ex SEV-ES guest on Zen3:
[guest ~]# wrmsr 0x1d9 0x4
KVM: entry failed, hardware error 0xffffffff
EAX=00000004 EBX=00000000 ECX=000001d9 EDX=00000000
Fix this by never intercepting MSR_IA32_DEBUGCTLMSR for SEV-ES guests.
No additional save/restore logic is required since MSR_IA32_DEBUGCTLMSR
is of swap type A.
o KVM will disable LBRV if userspace sets MSR_IA32_DEBUGCTLMSR before the
VMSA is encrypted. Fix this by moving LBRV enablement code post VMSA
encryption.
[1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
2023, Vol 2, 15.35.2 Enabling SEV-ES.
https://bugzilla.kernel.org/attachment.cgi?id=304653
Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Co-developed-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Message-ID: <20240531044644.768-4-ravi.bangoria@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As documented in APM[1], LBR Virtualization must be enabled for SEV-ES
guests. So, prevent SEV-ES guests when LBRV support is missing.
[1]: AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
2023, Vol 2, 15.35.2 Enabling SEV-ES.
https://bugzilla.kernel.org/attachment.cgi?id=304653
Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Message-ID: <20240531044644.768-3-ravi.bangoria@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
All SNP APs are initially started using the APIC INIT/SIPI sequence in
the guest. This sequence moves the AP MP state from
KVM_MP_STATE_UNINITIALIZED to KVM_MP_STATE_RUNNABLE, so there is no need
to attempt the UNBLOCK.
As it is, the UNBLOCK support in SVM is only enabled when AVIC is
enabled. When AVIC is disabled, AP creation is still successful.
Remove the KVM_REQ_UNBLOCK request from the AP creation code and revert
the changes to the vcpu_unblocking() kvm_x86_ops path.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The hook only handles cleanup work specific to SNP, e.g. RMP table
entries and flushing caches for encrypted guest memory. When run on a
non-SNP-enabled host (currently only possible using
KVM_X86_SW_PROTECTED_VM, e.g. via KVM selftests), the callback is a noop
and will WARN due to the RMP table not being present. It's actually
expected in this case that the RMP table wouldn't be present and that
the hook should be a noop, so drop the WARN_ONCE().
Reported-by: Sean Christopherson <seanjc@google.com>
Closes: https://lore.kernel.org/kvm/ZkU3_y0UoPk5yAeK@google.com/
Fixes: 8eb01900b018 ("KVM: SEV: Implement gmem hook for invalidating private pages")
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Currently there's a consistent pattern of always calling
host_rmp_make_shared() immediately after snp_page_reclaim(), so go ahead
and handle it automatically as part of snp_page_reclaim(). Also rename
it to kvm_rmp_make_shared() to more easily distinguish it as a
KVM-specific variant of the more generic rmp_make_shared() helper.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use KVM's snapshot of the host's XCR0 when stuffing SEV-ES host state
instead of reading XCR0 from hardware. XCR0 is only written during
boot, i.e. won't change while KVM is running (and KVM at large is hosed
if that doesn't hold true).
Link: https://lore.kernel.org/r/20240423221521.2923759-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add "struct kvm_host_values kvm_host" to hold the various host values
that KVM snapshots during initialization. Bundling the host values into
a single struct simplifies adding new MSRs and other features with host
state/values that KVM cares about, and provides a one-stop shop. E.g.
adding a new value requires one line, whereas tracking each value
individual often requires three: declaration, definition, and export.
No functional change intended.
Link: https://lore.kernel.org/r/20240423221521.2923759-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a module parameter than can be used to enable or disable the SEV-SNP
feature. Now that KVM contains the support for the SNP set the GHCB
hypervisor feature flag to indicate that SNP is supported.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-18-michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
With SNP/guest_memfd, private/encrypted memory should not be mappable,
and MMU notifications for HVA-mapped memory will only be relevant to
unencrypted guest memory. Therefore, the rationale behind issuing a
wbinvd_on_all_cpus() in sev_guest_memory_reclaimed() should not apply
for SNP guests and can be ignored.
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
[mdr: Add some clarifications in commit]
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501085210.2213060-17-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
In the case of SEV-SNP, whether or not a 2MB page can be mapped via a
2MB mapping in the guest's nested page table depends on whether or not
any subpages within the range have already been initialized as private
in the RMP table. The existing mixed-attribute tracking in KVM is
insufficient here, for instance:
- gmem allocates 2MB page
- guest issues PVALIDATE on 2MB page
- guest later converts a subpage to shared
- SNP host code issues PSMASH to split 2MB RMP mapping to 4K
- KVM MMU splits NPT mapping to 4K
- guest later converts that shared page back to private
At this point there are no mixed attributes, and KVM would normally
allow for 2MB NPT mappings again, but this is actually not allowed
because the RMP table mappings are 4K and cannot be promoted on the
hypervisor side, so the NPT mappings must still be limited to 4K to
match this.
Implement a kvm_x86_ops.private_max_mapping_level() hook for SEV that
checks for this condition and adjusts the mapping level accordingly.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501085210.2213060-16-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Implement a platform hook to do the work of restoring the direct map
entries of gmem-managed pages and transitioning the corresponding RMP
table entries back to the default shared/hypervisor-owned state.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501085210.2213060-15-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
This will handle the RMP table updates needed to put a page into a
private state before mapping it into an SEV-SNP guest.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-ID: <20240501085210.2213060-14-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP
guests to alter the register state of the APs on their own. This allows
the guest a way of simulating INIT-SIPI.
A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used
so as to avoid updating the VMSA pointer while the vCPU is running.
For CREATE
The guest supplies the GPA of the VMSA to be used for the vCPU with
the specified APIC ID. The GPA is saved in the svm struct of the
target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added
to the vCPU and then the vCPU is kicked.
For CREATE_ON_INIT:
The guest supplies the GPA of the VMSA to be used for the vCPU with
the specified APIC ID the next time an INIT is performed. The GPA is
saved in the svm struct of the target vCPU.
For DESTROY:
The guest indicates it wishes to stop the vCPU. The GPA is cleared
from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is
added to vCPU and then the vCPU is kicked.
The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked
as a result of the event or as a result of an INIT. If a new VMSA is to
be installed, the VMSA guest page is set as the VMSA in the vCPU VMCB
and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new VMSA is not
to be installed, the VMSA is cleared in the vCPU VMCB and the vCPU state
is set to KVM_MP_STATE_HALTED to prevent it from being run.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-13-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When SEV-SNP is enabled in the guest, the hardware places restrictions
on all memory accesses based on the contents of the RMP table. When
hardware encounters RMP check failure caused by the guest memory access
it raises the #NPF. The error code contains additional information on
the access type. See the APM volume 2 for additional information.
When using gmem, RMP faults resulting from mismatches between the state
in the RMP table vs. what the guest expects via its page table result
in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This
means the only expected case that needs to be handled in the kernel is
when the page size of the entry in the RMP table is larger than the
mapping in the nested page table, in which case a PSMASH instruction
needs to be issued to split the large RMP entry into individual 4K
entries so that subsequent accesses can succeed.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-ID: <20240501085210.2213060-12-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|