path: root/arch/x86/kvm/mmu
2022-06-24KVM: x86/mmu: Update page stats in __rmap_add()David Matlack
Update the page stats in __rmap_add() rather than at the call site. This will avoid having to manually update page stats when splitting huge pages in a subsequent commit. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-17-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpuDavid Matlack
Allow adding new entries to the rmap and linking shadow pages without a struct kvm_vcpu pointer by moving the implementation of rmap_add() and link_shadow_page() into inner helper functions. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-16-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Pass const memslot to rmap_add()David Matlack
Constify rmap_add()'s @slot parameter; it is simply passed on to gfn_to_rmap(), which takes a const memslot. No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-15-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()David Matlack
Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync indirect shadow pages, so it's safe to pass in NULL when looking up direct shadow pages. This will be used for doing eager page splitting, which allocates direct shadow pages from the context of a VM ioctl without access to a vCPU pointer. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-14-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
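A minimal sketch of how the lookup tolerates a NULL vCPU, assuming the sync path is the only consumer of @vcpu (illustrative, not the verbatim diff):

    /* Inside kvm_mmu_find_shadow_page(), when a matching unsync page is found: */
    if (sp->unsync) {
            /*
             * @vcpu is only needed to sync indirect shadow pages; direct
             * lookups (e.g. for eager page splitting) may pass NULL.
             */
            if (KVM_BUG_ON(!vcpu, kvm))
                    break;
            /* ... existing sync logic, unchanged ... */
    }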
2022-06-24KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page()David Matlack
Get the kvm pointer from the caller, rather than deriving it from vcpu->kvm, and plumb the kvm pointer all the way from kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer is only needed to sync indirect shadow pages. In other words, __kvm_mmu_get_shadow_page() can now be used to get *direct* shadow pages without a vcpu pointer. This enables eager page splitting, which needs to allocate direct shadow pages during VM ioctls. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-13-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()David Matlack
The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-12-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Pass memory caches to allocate SPs separatelyDavid Matlack
Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it will allocate the various pieces of memory for shadow pages as a parameter, rather than deriving them from the vcpu pointer. This will be useful in a future commit where shadow pages are allocated during VM ioctls for eager page splitting, and thus will use a different set of caches. Preemptively pull the caches out all the way to kvm_mmu_get_shadow_page() since eager page splitting will not be calling kvm_mmu_alloc_shadow_page() directly. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-11-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
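A hypothetical sketch of the caches bundle and the resulting allocation signature (the struct and member names are assumptions for illustration):

    struct shadow_page_caches {
            struct kvm_mmu_memory_cache *page_header_cache;
            struct kvm_mmu_memory_cache *shadow_page_cache;
            struct kvm_mmu_memory_cache *shadowed_info_cache;
    };

    static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm_vcpu *vcpu,
                                                           struct shadow_page_caches *caches,
                                                           gfn_t gfn,
                                                           struct hlist_head *sp_list,
                                                           union kvm_mmu_page_role role);

Eager page splitting can then fill the same bundle from VM-ioctl-owned caches instead of the vCPU's caches.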
2022-06-24KVM: x86/mmu: Move guest PT write-protection to account_shadowed()David Matlack
Move the code that write-protects newly-shadowed guest page tables into account_shadowed(). This avoids an extra gfn-to-memslot lookup and is a more logical place for this code to live. But most importantly, this reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct kvm_vcpu pointer, which will be necessary when creating new shadow pages during VM ioctls for eager page splitting. Note, it is safe to drop the role.level == PG_LEVEL_4K check since account_shadowed() returns early if role.level > PG_LEVEL_4K. No functional change intended. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-10-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pagesDavid Matlack
Rename 2 functions: kvm_mmu_get_page() -> kvm_mmu_get_shadow_page() kvm_mmu_free_page() -> kvm_mmu_free_shadow_page() This change makes it clear that these functions deal with shadow pages rather than struct pages. It also aligns these functions with the naming scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page(). Prefer "shadow_page" over the shorter "sp" since these are core functions and the line lengths aren't terrible. No functional change intended. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-9-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Consolidate shadow page allocation and initializationDavid Matlack
Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under the latter so that all shadow page allocation and initialization happens in one place. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-8-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functionsDavid Matlack
Decompose kvm_mmu_get_page() into separate helper functions to increase readability and prepare for allocating shadow pages without a vcpu pointer. Specifically, pull the guts of kvm_mmu_get_page() into 2 helper functions: kvm_mmu_find_shadow_page() - Walks the page hash checking for any existing mmu pages that match the given gfn and role. kvm_mmu_alloc_shadow_page() - Allocates and initializes an entirely new kvm_mmu_page. This currently requires a vcpu pointer for allocation and looking up the memslot but that will be removed in a future commit. No functional change intended. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
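A simplified sketch of the decomposed flow, assuming the hash-list lookup and tracepoint stay in the wrapper (not the verbatim patch):

    static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gfn_t gfn,
                                                 union kvm_mmu_page_role role)
    {
            struct hlist_head *sp_list;
            struct kvm_mmu_page *sp;
            bool created = false;

            sp_list = &vcpu->kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];

            /* Reuse an existing shadow page if one matches gfn + role. */
            sp = kvm_mmu_find_shadow_page(vcpu, gfn, sp_list, role);
            if (!sp) {
                    created = true;
                    sp = kvm_mmu_alloc_shadow_page(vcpu, gfn, sp_list, role);
            }

            trace_kvm_mmu_get_page(sp, created);
            return sp;
    }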
2022-06-24KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytesDavid Matlack
The quadrant is only used when gptes are 4 bytes, but mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE page directories regardless. Make this less confusing by only passing in a non-zero quadrant when it is actually necessary. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-6-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Derive shadow MMU page role from parentDavid Matlack
Instead of computing the shadow page role from scratch for every new page, derive most of the information from the parent shadow page. This eliminates the dependency on the vCPU root role to allocate shadow page tables, and reduces the number of parameters to kvm_mmu_get_page(). Preemptively split out the role calculation to a separate function for use in a following commit. Note that when calculating the MMU root role, we can take @role.passthrough, @role.direct, and @role.access directly from @vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still must be overridden for PAE page directories, when shadowing 32-bit guest page tables with PAE page tables. No functional change intended. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
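A rough sketch of deriving the child role from the parent shadow page; the quadrant computation for 4-byte gptes is simplified and assumed:

    static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
                                                      unsigned int access)
    {
            struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
            union kvm_mmu_page_role role = parent_sp->role;

            role.level--;
            role.access = access;
            role.direct = direct;
            role.passthrough = 0;

            /*
             * When shadowing 32-bit (4-byte gpte) guest paging with PAE,
             * the quadrant records which part of the guest table this
             * shadow page maps (simplified here).
             */
            if (role.has_4_byte_gpte)
                    role.quadrant = (sptep - parent_sp->spt) & 1;

            return role;
    }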
2022-06-24KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root()David Matlack
The "direct" argument is vcpu->arch.mmu->root_role.direct, because unlike non-root page tables, it's impossible to have a direct root in an indirect MMU. So just use that. Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Use a bool for directDavid Matlack
The parameter "direct" can either be true or false, and all of the callers pass in a bool variable or true/false literal, so just use the type bool. No functional change intended. Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPsDavid Matlack
Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for fully direct MMUs") skipped the unsync checks and write flood clearing for full direct MMUs. We can extend this further to skip the checks for all direct shadow pages. Direct shadow pages in indirect MMUs (i.e. shadow paging) are used when shadowing a guest huge page with smaller pages. Such direct shadow pages, like their counterparts in fully direct MMUs, are never marked unsynced or have a non-zero write-flooding count. Checking sp->role.direct also generates better code than checking direct_map because, due to register pressure, direct_map has to get shoved onto the stack and then pulled back off. No functional change intended. Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basisBen Gardon
In some cases, the NX hugepage mitigation for iTLB multihit is not needed for all guests on a host. Allow disabling the mitigation on a per-VM basis to avoid the performance hit of NX hugepages on trusted workloads. In order to disable NX hugepages on a VM, ensure that the userspace actor has permission to reboot the system. Since disabling NX hugepages would allow a guest to crash the system, it is similar to reboot permissions. Ideally, KVM would require userspace to prove it has access to KVM's nx_huge_pages module param, e.g. so that userspace can opt out without needing full reboot permissions. But getting access to the module param file info is difficult because it is buried in layers of sysfs and module glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases. Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20220613212523.3436117-9-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
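A hedged sketch of the permission check in the capability handler; the exact ioctl plumbing, ordering constraints, and the arch field name are assumptions:

    /* In kvm_vm_ioctl_enable_cap(), roughly: */
    case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
            r = -EINVAL;
            if (cap->args[0])
                    break;
            /*
             * Disabling the mitigation lets the guest crash the host, so
             * require the same privilege as rebooting the system.
             */
            r = -EPERM;
            if (!capable(CAP_SYS_BOOT))
                    break;
            kvm->arch.disable_nx_huge_pages = true;
            r = 0;
            break;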
2022-06-20KVM: x86/mmu: Shove refcounted page dependency into host_pfn_mapping_level()Sean Christopherson
Move the check that restricts mapping huge pages into the guest to pfns that are backed by refcounted 'struct page' memory into the helper that actually "requires" a 'struct page', host_pfn_mapping_level(). In addition to deduplicating code, moving the check to the helper eliminates the subtle requirement that the caller check that the incoming pfn is backed by a refcounted struct page, and as an added bonus avoids an extra pfn_to_page() lookup. Note, the is_error_noslot_pfn() check in kvm_mmu_hugepage_adjust() needs to stay where it is, as it guards against dereferencing a NULL memslot in the kvm_slot_dirty_track_enabled() that follows. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()Sean Christopherson
Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() to better reflect what KVM is actually checking, and to eliminate extra pfn_to_page() lookups. The kvm_release_pfn_*() and kvm_try_get_pfn() helpers in particular benefit from "refcounted" nomenclature, as it's not all that obvious why KVM needs to get/put refcounts for some PG_reserved pages (ZERO_PAGE and ZONE_DEVICE). Add a comment to call out that the list of exceptions to PG_reserved is all but guaranteed to be incomplete. The list has mostly been compiled by people throwing noodles at KVM and finding out they stick a little too well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages didn't get freed. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: Take a 'struct page', not a pfn in kvm_is_zone_device_page()Sean Christopherson
Operate on a 'struct page' instead of a pfn when checking if a page is a ZONE_DEVICE page, and rename the helper accordingly. Generally speaking, KVM doesn't actually care about ZONE_DEVICE memory, i.e. shouldn't do anything special for ZONE_DEVICE memory. Rather, KVM wants to treat ZONE_DEVICE memory like regular memory, and the need to identify ZONE_DEVICE memory only arises as an exception to PG_reserved pages. In other words, KVM should only ever check for ZONE_DEVICE memory after KVM has already verified that there is a struct page associated with the pfn. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: x86/mmu: Use common logic for computing the 32/64-bit base PA maskSean Christopherson
Use common logic for computing PT_BASE_ADDR_MASK for 32-bit, 64-bit, and EPT paging. Both PAGE_MASK and the new common logic are supersets of what is actually needed for 32-bit paging. PAGE_MASK sets bits 63:12 and the former GUEST_PT64_BASE_ADDR_MASK sets bits 51:12, so regardless of which value is used, the result will always be bits 31:12. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: x86/mmu: Truncate paging32's PT_BASE_ADDR_MASK to 32 bitsSean Christopherson
Truncate paging32's PT_BASE_ADDR_MASK to a pt_element_t, i.e. to 32 bits. Ignoring PSE huge pages, the mask is only used in conjunction with gPTEs, which are 32 bits, and so the address is limited to bits 31:12. PSE huge pages encoded PA bits 39:32 in PTE bits 20:13, i.e. need custom logic to handle their funky encoding regardless of PT_BASE_ADDR_MASK. Note, PT_LVL_OFFSET_MASK is somewhat confusing in that it computes the offset of the _gfn_, not of the gpa, i.e. not having bits 63:32 set in PT_BASE_ADDR_MASK is again correct. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: x86/mmu: Use common macros to compute 32/64-bit paging masksPaolo Bonzini
Dedup the code for generating (most of) the per-type PT_* masks in paging_tmpl.h. The relevant macros only vary based on the number of bits per level, and that smidge of info is already provided in a common form as PT_LEVEL_BITS. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEsSean Christopherson
Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs (PT64). SPTE and PT64 are _mostly_ the same, but the few differences are quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is very much specific to SPTEs. Opportunistically (and temporarily) move most guest macros into paging.h to clearly associate them with shadow paging, and to ensure that they're not used as of this commit. A future patch will eliminate them entirely. Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's needed for the quadrant calculation in kvm_mmu_get_page(). The quadrant calculation is hot enough (when using shadow paging with 32-bit guests) that adding a per-context helper is undesirable, and burying the computation in paging_tmpl.h with a forward declaration isn't exactly an improvement. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: x86/mmu: Dedup macros for computing various page table masksSean Christopherson
Provide common helper macros to generate various masks, shifts, etc... for 32-bit vs. 64-bit page tables. Only the inputs differ, the actual calculations are identical. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
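An illustrative form of the shared macros, parameterized only by bits-per-level (the names follow the commit's intent but are shown here as an assumption):

    #define __PT_LEVEL_SHIFT(level, bits_per_level) \
            (PAGE_SHIFT + ((level) - 1) * (bits_per_level))

    #define __PT_INDEX(address, level, bits_per_level) \
            (((address) >> __PT_LEVEL_SHIFT(level, bits_per_level)) & \
             ((1 << (bits_per_level)) - 1))

    #define __PT_LVL_ADDR_MASK(base_addr_mask, level, bits_per_level) \
            ((base_addr_mask) & \
             ~((1ULL << (PAGE_SHIFT + (((level) - 1) * (bits_per_level)))) - 1))

32-bit and 64-bit paging then instantiate these with 10 vs. 9 bits per level.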
2022-06-20KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.hSean Christopherson
Move a handful of one-off macros and helpers for 32-bit PSE paging into paging_tmpl.h and hide them behind "PTTYPE == 32". Under no circumstance should anything but 32-bit shadow paging care about PSE paging. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-15KVM: x86/mmu: Use try_cmpxchg64 in fast_pf_fix_direct_spteUros Bizjak
Use try_cmpxchg64() instead of "cmpxchg64(*ptr, old, new) != old" in fast_pf_fix_direct_spte(). cmpxchg returns success in the ZF flag, so this change saves a compare after cmpxchg (and the related move instruction in front of cmpxchg). Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Message-Id: <20220520144635.63134-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
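Roughly, the transformation looks like this (sketch; the surrounding variable names are assumed from context):

    /* Before: compare the returned value against the expected old value. */
    if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
            return false;

    /*
     * After: try_cmpxchg64() reports success directly (via ZF) and writes
     * the observed value back into old_spte on failure.
     */
    if (!try_cmpxchg64(sptep, &old_spte, new_spte))
            return false;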
2022-06-15KVM: x86/mmu: Use try_cmpxchg64 in tdp_mmu_set_spte_atomicUros Bizjak
Use try_cmpxchg64() instead of "cmpxchg64(*ptr, old, new) != old" in tdp_mmu_set_spte_atomic(). cmpxchg returns success in the ZF flag, so this change saves a compare after cmpxchg (and the related move instruction in front of cmpxchg). Also, remove the explicit assignment to iter->old_spte when cmpxchg fails; this is what try_cmpxchg does implicitly. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: David Matlack <dmatlack@google.com> Message-Id: <20220518135111.3535-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-15KVM: x86/mmu: Drop unused CMPXCHG macro from paging_tmpl.hSean Christopherson
Drop the CMPXCHG macro from paging_tmpl.h, it's no longer used now that KVM uses a common uaccess helper to do 8-byte CMPXCHG. Fixes: f122dfe44768 ("KVM: x86: Use __try_cmpxchg_user() to update guest PTE A/D bits") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220613225723.2734132-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-15KVM: X86/MMU: Remove useless mmu_topup_memory_caches() in kvm_mmu_pte_write()Lai Jiangshan
Since commit c5e2184d1544 ("KVM: x86/mmu: Remove the defunct update_pte() paging hook"), kvm_mmu_pte_write() no longer uses the rmap cache. So remove mmu_topup_memory_caches() from it. Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220605063417.308311-6-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-15KVM: X86/MMU: Remove unused PT32_DIR_BASE_ADDR_MASK from mmu.cLai Jiangshan
It is unused. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220605063417.308311-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09Merge branch 'kvm-5.20-early'Paolo Bonzini
s390: * add an interface to provide a hypervisor dump for secure guests * improve selftests to show tests x86: * Intel IPI virtualization * Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS * PEBS virtualization * Simplify PMU emulation by just using PERF_TYPE_RAW events * More accurate event reinjection on SVM (avoid retrying instructions) * Allow getting/setting the state of the speaker port data bit * Rewrite gfn-pfn cache refresh * Refuse starting the module if VM-Entry/VM-Exit controls are inconsistent * "Notify" VM exit
2022-06-09KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRsYuan Yao
Assign shadow_me_value, not shadow_me_mask, to PAE root entries, a.k.a. shadow PDPTRs, when host memory encryption is supported. The "mask" is the set of all possible memory encryption bits, e.g. MKTME KeyIDs, whereas "value" holds the actual value that needs to be stuffed into host page tables. Using shadow_me_mask results in a failed VM-Entry due to setting reserved PA bits in the PDPTRs, and ultimately causes an OOPS due to physical addresses with non-zero MKTME bits sending to_shadow_page() into the weeds: set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state. BUG: unable to handle page fault for address: ffd43f00063049e8 PGD 86dfd8067 P4D 0 Oops: 0000 [#1] PREEMPT SMP RIP: 0010:mmu_free_root_page+0x3c/0x90 [kvm] kvm_mmu_free_roots+0xd1/0x200 [kvm] __kvm_mmu_unload+0x29/0x70 [kvm] kvm_mmu_unload+0x13/0x20 [kvm] kvm_arch_destroy_vm+0x8a/0x190 [kvm] kvm_put_kvm+0x197/0x2d0 [kvm] kvm_vm_release+0x21/0x30 [kvm] __fput+0x8e/0x260 ____fput+0xe/0x10 task_work_run+0x6f/0xb0 do_exit+0x327/0xa90 do_group_exit+0x35/0xa0 get_signal+0x911/0x930 arch_do_signal_or_restart+0x37/0x720 exit_to_user_mode_prepare+0xb2/0x140 syscall_exit_to_user_mode+0x16/0x30 do_syscall_64+0x4e/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: e54f1ff244ac ("KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask") Signed-off-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20220608012015.19566-1-yuan.yao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09Merge tag 'kvm-riscv-fixes-5.19-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini
KVM/riscv fixes for 5.19, take #1 - Typo fix in arch/riscv/kvm/vmid.c - Remove broken reference pattern from MAINTAINERS entry
2022-06-08KVM: x86/mmu: Comment FNAME(sync_page) to document TLB flushing logicSean Christopherson
Add a comment to FNAME(sync_page) to explain why the TLB flushing logic conspicuously doesn't handle the scenario of guest protections being reduced. Specifically, if synchronizing a SPTE drops execute protections, KVM will not emit a TLB flush, whereas dropping writable or clearing A/D bits does trigger a flush via mmu_spte_update(). Architecturally, until the GPTE is implicitly or explicitly flushed from the guest's perspective, KVM is not required to flush any old, stale translations. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20220513195000.99371-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08KVM: x86/mmu: Drop RWX=0 SPTEs during ept_sync_page()Sean Christopherson
All of sync_page()'s existing checks filter out only !PRESENT gPTE, because without execute-only, all upper levels are guaranteed to be at least READABLE. However, if EPT with execute-only support is in use by L1, KVM can create an SPTE that is shadow-present but guest-inaccessible (RWX=0) if the upper level combined permissions are R (or RW) and the leaf EPTE is changed from R (or RW) to X. Because the EPTE is considered present when viewed in isolation, and no reserved bits are set, FNAME(prefetch_invalid_gpte) will consider the GPTE valid, and cause a not-present SPTE to be created. The SPTE is "correct": the guest translation is inaccessible because the combined protections of all levels yield RWX=0, and KVM will just redirect any vmexits to the guest. If EPT A/D bits are disabled, KVM can mistake the SPTE for an access-tracked SPTE, but again such confusion isn't fatal, as the "saved" protections are also RWX=0. However, creating a useless SPTE in general means that KVM messed up something, even if this particular goof didn't manifest as a functional bug. So, drop SPTEs whose new protections will yield a RWX=0 SPTE, and add a WARN in make_spte() to detect creation of SPTEs that will result in RWX=0 protections. Fixes: d95c55687e11 ("kvm: mmu: track read permission explicitly for shadow EPT page tables") Cc: David Matlack <dmatlack@google.com> Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220513195000.99371-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07Merge branch 'kvm-5.19-early-fixes' into HEADPaolo Bonzini
2022-06-07KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty loggingBen Gardon
Currently disabling dirty logging with the TDP MMU is extremely slow. On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with the shadow MMU. When disabling dirty logging, zap non-leaf parent entries to allow replacement with huge pages instead of recursing and zapping all of the child, leaf entries. This reduces the number of TLB flushes required, and reduces the disable dirty log time with the TDP MMU to ~3 seconds. Opportunistically add a WARN() to catch GFNs that are mapped at a higher level than their max level. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20220525230904.1584480-1-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()Shaoqin Huang
When freeing obsolete previous roots, check prev_roots as intended, not the current root. Signed-off-by: Shaoqin Huang <shaoqin.huang@intel.com> Fixes: 527d5cd7eece ("KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped") Message-Id: <20220607005905.2933378-1-shaoqin.huang@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
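The corrected loop, in simplified form (a sketch based on the description above, not the exact diff):

    for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
            /* Check each cached previous root, not the current root. */
            if (is_obsolete_root(kvm, mmu->prev_roots[i].hpa))
                    roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
    }

    kvm_mmu_free_roots(kvm, mmu, roots_to_free);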
2022-05-26Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "S390: - ultravisor communication device driver - fix TEID on terminating storage key ops RISC-V: - Added Sv57x4 support for G-stage page table - Added range based local HFENCE functions - Added remote HFENCE functions based on VCPU requests - Added ISA extension registers in ONE_REG interface - Updated KVM RISC-V maintainers entry to cover selftests support ARM: - Add support for the ARMv8.6 WFxT extension - Guard pages for the EL2 stacks - Trap and emulate AArch32 ID registers to hide unsupported features - Ability to select and save/restore the set of hypercalls exposed to the guest - Support for PSCI-initiated suspend in collaboration with userspace - GICv3 register-based LPI invalidation support - Move host PMU event merging into the vcpu data structure - GICv3 ITS save/restore fixes - The usual set of small-scale cleanups and fixes x86: - New ioctls to get/set TSC frequency for a whole VM - Allow userspace to opt out of hypercall patching - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr AMD SEV improvements: - Add KVM_EXIT_SHUTDOWN metadata for SEV-ES - V_TSC_AUX support Nested virtualization improvements for AMD: - Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF) - Allow AVIC to co-exist with a nested guest running - Fixes for LBR virtualizations when a nested guest is running, and nested LBR virtualization support - PAUSE filtering for nested hypervisors Guest support: - Decoupling of vcpu_is_preempted from PV spinlocks" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (199 commits) KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest KVM: selftests: x86: Sync the new name of the test case to .gitignore Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND x86, kvm: use correct GFP flags for preemption disabled KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer x86/kvm: Alloc dummy async #PF token outside of raw spinlock KVM: x86: avoid calling x86 emulator without a decoded instruction KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave) s390/uv_uapi: depend on CONFIG_S390 KVM: selftests: x86: Fix test failure on arch lbr capable platforms KVM: LAPIC: Trace LAPIC timer expiration on every vmentry KVM: s390: selftest: Test suppression indication on key prot exception KVM: s390: Don't indicate suppression on dirtying, failing memop selftests: drivers/s390x: Add uvdevice tests drivers/s390/char: Add Ultravisor io device MAINTAINERS: Update KVM RISC-V entry to cover selftests support RISC-V: KVM: Introduce ISA extension register RISC-V: KVM: Cleanup stale TLB entries when host CPU changes RISC-V: KVM: Add remote HFENCE functions based on VCPU requests ...
2022-05-20KVM: x86/mmu: fix NULL pointer dereference on guest INVPCIDPaolo Bonzini
With shadow paging enabled, the INVPCID instruction results in a call to kvm_mmu_invpcid_gva. If INVPCID is executed with CR0.PG=0, the invlpg callback is not set and the result is a NULL pointer dereference. Fix it trivially by checking for mmu->invlpg before every call. There are other possibilities: - check for CR0.PG, because KVM (like all Intel processors after P5) flushes guest TLB on CR0.PG changes so that INVPCID/INVLPG are a nop with paging disabled - check for EFER.LMA, because KVM syncs and flushes when switching MMU contexts outside of 64-bit mode All of these are tricky, go for the simple solution. This is CVE-2022-1789. Reported-by: Yongkang Jia <kangel@zju.edu.cn> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
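A minimal sketch of the guard, assuming the surrounding logic in kvm_mmu_invpcid_gva() is otherwise unchanged:

    if (pcid == kvm_get_active_pcid(vcpu)) {
            /* With CR0.PG=0 the invlpg hook is not populated. */
            if (mmu->invlpg)
                    mmu->invlpg(vcpu, gva, mmu->root.hpa);
            tlb_flush = true;
    }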
2022-05-12KVM: x86/mmu: Update number of zapped pages even if page list is stableSean Christopherson
When zapping obsolete pages, update the running count of zapped pages regardless of whether or not the list has become unstable due to zapping a shadow page with its own child shadow pages. If the VM is backed by mostly 4kb pages, KVM can zap an absurd number of SPTEs without bumping the batch count and thus without yielding. In the worst case scenario, this can cause a soft lockup. watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [dirty_log_perf_:13020] RIP: 0010:workingset_activation+0x19/0x130 mark_page_accessed+0x266/0x2e0 kvm_set_pfn_accessed+0x31/0x40 mmu_spte_clear_track_bits+0x136/0x1c0 drop_spte+0x1a/0xc0 mmu_page_zap_pte+0xef/0x120 __kvm_mmu_prepare_zap_page+0x205/0x5e0 kvm_mmu_zap_all_fast+0xd7/0x190 kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10 kvm_page_track_flush_slot+0x5c/0x80 kvm_arch_flush_shadow_memslot+0xe/0x10 kvm_set_memslot+0x1a8/0x5d0 __kvm_set_memory_region+0x337/0x590 kvm_vm_ioctl+0xb08/0x1040 Fixes: fbb158cb88b6 ("KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""") Reported-by: David Matlack <dmatlack@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220511145122.3133334-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Speed up slot_rmap_walk_next for sparsely populated rmapsVipin Sharma
Avoid calling handlers on empty rmap entries and skip to the next non-empty rmap entry. Empty rmap entries are a no-op in handlers. Signed-off-by: Vipin Sharma <vipinsh@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220502220347.174664-1-vipinsh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
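A rough sketch of the iterator change, assuming the walker advances one rmap head at a time (not the exact code):

    /*
     * In slot_rmap_walk_next(): skip heads with no entries instead of
     * handing them to the handler.
     */
    while (++iterator->rmap <= iterator->end_rmap) {
            iterator->gfn += KVM_PAGES_PER_HPAGE(iterator->level);

            if (iterator->rmap->val)
                    return;
    }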
2022-05-12KVM: VMX: Include MKTME KeyID bits in shadow_zero_checkKai Huang
Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should never be set in SPTEs. Set shadow_me_value to 0 and shadow_me_mask to all MKTME KeyID bits so that those bits are included in shadow_zero_check. Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_maskKai Huang
Intel Multi-Key Total Memory Encryption (MKTME) repurposes a couple of the high physical address bits as 'KeyID' bits. Intel Trust Domain Extensions (TDX) further steals part of the MKTME KeyID bits as TDX private KeyID bits. TDX private KeyID bits cannot be set in any mapping in the host kernel since they can only be accessed by software running inside a new CPU isolated mode. And unlike AMD's SME, the host kernel doesn't set any legacy MKTME KeyID bits in any mapping either. Therefore, it's not legitimate for KVM to set any KeyID bits in SPTEs which map guest memory. KVM maintains shadow_zero_check bits to represent which bits must be zero for SPTEs which map guest memory. MKTME KeyID bits should be set in shadow_zero_check. Currently, shadow_me_mask is used by AMD to set the sme_me_mask in SPTEs, and shadow_me_mask is excluded from shadow_zero_check. So initializing shadow_me_mask to represent all MKTME KeyID bits doesn't work for VMX (where, on the contrary, they must be set in shadow_zero_check). Introduce a new 'shadow_me_value' to replace the existing shadow_me_mask, and repurpose shadow_me_mask as 'all possible memory encryption bits'. The new schematic: - shadow_me_value: the memory encryption bit(s) that will be set in the SPTE (the original shadow_me_mask). - shadow_me_mask: all possible memory encryption bits (a superset of shadow_me_value). - For now, shadow_me_value is supposed to be set by SVM and VMX respectively, and it is a constant during KVM's lifetime. This perhaps doesn't fit MKTME, but for now the host kernel doesn't support it (and perhaps never will). - Bits in shadow_me_mask are set in shadow_zero_check, except for the bits in shadow_me_value. Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them. Replace shadow_me_mask with shadow_me_value in almost all code paths, except the one in PT64_PERM_MASK, which is used by need_remote_flush() to determine whether a remote TLB flush is needed. This should still use shadow_me_mask, as any encryption bit change should need a TLB flush. And for AMD, move initializing shadow_me_value/shadow_me_mask from kvm_mmu_reset_all_pte_masks() to svm_hardware_setup(). Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
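A sketch of the helper named in the message, under the assumption that it simply records the two masks after a sanity check:

    void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
    {
            /* shadow_me_value must be a subset of shadow_me_mask. */
            WARN_ON(me_value & ~me_mask);

            shadow_me_value = me_value;
            shadow_me_mask  = me_mask;
    }

Per the description above, VMX would pass me_value = 0 with me_mask covering the MKTME KeyID bits, while SVM passes sme_me_mask for both.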
2022-05-12KVM: x86/mmu: Rename reset_rsvds_bits_mask()Kai Huang
Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make it clearer that it resets the reserved bits check for guest's page table entries. Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <efdc174b85d55598880064b8bf09245d3791031d.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Expand and clean up page fault statsSean Christopherson
Expand and clean up the page fault stats. The current stats are at best incomplete, and at worst misleading. Differentiate between faults that are actually fixed vs. those that result in an MMIO SPTE being created, track faults that are spurious, faults that trigger emulation, faults that are fixed in the fast path, and last but not least, track the number of faults that are taken. Note, the number of faults that require emulation for write-protected shadow pages can roughly be calculated by subtracting the number of MMIO SPTEs created from the overall number of faults that trigger emulation. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faultsSean Christopherson
Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast path for TDP page faults. The generated code is identical, and the #ifdef makes it dangerously difficult to extend the logic (guess who forgot to add an "else" inside the #ifdef and ran through the page fault handler twice). No functional or binary change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
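Illustrative before/after of the fast-path selection (simplified sketch, not the verbatim diff):

    /* Before: */
    #ifdef CONFIG_RETPOLINE
            if (fault.is_tdp)
                    r = kvm_tdp_page_fault(vcpu, &fault);
    #endif

    /* After: same generated code, but the else-arm can no longer be lost. */
    if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
            r = kvm_tdp_page_fault(vcpu, &fault);
    else
            r = vcpu->arch.mmu->page_fault(vcpu, &fault);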
2022-05-12KVM: x86/mmu: Make all page fault handlers internal to the MMUSean Christopherson
Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all of the page fault handling collateral that was in mmu.h purely for the async #PF handler into mmu_internal.h, where it belongs. This will allow kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to expose those enums outside of the MMU. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns"Sean Christopherson
Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and kvm_faultin_pfn() to signal that the page fault handler should continue doing its thing. Aside from being gross and inefficient, using a boolean return to signal continue vs. stop makes it extremely difficult to add more helpers and/or move existing code to a helper. E.g. hypothetically, if nested MMUs were to gain a separate page fault handler in the future, everything up to the "is self-modifying PTE" check can be shared by all shadow MMUs, but communicating up the stack whether to continue on or stop becomes a nightmare. More concretely, proposed support for private guest memory ran into a similar issue, where it'll be forced to forego a helper in order to yield sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com. No functional change intended. Cc: David Matlack <dmatlack@google.com> Cc: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
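The resulting calling convention, roughly (a sketch of how the fault handlers consume the new return value; variable names assumed):

    /* Each helper returns RET_PF_CONTINUE or a final RET_PF_* disposition. */
    r = kvm_faultin_pfn(vcpu, fault);
    if (r != RET_PF_CONTINUE)
            return r;

    r = handle_abnormal_pfn(vcpu, fault, ACC_ALL);
    if (r != RET_PF_CONTINUE)
            return r;

    /* ... continue on to install the SPTE ... */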