summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-12-01arm64/sysreg: Standardise naming for ID_PFR1_EL1James Morse
To convert the 32bit id registers to use the sysreg generation, they must first have a regular pattern, to match the symbols the script generates. Ensure symbols for the ID_PFR1_EL1 register have an _EL1 suffix, and use lower case in feature names where the arm-arm does the same. No functional change. Signed-off-by: James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20221130171637.718182-10-james.morse@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-12-01arm64/sysreg: Standardise naming for ID_PFR0_EL1James Morse
To convert the 32bit id registers to use the sysreg generation, they must first have a regular pattern, to match the symbols the script generates. Ensure symbols for the ID_PFR0_EL1 register have an _EL1 suffix, and use lower case in feature names where the arm-arm does the same. No functional change. Signed-off-by: James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20221130171637.718182-9-james.morse@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-12-01arm64/sysreg: Standardise naming for ID_ISAR6_EL1James Morse
To convert the 32bit id registers to use the sysreg generation, they must first have a regular pattern, to match the symbols the script generates. Ensure symbols for the ID_ISAR6_EL1 register have an _EL1 suffix. No functional change. Signed-off-by: James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20221130171637.718182-8-james.morse@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-12-01arm64/sysreg: Standardise naming for ID_ISAR5_EL1James Morse
To convert the 32bit id registers to use the sysreg generation, they must first have a regular pattern, to match the symbols the script generates. Ensure symbols for the ID_ISAR5_EL1 register have an _EL1 suffix. No functional change. Signed-off-by: James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20221130171637.718182-7-james.morse@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-12-01arm64/sysreg: Standardise naming for ID_ISAR4_EL1James Morse
To convert the 32bit id registers to use the sysreg generation, they must first have a regular pattern, to match the symbols the script generates. Ensure symbols for the ID_ISAR4_EL1 register have an _EL1 suffix, and use lower-case for feature names where the arm-arm does the same. No functional change. Signed-off-by: James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20221130171637.718182-6-james.morse@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-12-01arm64/sysreg: Standardise naming for ID_ISAR0_EL1James Morse
To convert the 32bit id registers to use the sysreg generation, they must first have a regular pattern, to match the symbols the script generates. Ensure symbols for the ID_ISAR0_EL1 register have an _EL1 suffix, and use lower-case for feature names where the arm-arm does the same. To functional change. Signed-off-by: James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20221130171637.718182-5-james.morse@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-12-01arm64/sysreg: Standardise naming for ID_MMFR5_EL1James Morse
To convert the 32bit id registers to use the sysreg generation, they must first have a regular pattern, to match the symbols the script generates. Ensure symbols for the ID_MMFR5_EL1 register have an _EL1 suffix. No functional change. Signed-off-by: James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20221130171637.718182-4-james.morse@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-12-01arm64/sysreg: Standardise naming for ID_MMFR4_EL1James Morse
To convert the 32bit id registers to use the sysreg generation, they must first have a regular pattern, to match the symbols the script generates. Ensure symbols for the ID_MMFR4_EL1 register have an _EL1 suffix, and use lower case in feature names where the arm-arm does the same. No functional change. Signed-off-by: James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20221130171637.718182-3-james.morse@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-12-01arm64/sysreg: Standardise naming for ID_MMFR0_EL1James Morse
To convert the 32bit id registers to use the sysreg generation, they must first have a regular pattern, to match the symbols the script generates. The scripts would like to follow exactly what is in the arm-arm, which uses lower case for some of these feature names. Ensure symbols for the ID_MMFR0_EL1 register have an _EL1 suffix, and use lower case in feature names where the arm-arm does the same. No functional change. Signed-off-by: James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20221130171637.718182-2-james.morse@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-11-30KVM: x86: Use current rather than snapshotted TSC frequency if it is constantAnton Romanov
Don't snapshot tsc_khz into per-cpu cpu_tsc_khz if the host TSC is constant, in which case the actual TSC frequency will never change and thus capturing TSC during initialization is unnecessary, KVM can simply use tsc_khz. This value is snapshotted from kvm_timer_init->kvmclock_cpu_online->tsc_khz_changed(NULL) On CPUs with constant TSC, but not a hardware-specified TSC frequency, snapshotting cpu_tsc_khz and using that to set a VM's target TSC frequency can lead to VM to think its TSC frequency is not what it actually is if refining the TSC completes after KVM snapshots tsc_khz. The actual frequency never changes, only the kernel's calculation of what that frequency is changes. Ideally, KVM would not be able to race with TSC refinement, or would have a hook into tsc_refine_calibration_work() to get an alert when refinement is complete. Avoiding the race altogether isn't practical as refinement takes a relative eternity; it's deliberately put on a work queue outside of the normal boot sequence to avoid unnecessarily delaying boot. Adding a hook is doable, but somewhat gross due to KVM's ability to be built as a module. And if the TSC is constant, which is likely the case for every VMX/SVM-capable CPU produced in the last decade, the race can be hit if and only if userspace is able to create a VM before TSC refinement completes; refinement is slow, but not that slow. For now, punt on a proper fix, as not taking a snapshot can help some uses cases and not taking a snapshot is arguably correct irrespective of the race with refinement. Signed-off-by: Anton Romanov <romanton@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220608183525.1143682-1-romanton@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-11-30KVM: selftests: Verify userspace can stuff IA32_FEATURE_CONTROL at willSean Christopherson
Verify the KVM allows userspace to set all supported bits in the IA32_FEATURE_CONTROL MSR irrespective of the current guest CPUID, and that all unsupported bits are rejected. Throw the testcase into vmx_msrs_test even though it's not technically a VMX MSR; it's close enough, and the most frequently feature controlled by the MSR is VMX. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220607232353.3375324-4-seanjc@google.com
2022-11-30KVM: VMX: Move MSR_IA32_FEAT_CTL.LOCKED check into "is valid" helperSean Christopherson
Move the check on IA32_FEATURE_CONTROL being locked, i.e. read-only from the guest, into the helper to check the overall validity of the incoming value. Opportunistically rename the helper to make it clear that it returns a bool. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220607232353.3375324-3-seanjc@google.com
2022-11-30KVM: VMX: Allow userspace to set all supported FEATURE_CONTROL bitsSean Christopherson
Allow userspace to set all supported bits in MSR IA32_FEATURE_CONTROL irrespective of the guest CPUID model, e.g. via KVM_SET_MSRS. KVM's ABI is that userspace is allowed to set MSRs before CPUID, i.e. can set MSRs to values that would fault according to the guest CPUID model. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220607232353.3375324-2-seanjc@google.com
2022-11-30KVM: VMX: Make vmread_error_trampoline() uncallable from C codeSean Christopherson
Declare vmread_error_trampoline() as an opaque symbol so that it cannot be called from C code, at least not without some serious fudging. The trampoline always passes parameters on the stack so that the inline VMREAD sequence doesn't need to clobber registers. regparm(0) was originally added to document the stack behavior, but it ended up being confusing because regparm(0) is a nop for 64-bit targets. Opportunustically wrap the trampoline and its declaration in #ifdeffery to make it even harder to invoke incorrectly, to document why it exists, and so that it's not left behind if/when CONFIG_CC_HAS_ASM_GOTO_OUTPUT is true for all supported toolchains. No functional change intended. Cc: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220928232015.745948-1-seanjc@google.com
2022-11-30KVM: nVMX: Reword comments about generating nested CR0/4 read shadowsSean Christopherson
Reword the comments that (attempt to) document nVMX's overrides of the CR0/4 read shadows for L2 after calling vmx_set_cr0/4(). The important behavior that needs to be documented is that KVM needs to override the shadows to account for L1's masks even though the shadows are set by the common helpers (and that setting the shadows first would result in the correct shadows being clobbered). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20220831000721.4066617-1-seanjc@google.com
2022-11-30KVM: x86: Clean up KVM_CAP_X86_USER_SPACE_MSR documentationSean Christopherson
Clean up the KVM_CAP_X86_USER_SPACE_MSR documentation to eliminate misleading and/or inconsistent verbiage, and to actually document what accesses are intercepted by which flags. - s/will/may since not all #GPs are guaranteed to be intercepted - s/deflect/intercept to align with common KVM terminology - s/user space/userspace to align with the majority of KVM docs - Avoid using "trap" terminology, as KVM exits to userspace _before_ stepping, i.e. doesn't exhibit trap-like behavior - Actually document the flags Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220831001706.4075399-4-seanjc@google.com
2022-11-30KVM: x86: Reword MSR filtering docs to more precisely define behaviorSean Christopherson
Reword the MSR filtering documentatiion to more precisely define the behavior of filtering using common virtualization terminology. - Explicitly document KVM's behavior when an MSR is denied - s/handled/allowed as there is no guarantee KVM will "handle" the MSR access - Drop the "fall back" terminology, which incorrectly suggests that there is existing KVM behavior to fall back to - Fix an off-by-one error in the range (the end is exclusive) - Call out the interaction between MSR filtering and KVM_CAP_X86_USER_SPACE_MSR's KVM_MSR_EXIT_REASON_FILTER - Delete the redundant paragraph on what '0' and '1' in the bitmap means, it's covered by the sections on KVM_MSR_FILTER_{READ,WRITE} - Delete the clause on x2APIC MSR behavior depending on APIC base, this is covered by stating that KVM follows architectural behavior when emulating/virtualizing MSR accesses Reported-by: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220831001706.4075399-3-seanjc@google.com
2022-11-30KVM: x86: Delete documentation for READ|WRITE in KVM_X86_SET_MSR_FILTERSean Christopherson
Delete the paragraph that describes the behavior when both KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE are set for a range. There is nothing special about KVM's handling of this combination, whereas explicitly documenting the combination suggests that there is some magic behavior the user needs to be aware of. Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220831001706.4075399-2-seanjc@google.com
2022-11-30KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRSJim Mattson
According to Intel's document on Indirect Branch Restricted Speculation, "Enabling IBRS does not prevent software from controlling the predicted targets of indirect branches of unrelated software executed later at the same predictor mode (for example, between two different user applications, or two different virtual machines). Such isolation can be ensured through use of the Indirect Branch Predictor Barrier (IBPB) command." This applies to both basic and enhanced IBRS. Since L1 and L2 VMs share hardware predictor modes (guest-user and guest-kernel), hardware IBRS is not sufficient to virtualize IBRS. (The way that basic IBRS is implemented on pre-eIBRS parts, hardware IBRS is actually sufficient in practice, even though it isn't sufficient architecturally.) For virtual CPUs that support IBRS, add an indirect branch prediction barrier on emulated VM-exit, to ensure that the predicted targets of indirect branches executed in L1 cannot be controlled by software that was executed in L2. Since we typically don't intercept guest writes to IA32_SPEC_CTRL, perform the IBPB at emulated VM-exit regardless of the current IA32_SPEC_CTRL.IBRS value, even though the IBPB could technically be deferred until L1 sets IA32_SPEC_CTRL.IBRS, if IA32_SPEC_CTRL.IBRS is clear at emulated VM-exit. This is CVE-2022-2196. Fixes: 5c911beff20a ("KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02") Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20221019213620.1953281-3-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-11-30KVM: VMX: Guest usage of IA32_SPEC_CTRL is likelyJim Mattson
At this point in time, most guests (in the default, out-of-the-box configuration) are likely to use IA32_SPEC_CTRL. Therefore, drop the compiler hint that it is unlikely for KVM to be intercepting WRMSR of IA32_SPEC_CTRL. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20221019213620.1953281-2-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-11-30KVM: nVMX: Inject #GP, not #UD, if "generic" VMXON CR0/CR4 check failsSean Christopherson
Inject #GP for if VMXON is attempting with a CR0/CR4 that fails the generic "is CRx valid" check, but passes the CR4.VMXE check, and do the generic checks _after_ handling the post-VMXON VM-Fail. The CR4.VMXE check, and all other #UD cases, are special pre-conditions that are enforced prior to pivoting on the current VMX mode, i.e. occur before interception if VMXON is attempted in VMX non-root mode. All other CR0/CR4 checks generate #GP and effectively have lower priority than the post-VMXON check. Per the SDM: IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ... THEN #UD; ELSIF not in VMX operation THEN IF (CPL > 0) or (in A20M mode) or (the values of CR0 and CR4 are not supported in VMX operation) THEN #GP(0); ELSIF in VMX non-root operation THEN VMexit; ELSIF CPL > 0 THEN #GP(0); ELSE VMfail("VMXON executed in VMX root operation"); FI; which, if re-written without ELSIF, yields: IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ... THEN #UD IF in VMX non-root operation THEN VMexit; IF CPL > 0 THEN #GP(0) IF in VMX operation THEN VMfail("VMXON executed in VMX root operation"); IF (in A20M mode) or (the values of CR0 and CR4 are not supported in VMX operation) THEN #GP(0); Note, KVM unconditionally forwards VMXON VM-Exits that occur in L2 to L1, i.e. there is no need to check the vCPU is not in VMX non-root mode. Add a comment to explain why unconditionally forwarding such exits is functionally correct. Reported-by: Eric Li <ercli@ucdavis.edu> Fixes: c7d855c2aff2 ("KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20221006001956.329314-1-seanjc@google.com
2022-11-30KVM: SVM: Replace kmap_atomic() with kmap_local_page()Zhao Liu
The use of kmap_atomic() is being deprecated in favor of kmap_local_page()[1]. The main difference between atomic and local mappings is that local mappings don't disable page faults or preemption. There're 2 reasons we can use kmap_local_page() here: 1. SEV is 64-bit only and kmap_local_page() doesn't disable migration in this case, but here the function clflush_cache_range() uses CLFLUSHOPT instruction to flush, and on x86 CLFLUSHOPT is not CPU-local and flushes the page out of the entire cache hierarchy on all CPUs (APM volume 3, chapter 3, CLFLUSHOPT). So there's no need to disable preemption to ensure CPU-local. 2. clflush_cache_range() doesn't need to disable pagefault and the mapping is still valid even if sleeps. This is also true for sched out/in when preempted. In addition, though kmap_local_page() is a thin wrapper around page_address() on 64-bit, kmap_local_page() should still be used here in preference to page_address() since page_address() isn't suitable to be used in a generic function (like sev_clflush_pages()) where the page passed in is not easy to determine the source of allocation. Keeping the kmap* API in place means it can be used for things other than highmem mappings[2]. Therefore, sev_clflush_pages() is a function that should use kmap_local_page() in place of kmap_atomic(). Convert the calls of kmap_atomic() / kunmap_atomic() to kmap_local_page() / kunmap_local(). [1]: https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com [2]: https://lore.kernel.org/lkml/5d667258-b58b-3d28-3609-e7914c99b31b@intel.com/ Suggested-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Suggested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220928092748.463631-1-zhao1.liu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-11-30KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't validSean Christopherson
Skip the WRMSR fastpath in SVM's VM-Exit handler if the next RIP isn't valid, e.g. because KVM is running with nrips=false. SVM must decode and emulate to skip the WRMSR if the CPU doesn't provide the next RIP. Getting the instruction bytes to decode the WRMSR requires reading guest memory, which in turn means dereferencing memslots, and that isn't safe because KVM doesn't hold SRCU when the fastpath runs. Don't bother trying to enable the fastpath for this case, e.g. by doing only the WRMSR and leaving the "skip" until later. NRIPS is supported on all modern CPUs (KVM has considered making it mandatory), and the next RIP will be valid the vast, vast majority of the time. ============================= WARNING: suspicious RCU usage 6.0.0-smp--4e557fcd3d80-skip #13 Tainted: G O ----------------------------- include/linux/kvm_host.h:954 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by stable/206475: #0: ffff9d9dfebcc0f0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x8b/0x620 [kvm] stack backtrace: CPU: 152 PID: 206475 Comm: stable Tainted: G O 6.0.0-smp--4e557fcd3d80-skip #13 Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 10.48.0 01/27/2022 Call Trace: <TASK> dump_stack_lvl+0x69/0xaa dump_stack+0x10/0x12 lockdep_rcu_suspicious+0x11e/0x130 kvm_vcpu_gfn_to_memslot+0x155/0x190 [kvm] kvm_vcpu_gfn_to_hva_prot+0x18/0x80 [kvm] paging64_walk_addr_generic+0x183/0x450 [kvm] paging64_gva_to_gpa+0x63/0xd0 [kvm] kvm_fetch_guest_virt+0x53/0xc0 [kvm] __do_insn_fetch_bytes+0x18b/0x1c0 [kvm] x86_decode_insn+0xf0/0xef0 [kvm] x86_emulate_instruction+0xba/0x790 [kvm] kvm_emulate_instruction+0x17/0x20 [kvm] __svm_skip_emulated_instruction+0x85/0x100 [kvm_amd] svm_skip_emulated_instruction+0x13/0x20 [kvm_amd] handle_fastpath_set_msr_irqoff+0xae/0x180 [kvm] svm_vcpu_run+0x4b8/0x5a0 [kvm_amd] vcpu_enter_guest+0x16ca/0x22f0 [kvm] kvm_arch_vcpu_ioctl_run+0x39d/0x900 [kvm] kvm_vcpu_ioctl+0x538/0x620 [kvm] __se_sys_ioctl+0x77/0xc0 __x64_sys_ioctl+0x1d/0x20 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: 404d5d7bff0d ("KVM: X86: Introduce more exit_fastpath_completion enum values") Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220930234031.1732249-1-seanjc@google.com
2022-11-30KVM: x86: Fail emulation during EMULTYPE_SKIP on any exceptionSean Christopherson
Treat any exception during instruction decode for EMULTYPE_SKIP as a "full" emulation failure, i.e. signal failure instead of queuing the exception. When decoding purely to skip an instruction, KVM and/or the CPU has already done some amount of emulation that cannot be unwound, e.g. on an EPT misconfig VM-Exit KVM has already processeed the emulated MMIO. KVM already does this if a #UD is encountered, but not for other exceptions, e.g. if a #PF is encountered during fetch. In SVM's soft-injection use case, queueing the exception is particularly problematic as queueing exceptions while injecting events can put KVM into an infinite loop due to bailing from VM-Enter to service the newly pending exception. E.g. multiple warnings to detect such behavior fire: ------------[ cut here ]------------ WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9873 kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm] Modules linked in: kvm_amd ccp kvm irqbypass CPU: 3 PID: 1017 Comm: svm_nested_soft Not tainted 6.0.0-rc1+ #220 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm] Call Trace: kvm_vcpu_ioctl+0x223/0x6d0 [kvm] __x64_sys_ioctl+0x85/0xc0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ---[ end trace 0000000000000000 ]--- ------------[ cut here ]------------ WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9987 kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm] Modules linked in: kvm_amd ccp kvm irqbypass CPU: 3 PID: 1017 Comm: svm_nested_soft Tainted: G W 6.0.0-rc1+ #220 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm] Call Trace: kvm_vcpu_ioctl+0x223/0x6d0 [kvm] __x64_sys_ioctl+0x85/0xc0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 ---[ end trace 0000000000000000 ]--- Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20220930233632.1725475-1-seanjc@google.com
2022-11-30KVM: x86: Keep the lock order consistent between SRCU and gpc spinlockPeng Hao
Acquire SRCU before taking the gpc spinlock in wait_pending_event() so as to be consistent with all other functions that acquire both locks. It's not illegal to acquire SRCU inside a spinlock, nor is there deadlock potential, but in general it's preferable to order locks from least restrictive to most restrictive, e.g. if wait_pending_event() needed to sleep for whatever reason, it could do so while holding SRCU, but would need to drop the spinlock. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/CAPm50a++Cb=QfnjMZ2EnCj-Sb9Y4UM-=uOEtHAcjnNLCAAf-dQ@mail.gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-11-30KVM: VMX: Resume guest immediately when injecting #GP on ECREATESean Christopherson
Resume the guest immediately when injecting a #GP on ECREATE due to an invalid enclave size, i.e. don't attempt ECREATE in the host. The #GP is a terminal fault, e.g. skipping the instruction if ECREATE is successful would result in KVM injecting #GP on the instruction following ECREATE. Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions") Cc: stable@vger.kernel.org Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20220930233132.1723330-1-seanjc@google.com
2022-11-30KVM: Skip unnecessary "unmap" if gpc is already valid during refreshSean Christopherson
When refreshing a gfn=>pfn cache, skip straight to unlocking if the cache already valid instead of stuffing the "old" variables to turn the unmapping outro into a nop. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpersSean Christopherson
Drop the @gpa param from the exported check()+refresh() helpers and limit changing the cache's GPA to the activate path. All external users just feed in gpc->gpa, i.e. this is a fancy nop. Allowing users to change the GPA at check()+refresh() is dangerous as those helpers explicitly allow concurrent calls, e.g. KVM could get into a livelock scenario. It's also unclear as to what the expected behavior should be if multiple tasks attempt to refresh with different GPAs. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Do not partially reinitialize gfn=>pfn cache during activationSean Christopherson
Don't partially reinitialize a gfn=>pfn cache when activating the cache, and instead assert that the cache is not valid during activation. Bug the VM if the assertion fails, as use-after-free and/or data corruption is all but guaranteed if KVM ends up with a valid-but-inactive cache. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Drop KVM's API to allow temporarily unmapping gfn=>pfn cacheSean Christopherson
Drop kvm_gpc_unmap() as it has no users and unclear requirements. The API was added as part of the original gfn_to_pfn_cache support, but its sole usage[*] was never merged. Fold the guts of kvm_gpc_unmap() into the deactivate path and drop the API. Omit acquiring refresh_lock as as concurrent calls to kvm_gpc_deactivate() are not allowed (this is not enforced, e.g. via lockdep. due to it being called during vCPU destruction). If/when temporary unmapping makes a comeback, the desirable behavior is likely to restrict temporary unmapping to vCPU-exclusive mappings and require the vcpu->mutex be held to serialize unmap. Use of the refresh_lock to protect unmapping was somewhat specuatively added by commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via mutex") to guard against concurrent unmaps, but the primary use case of the temporary unmap, nested virtualization[*], doesn't actually need or want concurrent unmaps. [*] https://lore.kernel.org/all/20211210163625.2886-7-dwmw2@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh()Michal Luczaj
Make kvm_gpc_refresh() use kvm instance cached in gfn_to_pfn_cache. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> [sean: leave kvm_gpc_unmap() as-is] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Clean up hva_to_pfn_retry()Michal Luczaj
Make hva_to_pfn_retry() use kvm instance cached in gfn_to_pfn_cache. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check()Michal Luczaj
Make kvm_gpc_check() use kvm instance cached in gfn_to_pfn_cache. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Store immutable gfn_to_pfn_cache propertiesMichal Luczaj
Move the assignment of immutable properties @kvm, @vcpu, and @usage to the initializer. Make _activate() and _deactivate() use stored values. Note, @len is also effectively immutable for most cases, but not in the case of the Xen runstate cache, which may be split across two pages and the length of the first segment will depend on its address. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> [sean: handle @len in a separate patch] Signed-off-by: Sean Christopherson <seanjc@google.com> [dwmw2: acknowledge that @len can actually change for some use cases] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: x86/xen: add support for 32-bit guests in SCHEDOP_pollMetin Kaya
This patch introduces compat version of struct sched_poll for SCHEDOP_poll sub-operation of sched_op hypercall, reads correct amount of data (16 bytes in 32-bit case, 24 bytes otherwise) by using new compat_sched_poll struct, copies it to sched_poll properly, and lets rest of the code run as is. Signed-off-by: Metin Kaya <metikaya@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
2022-11-30KVM: x86: fix uninitialized variable use on KVM_REQ_TRIPLE_FAULTPaolo Bonzini
If a triple fault was fixed by kvm_x86_ops.nested_ops->triple_fault (by turning it into a vmexit), there is no need to leave vcpu_enter_guest(). Any vcpu->requests will be caught later before the actual vmentry, and in fact vcpu_enter_guest() was not initializing the "r" variable. Depending on the compiler's whims, this could cause the x86_64/triple_fault_event_test test to fail. Cc: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 92e7d5c83aff ("KVM: x86: allow L1 to not intercept triple fault") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30KVM: x86: Remove unused argument in gpc_unmap_khva()Michal Luczaj
Remove the unused @kvm argument from gpc_unmap_khva(). Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30KVM: Shorten gfn_to_pfn_cache function namesMichal Luczaj
Formalize "gpc" as the acronym and use it in function names. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundaryDavid Woodhouse
Torture test the cases where the runstate crosses a page boundary, and and especially the case where it's configured in 32-bit mode and doesn't, but then switching to 64-bit mode makes it go onto the second page. To simplify this, make the KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST ioctl also update the guest runstate area. It already did so if the actual runstate changed, as a side-effect of kvm_xen_update_runstate(). So doing it in the plain adjustment case is making it more consistent, as well as giving us a nice way to trigger the update without actually running the vCPU again and changing the values. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30KVM: x86/xen: Allow XEN_RUNSTATE_UPDATE flag behaviour to be configuredDavid Woodhouse
Closer inspection of the Xen code shows that we aren't supposed to be using the XEN_RUNSTATE_UPDATE flag unconditionally. It should be explicitly enabled by guests through the HYPERVISOR_vm_assist hypercall. If we randomly set the top bit of ->state_entry_time for a guest that hasn't asked for it and doesn't expect it, that could make the runtimes fail to add up and confuse the guest. Without the flag it's perfectly safe for a vCPU to read its own vcpu_runstate_info; just not for one vCPU to read *another's*. I briefly pondered adding a word for the whole set of VMASST_TYPE_* flags but the only one we care about for HVM guests is this, so it seemed a bit pointless. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221127122210.248427-3-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30KVM: x86/xen: Compatibility fixes for shared runstate areaDavid Woodhouse
The guest runstate area can be arbitrarily byte-aligned. In fact, even when a sane 32-bit guest aligns the overall structure nicely, the 64-bit fields in the structure end up being unaligned due to the fact that the 32-bit ABI only aligns them to 32 bits. So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE is buggy, because if it's unaligned then we can't update the whole field atomically; the low bytes might be observable before the _UPDATE bit is. Xen actually updates the *byte* containing that top bit, on its own. KVM should do the same. In addition, we cannot assume that the runstate area fits within a single page. One option might be to make the gfn_to_pfn cache cope with regions that cross a page — but getting a contiguous virtual kernel mapping of a discontiguous set of IOMEM pages is a distinctly non-trivial exercise, and it seems this is the *only* current use case for the GPC which would benefit from it. An earlier version of the runstate code did use a gfn_to_hva cache for this purpose, but it still had the single-page restriction because it used the uhva directly — because it needs to be able to do so atomically when the vCPU is being scheduled out, so it used pagefault_disable() around the accesses and didn't just use kvm_write_guest_cached() which has a fallback path. So... use a pair of GPCs for the first and potential second page covering the runstate area. We can get away with locking both at once because nothing else takes more than one GPC lock at a time so we can invent a trivial ordering rule. The common case where it's all in the same page is kept as a fast path, but in both cases, the actual guest structure (compat or not) is built up from the fields in @vx, following preset pointers to the state and times fields. The only difference is whether those pointers point to the kernel stack (in the split case) or to guest memory directly via the GPC. The fast path is also fixed to use a byte access for the XEN_RUNSTATE_UPDATE bit, then the only real difference is the dual memcpy. Finally, Xen also does write the runstate area immediately when it's configured. Flip the kvm_xen_update_runstate() and …_guest() functions and call the latter directly when the runstate area is set. This means that other ioctls which modify the runstate also write it immediately to the guest when they do so, which is also intended. Update the xen_shinfo_test to exercise the pathological case where the XEN_RUNSTATE_UPDATE flag in the top byte of the state_entry_time is actually in a different page to the rest of the 64-bit word. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-29KVM: selftests: Build access_tracking_perf_test for arm64Oliver Upton
Does exactly what it says on the tin. Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221118211503.4049023-3-oliver.upton@linux.dev
2022-11-29KVM: selftests: Have perf_test_util signal when to stop vCPUsOliver Upton
Signal that a test run is complete through perf_test_args instead of having tests open code a similar solution. Ensure that the field resets to false at the beginning of a test run as the structure is reused between test runs, eliminating a couple of bugs: access_tracking_perf_test hangs indefinitely on a subsequent test run, as 'done' remains true. The bug doesn't amount to much right now, as x86 supports a single guest mode. However, this is a precondition of enabling the test for other architectures with >1 guest mode, like arm64. memslot_modification_stress_test has the exact opposite problem, where subsequent test runs complete immediately as 'run_vcpus' remains false. Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> [oliver: added commit message, preserve spin_wait_for_next_iteration()] Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221118211503.4049023-2-oliver.upton@linux.dev
2022-11-29Documentation: document the ABI changes for KVM_CAP_ARM_MTEPeter Collingbourne
Document both the restriction on VM_MTE_ALLOWED mappings and the relaxation for shared mappings. Signed-off-by: Peter Collingbourne <pcc@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221104011041.290951-9-pcc@google.com
2022-11-29KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE enabledPeter Collingbourne
Certain VMMs such as crosvm have features (e.g. sandboxing) that depend on being able to map guest memory as MAP_SHARED. The current restriction on sharing MAP_SHARED pages with the guest is preventing the use of those features with MTE. Now that the races between tasks concurrently clearing tags on the same page have been fixed, remove this restriction. Note that this is a relaxation of the ABI. Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221104011041.290951-8-pcc@google.com
2022-11-29KVM: arm64: unify the tests for VMAs in memslots when MTE is enabledPeter Collingbourne
Previously we allowed creating a memslot containing a private mapping that was not VM_MTE_ALLOWED, but would later reject KVM_RUN with -EFAULT. Now we reject the memory region at memslot creation time. Since this is a minor tweak to the ABI (a VMM that created one of these memslots would fail later anyway), no VMM to my knowledge has MTE support yet, and the hardware with the necessary features is not generally available, we can probably make this ABI change at this point. Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221104011041.290951-7-pcc@google.com
2022-11-29arm64: mte: Lock a page for MTE tag initialisationCatalin Marinas
Initialising the tags and setting PG_mte_tagged flag for a page can race between multiple set_pte_at() on shared pages or setting the stage 2 pte via user_mem_abort(). Introduce a new PG_mte_lock flag as PG_arch_3 and set it before attempting page initialisation. Given that PG_mte_tagged is never cleared for a page, consider setting this flag to mean page unlocked and wait on this bit with acquire semantics if the page is locked: - try_page_mte_tagging() - lock the page for tagging, return true if it can be tagged, false if already tagged. No acquire semantics if it returns true (PG_mte_tagged not set) as there is no serialisation with a previous set_page_mte_tagged(). - set_page_mte_tagged() - set PG_mte_tagged with release semantics. The two-bit locking is based on Peter Collingbourne's idea. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Steven Price <steven.price@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221104011041.290951-6-pcc@google.com
2022-11-29mm: Add PG_arch_3 page flagPeter Collingbourne
As with PG_arch_2, this flag is only allowed on 64-bit architectures due to the shortage of bits available. It will be used by the arm64 MTE code in subsequent patches. Signed-off-by: Peter Collingbourne <pcc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Steven Price <steven.price@arm.com> [catalin.marinas@arm.com: added flag preserving in __split_huge_page_tail()] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221104011041.290951-5-pcc@google.com
2022-11-29KVM: arm64: Simplify the sanitise_mte_tags() logicCatalin Marinas
Currently sanitise_mte_tags() checks if it's an online page before attempting to sanitise the tags. Such detection should be done in the caller via the VM_MTE_ALLOWED vma flag. Since kvm_set_spte_gfn() does not have the vma, leave the page unmapped if not already tagged. Tag initialisation will be done on a subsequent access fault in user_mem_abort(). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> [pcc@google.com: fix the page initializer] Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Steven Price <steven.price@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221104011041.290951-4-pcc@google.com
2022-11-29arm64: mte: Fix/clarify the PG_mte_tagged semanticsCatalin Marinas
Currently the PG_mte_tagged page flag mostly means the page contains valid tags and it should be set after the tags have been cleared or restored. However, in mte_sync_tags() it is set before setting the tags to avoid, in theory, a race with concurrent mprotect(PROT_MTE) for shared pages. However, a concurrent mprotect(PROT_MTE) with a copy on write in another thread can cause the new page to have stale tags. Similarly, tag reading via ptrace() can read stale tags if the PG_mte_tagged flag is set before actually clearing/restoring the tags. Fix the PG_mte_tagged semantics so that it is only set after the tags have been cleared or restored. This is safe for swap restoring into a MAP_SHARED or CoW page since the core code takes the page lock. Add two functions to test and set the PG_mte_tagged flag with acquire and release semantics. The downside is that concurrent mprotect(PROT_MTE) on a MAP_SHARED page may cause tag loss. This is already the case for KVM guests if a VMM changes the page protection while the guest triggers a user_mem_abort(). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> [pcc@google.com: fix build with CONFIG_ARM64_MTE disabled] Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Steven Price <steven.price@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221104011041.290951-3-pcc@google.com