path: root/arch/x86/kvm/vmx.c
Age  Commit message  Author
2018-10-17  KVM: nVMX: assimilate nested_vmx_entry_failure() into nested_vmx_enter_non_root_mode()  (Sean Christopherson)
Handling all VMExits due to failed consistency checks on VMEnter in nested_vmx_enter_non_root_mode() consolidates all relevant code into a single location, and removing nested_vmx_entry_failure() eliminates a confusing function name and label. For a VMEntry, "fail" and its derivatives have a very specific meaning due to the different behavior of a VMEnter VMFail versus VMExit, i.e. it wasn't obvious that nested_vmx_entry_failure() handled VMExit scenarios. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17  KVM: nVMX: move check_vmentry_postreqs() call to nested_vmx_enter_non_root_mode()  (Sean Christopherson)
In preparation of supporting checkpoint/restore for nested state, commit ca0bde28f2ed ("kvm: nVMX: Split VMCS checks from nested_vmx_run()") modified check_vmentry_postreqs() to only perform the guest EFER consistency checks when nested_run_pending is true. But, in the normal nested VMEntry flow, nested_run_pending is only set after check_vmentry_postreqs(), i.e. the consistency check is being skipped. Alternatively, nested_run_pending could be set prior to calling check_vmentry_postreqs() in nested_vmx_run(), but placing the consistency checks in nested_vmx_enter_non_root_mode() allows us to split prepare_vmcs02() and interleave the preparation with the consistency checks without having to change the call sites of nested_vmx_enter_non_root_mode(). In other words, the rest of the consistency check code in nested_vmx_run() will be joining the postreqs checks in future patches. Fixes: ca0bde28f2ed ("kvm: nVMX: Split VMCS checks from nested_vmx_run()") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17  KVM: nVMX: rename enter_vmx_non_root_mode to nested_vmx_enter_non_root_mode  (Sean Christopherson)
...to be more consistent with the nested VMX nomenclature. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17  KVM: nVMX: try to set EFER bits correctly when initializing controls  (Sean Christopherson)
VM_ENTRY_IA32E_MODE and VM_{ENTRY,EXIT}_LOAD_IA32_EFER will be explicitly set/cleared as needed by vmx_set_efer(), but attempt to get the bits set correctly when initializing the control fields. Setting the value correctly can avoid multiple VMWrites. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17  KVM: vmx: do not unconditionally clear EFER switching  (Sean Christopherson)
Do not unconditionally call clear_atomic_switch_msr() when updating EFER. This adds up to four unnecessary VMWrites in the case where guest_efer != host_efer, e.g. if the load_on_{entry,exit} bits were already set. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
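A rough sketch of the intended flow in update_transition_efer(), assuming the existing add/clear_atomic_switch_msr() helpers (not the verbatim diff):

    if (guest_efer != host_efer)
        /* (Re)adding an already-present entry is cheap and idempotent. */
        add_atomic_switch_msr(vmx, MSR_EFER, guest_efer, host_efer, false);
    else
        /* Only clear when the switch entry is actually unwanted. */
        clear_atomic_switch_msr(vmx, MSR_EFER);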
2018-10-17  KVM: nVMX: reset cache/shadows when switching loaded VMCS  (Sean Christopherson)
Reset the vm_{entry,exit}_controls_shadow variables as well as the segment cache after loading a new VMCS in vmx_switch_vmcs(). The shadows/cache track VMCS data, i.e. they're stale every time we switch to a new VMCS regardless of reason. This fixes a bug where stale control shadows would be consumed after a nested VMExit due to a failed consistency check. Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
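The reset boils down to a few lines in vmx_switch_vmcs(), in sketch form using helpers already present in vmx.c:

    vm_entry_controls_reset_shadow(vmx);
    vm_exit_controls_reset_shadow(vmx);
    vmx_segment_cache_clear(vmx);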
2018-10-17  KVM: nVMX: use vm_exit_controls_init() to write exit controls for vmcs02  (Sean Christopherson)
Write VM_EXIT_CONTROLS using vm_exit_controls_init() when configuring vmcs02, otherwise vm_exit_controls_shadow will be stale. EFER in particular can be corrupted if VM_EXIT_LOAD_IA32_EFER is not updated due to an incorrect shadow optimization, which can crash L0 due to EFER not being loaded on exit. This does not occur with the current code base simply because update_transition_efer() unconditionally clears VM_EXIT_LOAD_IA32_EFER before conditionally setting it, and because a nested guest always starts with VM_EXIT_LOAD_IA32_EFER clear, i.e. we'll only ever unnecessarily clear the bit. That is, until someone optimizes update_transition_efer()... Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
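For reference, the helper pairs the VMWRITE with a shadow update (roughly, as defined in vmx.c of this period):

    static inline void vm_exit_controls_init(struct vcpu_vmx *vmx, u32 val)
    {
        vmcs_write32(VM_EXIT_CONTROLS, val);
        vmx->vm_exit_controls_shadow = val;  /* keep shadow coherent */
    }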
2018-10-17  KVM: nVMX: move vmcs12 EPTP consistency check to check_vmentry_prereqs()  (Sean Christopherson)
An invalid EPTP causes a VMFail(VMXERR_ENTRY_INVALID_CONTROL_FIELD), not a VMExit. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17  KVM: nVMX: move host EFER consistency checks to VMFail path  (Sean Christopherson)
Invalid host state related to loading EFER on VMExit causes a VMFail(VMXERR_ENTRY_INVALID_HOST_STATE_FIELD), not a VMExit. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17  KVM: nVMX: Always reflect #NM VM-exits to L1  (Jim Mattson)
When bit 3 (corresponding to CR0.TS) of the VMCS12 cr0_guest_host_mask field is clear, the VMCS12 guest_cr0 field does not necessarily hold the current value of the L2 CR0.TS bit, so the code that checked for L2's CR0.TS bit being set was incorrect. Moreover, I'm not sure that the CR0.TS check was adequate. (What if L2's CR0.EM was set, for instance?) Fortunately, lazy FPU has gone away, so L0 has lost all interest in intercepting #NM exceptions. See commit bd7e5b0899a4 ("KVM: x86: remove code for lazy FPU handling"). Therefore, there is no longer any question of which hypervisor gets first dibs. The #NM VM-exit should always be reflected to L1. (Note that the corresponding bit must be set in the VMCS12 exception_bitmap field for there to be an #NM VM-exit at all.) Fixes: ccf9844e5d99c ("kvm, vmx: Really fix lazy FPU on nested guest") Reported-by: Abhiroop Dabral <adabral@paloaltonetworks.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Tested-by: Abhiroop Dabral <adabral@paloaltonetworks.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17  KVM/VMX: Remove unused function is_external_interrupt()  (Tianyu Lan)
is_external_interrupt() is no longer used, so remove it. Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17  nVMX x86: Make nested_vmx_check_pml_controls() concise  (Krish Sadhukhan)
Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17  KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail  (Sean Christopherson)
A VMEnter that VMFails (as opposed to VMExits) does not touch host state beyond registers that are explicitly noted in the VMFail path, e.g. EFLAGS. Host state does not need to be loaded because VMFail is only signaled for consistency checks that occur before the CPU starts to load guest state, i.e. there is no need to restore any state as nothing has been modified. But in the case where a VMFail is detected by hardware and not by KVM (due to deferring consistency checks to hardware), KVM has already loaded some amount of guest state. Luckily, "loaded" only means loaded to KVM's software model, i.e. vmcs01 has not been modified. So, unwind our software model to the pre-VMEntry host state. Not restoring host state in this VMFail path leads to a variety of failures because we end up with stale data in vcpu->arch, e.g. CR0, CR4, EFER, etc... will all be out of sync relative to vmcs01. Any significant delta in the stale data is all but guaranteed to crash L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong. An alternative to this "soft" reload would be to load host state from vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is wildly inconsistent with respect to the VMX architecture, e.g. an L1 VMM with separate VMExit and VMFail paths would explode. Note that this approach does not mean KVM is 100% accurate with respect to VMX hardware behavior, even at an architectural level (the exact order of consistency checks is microarchitecture specific). But 100% emulation accuracy isn't the goal (with this patch), rather the goal is to be consistent in the information delivered to L1, e.g. a VMExit should not fall-through VMENTER, and a VMFail should not jump to HOST_RIP. This technically reverts commit 5af4157388ad ("KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure"), but retains the core aspects of that patch, just in an open coded form due to the need to pull state from vmcs01 instead of vmcs12. Restoring host state resolves a variety of issues introduced by commit 4f350c6dbcb9 ("kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly"), which remedied the incorrect behavior of treating VMFail like VMExit but in doing so neglected to restore arch state that had been modified prior to attempting nested VMEnter. A sample failure that occurs due to stale vcpu.arch state is a fault of some form while emulating an LGDT (due to emulated UMIP) from L1 after a failed VMEntry to L3, in this case when running the KVM unit test test_tpr_threshold_values in L1. L0 also hits a WARN in this case due to a stale arch.cr4.UMIP.

L1:
    BUG: unable to handle kernel paging request at ffffc90000663b9e
    PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
    Oops: 0009 [#1] SMP
    CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G W 4.18.0-rc2+ #2
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    RIP: 0010:native_load_gdt+0x0/0x10
    ...
    Call Trace:
     load_fixmap_gdt+0x22/0x30
     __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
     vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
     nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
     vmx_handle_exit+0x246/0x15a0 [kvm_intel]
     kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
     kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
     do_vfs_ioctl+0x9f/0x600
     ksys_ioctl+0x66/0x70
     __x64_sys_ioctl+0x16/0x20
     do_syscall_64+0x4f/0x100
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

L0:
    WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
    ...
    CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
    Hardware name: Intel Corporation Kabylake Client platform/KBL S
    RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
    ...
    Call Trace:
     kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
     kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
     do_vfs_ioctl+0x9f/0x5e0
     ksys_ioctl+0x66/0x70
     __x64_sys_ioctl+0x16/0x20
     do_syscall_64+0x49/0xf0
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 5af4157388ad ("KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure") Fixes: 4f350c6dbcb9 ("kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly") Cc: Jim Mattson <jmattson@google.com> Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17  KVM: nVMX: Clear reserved bits of #DB exit qualification  (Jim Mattson)
According to volume 3 of the SDM, bits 63:15 and 12:4 of the exit qualification field for debug exceptions are reserved (cleared to 0). However, the SDM is incorrect about bit 16 (corresponding to DR6.RTM). This bit should be set if a debug exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM region while advanced debugging of RTM transactional regions was enabled. Note that this is the opposite of DR6.RTM, which "indicates (when clear) that a debug exception (#DB) or breakpoint exception (#BP) occurred inside an RTM region while advanced debugging of RTM transactional regions was enabled." There is still an issue with stale DR6 bits potentially being misreported for the current debug exception. DR6 should not have been modified before vectoring the #DB exception, and the "new DR6 bits" should be available somewhere, but it was and they aren't. Fixes: b96fb439774e1 ("KVM: nVMX: fixes to nested virt interrupt injection") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
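A sketch of the masking when building the exit qualification from DR6 (constant names from the kernel's debug-register definitions; not necessarily the verbatim diff):

    if (vcpu->arch.exception.nr == DB_VECTOR) {
        *exit_qual = vcpu->arch.dr6;
        *exit_qual &= ~(DR6_FIXED_1 | DR6_BT);  /* clear reserved bits */
        *exit_qual ^= DR6_RTM;                  /* bit 16 is inverted vs. DR6.RTM */
    }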
2018-10-13  KVM: nVMX: Do not flush TLB on L1<->L2 transitions if L1 uses VPID and EPT  (Liran Alon)
If L1 uses VPID, it expects the TLB to not be flushed on L1<->L2 transitions. However, the code currently flushes the TLB nonetheless if we didn't allocate a vpid02 for L2, since in that case vmcs02->vpid == vmcs01->vpid == vmx->vpid. But if L1 uses EPT, TLB entries populated by L2 are tagged with EPTP02 while TLB entries populated by L1 are tagged with EPTP01. Therefore, we can also avoid the TLB flush if L1 uses both VPID and EPT. Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-13  KVM: nVMX: Flush linear and combined mappings on VPID02 related flushes  (Liran Alon)
All VPID12s used on a given L1 vCPU are translated to a single VPID02 (vmx->nested.vpid02 or vmx->vpid). Therefore, on L1->L2 VMEntry, we need to invalidate linear and combined mappings tagged by VPID02 in case L1 uses VPID and vmcs12->vpid was changed since the last L1->L2 VMEntry. However, the current code invalidates the wrong mappings, as it calls __vmx_flush_tlb() with the invalidate_gpa parameter set to true, which results in invalidating combined and guest-physical mappings tagged with the active EPTP, which is EPTP01. Similarly, INVVPID emulation has the exact same issue. Fix both issues by just setting the invalidate_gpa parameter to false, which results in invalidating linear and combined mappings tagged with the given VPID02 as required. Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
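Sketch of the corrected call: with invalidate_gpa == false, __vmx_flush_tlb() issues INVVPID tagged by the given vpid rather than INVEPT on the active EPTP:

    __vmx_flush_tlb(vcpu, vmx->nested.vpid02, false);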
2018-10-13  KVM: nVMX: Use correct VPID02 when emulating L1 INVVPID  (Liran Alon)
In case L0 didn't allocate vmx->nested.vpid02 for L2, vmcs02->vpid is set to vmx->vpid. Consider this case when emulating L1 INVVPID in L0. Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
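Sketch of the vpid selection, mirroring what vmcs02 actually uses (field names as in vmx.c of this era):

    u16 vpid02 = vmx->nested.vpid02 ? vmx->nested.vpid02 : vmx->vpid;
    __vmx_flush_tlb(vcpu, vpid02, false);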
2018-10-13  KVM: nVMX: Flush TLB entries tagged by dest EPTP on L1<->L2 transitions  (Liran Alon)
If L1 and L2 share a VPID (because L1 doesn't use VPID or we haven't allocated a vpid02), we need to flush the TLB on L1<->L2 transitions. Before this patch, this TLB flushing was done by vmx_flush_tlb(). If L0 uses EPT, this translates into INVEPT(active_eptp). However, if L1 uses EPT, then on an L1->L2 VMEntry the active EPTP is EPTP01 but the TLB entries populated by L2 are tagged with EPTP02. Therefore we should delay vmx_flush_tlb() until the active EPTP is EPTP02. To achieve this, instead of calling vmx_flush_tlb() directly, we request it via KVM_REQ_TLB_FLUSH, which is evaluated after KVM_REQ_LOAD_CR3 has set the active EPTP to EPTP02 as required. Similarly, on an L2->L1 VMExit, the active EPTP is EPTP02 but the TLB entries populated by L1 are tagged with EPTP01, so we should likewise delay vmx_flush_tlb() until the active EPTP is EPTP01. Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
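Sketch of the deferral: instead of flushing directly, queue a request that is serviced only after KVM_REQ_LOAD_CR3 has installed the destination EPTP:

    kvm_make_request(KVM_REQ_LOAD_CR3, vcpu);
    kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);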
2018-10-13  KVM: vmx: rename KVM_GUEST_CR0_MASK to KVM_VM_CR0_ALWAYS_OFF  (Sean Christopherson)
The KVM_GUEST_CR0_MASK macro tracks CR0 bits that are forced to zero by the VMX architecture, i.e. CR0.{NW,CD} must always be zero in the hardware CR0 post-VMXON. Rename the macro to clarify its purpose, be consistent with KVM_VM_CR0_ALWAYS_ON and avoid confusion with the CR0_GUEST_HOST_MASK field. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-13  KVM: vmx: hyper-v: don't pass EPT configuration info to vmx_hv_remote_flush_tlb()  (Vitaly Kuznetsov)
I'm observing random crashes in multi-vCPU L2 guests running on KVM on Hyper-V. I bisected the issue to the commit 877ad952be3d ("KVM: vmx: Add tlb_remote_flush callback support"). Hyper-V TLFS states: "AddressSpace specifies an address space ID (an EPT PML4 table pointer)". So apparently, Hyper-V doesn't expect us to pass a naked EPTP; only the PML4 pointer should be used. Strip off the EPT configuration information before calling into vmx_hv_remote_flush_tlb(). Fixes: 877ad952be3d ("KVM: vmx: Add tlb_remote_flush callback support") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
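Sketch of the fix (names assumed from the 877ad952be3d plumbing): mask off the low EPTP bits, which carry memtype and page-walk configuration, before handing the pointer to the existing Hyper-V helper:

    ret = hyperv_flush_guest_mapping(vmx->ept_pointer & PAGE_MASK);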
2018-10-04  kvm: nVMX: fix entry with pending interrupt if APICv is enabled  (Paolo Bonzini)
Commit b5861e5cf2fcf83031ea3e26b0a69d887adf7d21 introduced a check on the interrupt-window and NMI-window CPU execution controls in order to inject an external interrupt vmexit before the first guest instruction executes. However, when APIC virtualization is enabled the host does not need a vmexit in order to inject an interrupt at the next interrupt window; instead, it just places the interrupt vector in RVI and the processor will inject it as soon as possible. Therefore, on machines with APICv it is not enough to check the CPU execution controls: the same scenario can also happen if RVI>vPPR. Fixes: b5861e5cf2fcf83031ea3e26b0a69d887adf7d21 Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
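The added check, in sketch form (helper names assumed from vmx.c of this period): a virtual interrupt is deliverable when the high nibble of RVI exceeds the high nibble of the virtual PPR.

    static bool vmx_has_apicv_interrupt(struct kvm_vcpu *vcpu)
    {
        u8 rvi = vmx_get_rvi();
        u8 vppr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_PROCPRI);

        return ((rvi & 0xf0) > (vppr & 0xf0));
    }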
2018-10-04  KVM: VMX: hide flexpriority from guest when disabled at the module level  (Paolo Bonzini)
As of commit 8d860bbeedef ("kvm: vmx: Basic APIC virtualization controls have three settings"), KVM will disable VIRTUALIZE_APIC_ACCESSES when a nested guest writes APIC_BASE MSR and kvm-intel.flexpriority=0, whereas previously KVM would allow a nested guest to enable VIRTUALIZE_APIC_ACCESSES so long as it's supported in hardware. That is, KVM now advertises VIRTUALIZE_APIC_ACCESSES to a guest but doesn't (always) allow setting it when kvm-intel.flexpriority=0, and may even initially allow the control and then clear it when the nested guest writes APIC_BASE MSR, which is decidedly odd even if it doesn't cause functional issues. Hide the control completely when the module parameter is cleared. Reported-by: Sean Christopherson <sean.j.christopherson@intel.com> Fixes: 8d860bbeedef ("kvm: vmx: Basic APIC virtualization controls have three settings") Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-04  KVM: VMX: check for existence of secondary exec controls before accessing  (Sean Christopherson)
Return early from vmx_set_virtual_apic_mode() if the processor doesn't support VIRTUALIZE_APIC_ACCESSES or VIRTUALIZE_X2APIC_MODE, both of which reside in SECONDARY_VM_EXEC_CONTROL. This eliminates warnings due to VMWRITEs to SECONDARY_VM_EXEC_CONTROL (VMCS field 401e) failing on processors without secondary exec controls. Remove the similar check for TPR shadowing as it is incorporated in the flexpriority_enabled check and the APIC-related code in vmx_update_msr_bitmap() is further gated by VIRTUALIZE_X2APIC_MODE. Reported-by: Gerhard Wiesinger <redhat@wiesinger.com> Fixes: 8d860bbeedef ("kvm: vmx: Basic APIC virtualization controls have three settings") Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
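The early-out amounts to (sketch, using the existing capability helper):

    if (!cpu_has_secondary_exec_ctrls())
        return;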
2018-10-01  KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS  (Liran Alon)
L2 IA32_BNDCFGS should be updated with vmcs12->guest_bndcfgs only when VM_ENTRY_LOAD_BNDCFGS is specified in vmcs12->vm_entry_controls. Otherwise, L2 IA32_BNDCFGS should be set to vmcs01->guest_bndcfgs which is L1 IA32_BNDCFGS. Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
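Sketch of the corrected vmcs02 setup; vmcs01_guest_bndcfgs stands for a saved copy of L1's GUEST_BNDCFGS and is this patch's addition, so treat the exact names as assumptions:

    if (kvm_mpx_supported()) {
        if (vmx->nested.nested_run_pending &&
            (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
            vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
        else
            vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
    }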
2018-10-01  KVM: x86: Do not use kvm_x86_ops->mpx_supported() directly  (Liran Alon)
Commit a87036add092 ("KVM: x86: disable MPX if host did not enable MPX XSAVE features") introduced kvm_mpx_supported() to return true iff MPX is enabled in the host. However, that commit missed replacing some calls to kvm_x86_ops->mpx_supported() with kvm_mpx_supported(). Complete the original commit by replacing the remaining calls. Fixes: a87036add092 ("KVM: x86: disable MPX if host did not enable MPX XSAVE features") Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01  KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled  (Liran Alon)
Before this commit, KVM exposed MPX VMX controls to an L1 guest based only on whether KVM and the host processor support MPX virtualization. However, these controls should be exposed to the guest only if the guest vCPU supports MPX. Without this change, an L1 guest running a kernel without commit 691bd4340bef ("kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS") asserts in QEMU on the following:
    qemu-kvm: error: failed to set MSR 0xd90 to 0x0
    qemu-kvm: .../qemu-2.10.0/target/i386/kvm.c:1801 kvm_put_msrs: Assertion 'ret == cpu->kvm_msr_buf->nmsrs' failed
This is because L1 KVM's kvm_init_msr_list() will see that vmx_mpx_supported() returns true (as it only checks MPX VMX controls support), and therefore the KVM_GET_MSR_INDEX_LIST ioctl will include MSR_IA32_BNDCFGS. However, when L1 later attempts to set this MSR via the KVM_SET_MSRS ioctl, it will fail because !guest_cpuid_has_mpx(vcpu). Therefore, fix the issue by exposing MPX VMX controls to the L1 guest only when the vCPU supports MPX. Fixes: 36be0b9deb23 ("KVM: x86: Add nested virtualization support for MPX") Reported-by: Eyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-24  KVM: x86: never trap MSR_KERNEL_GS_BASE  (Paolo Bonzini)
KVM has an old optimization whereby accesses to the kernel GS base MSR are trapped when the guest is in 32-bit and not when it is in 64-bit mode. The idea is that swapgs is not available in 32-bit mode, thus the guest has no reason to access the MSR unless in 64-bit mode and 32-bit applications need not pay the price of switching the kernel GS base between the host and the guest values. However, this optimization adds complexity to the code for little benefit (these days most guests are going to be 64-bit anyway) and in fact broke after commit 678e315e78a7 ("KVM: vmx: add dedicated utility to access guest's kernel_gs_base", 2018-08-06); the guest kernel GS base can be corrupted across SMIs and UEFI Secure Boot is therefore broken (a secure boot Linux guest, for example, fails to reach the login prompt about half the time). This patch just removes the optimization; the kernel GS base MSR is now never trapped by KVM, similarly to the FS and GS base MSRs. Fixes: 678e315e78a780dbef384b92339c8414309dbc11 Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
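The end state is simply pass-through in both directions (sketch, using the msr-bitmap helper already used for the FS/GS base MSRs):

    vmx_disable_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);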
2018-09-20  nVMX x86: Check VPID value on vmentry of L2 guests  (Krish Sadhukhan)
According to section "Checks on VMX Controls" in Intel SDM vol 3C, the following check needs to be enforced on vmentry of L2 guests: If the 'enable VPID' VM-execution control is 1, the value of the VPID VM-execution control field must not be 0000H. Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
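Sketch of the check added to the vmentry prereqs (helper and field names as in vmx.c of this era):

    if (nested_cpu_has_vpid(vmcs12) && vmcs12->virtual_processor_id == 0)
        return VMXERR_ENTRY_INVALID_CONTROL_FIELD;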
2018-09-20  nVMX x86: check posted-interrupt descriptor address on vmentry of L2  (Krish Sadhukhan)
According to section "Checks on VMX Controls" in Intel SDM vol 3C, the following checks need to be enforced on vmentry of L2 guests:
 - Bits 5:0 of the posted-interrupt descriptor address are all 0.
 - The posted-interrupt descriptor address does not set any bits beyond the processor's physical-address width.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Mark Kanda <mark.kanda@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
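Sketch of the added conditions (0x3f covers bits 5:0; cpuid_maxphyaddr() is KVM's existing physical-address-width helper):

    if (nested_cpu_has_posted_intr(vmcs12) &&
        ((vmcs12->posted_intr_desc_addr & 0x3f) ||
         (vmcs12->posted_intr_desc_addr >> cpuid_maxphyaddr(vcpu))))
        return -EINVAL;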
2018-09-20  KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv  (Liran Alon)
If L1 does not intercept L2 HLT, or enters L2 in the HLT activity-state, it is possible for a vCPU to be blocked while it is in guest-mode. According to Intel SDM 26.6.5 Interrupt-Window Exiting and Virtual-Interrupt Delivery: "These events wake the logical processor if it just entered the HLT state because of a VM entry". Therefore, if L1 enters L2 in the HLT activity-state and L2 has a pending deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU should be woken from the HLT state and injected with the interrupt. In addition, if while the vCPU is blocked (while it is in guest-mode), it receives a nested posted-interrupt, then the vCPU should also be woken and injected with the posted interrupt. To handle these cases, this patch enhances kvm_vcpu_has_events() to also check if there is a pending interrupt in the L2 virtual APICv provided by L1. That is, it evaluates if there is a pending virtual interrupt for L2 by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1 Evaluation of Pending Interrupts. Note that this also handles the case of a nested posted-interrupt, by the fact that RVI is updated in vmx_complete_nested_posted_interrupt(), which is called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() -> kvm_vcpu_running() -> vmx_check_nested_events() -> vmx_complete_nested_posted_interrupt(). Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20  KVM: VMX: check nested state and CR4.VMXE against SMM  (Paolo Bonzini)
VMX cannot be enabled under SMM; check this when CR4 is set and when nested virtualization state is restored. This should fix some WARNs reported by syzkaller, mostly around alloc_shadow_vmcs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20  KVM: VMX: use preemption timer to force immediate VMExit  (Sean Christopherson)
A VMX preemption timer value of '0' is guaranteed to cause a VMExit prior to the CPU executing any instructions in the guest. Use the preemption timer (if it's supported) to trigger immediate VMExit in place of the current method of sending a self-IPI. This ensures that pending VMExit injection to L1 occurs prior to executing any instructions in the guest (regardless of nesting level). When deferring VMExit injection, KVM generates an immediate VMExit from the (possibly nested) guest by sending itself an IPI. Because hardware interrupts are blocked prior to VMEnter and are unblocked (in hardware) after VMEnter, this results in taking a VMExit(INTR) before any guest instruction is executed. But, as this approach relies on the IPI being received before VMEnter executes, it only works as intended when KVM is running as L0. Because there are no architectural guarantees regarding when IPIs are delivered, when running nested the INTR may "arrive" long after L2 is running e.g. L0 KVM doesn't force an immediate switch to L1 to deliver an INTR. For the most part, this unintended delay is not an issue since the events being injected to L1 also do not have architectural guarantees regarding their timing. The notable exception is the VMX preemption timer[1], which is architecturally guaranteed to cause a VMExit prior to executing any instructions in the guest if the timer value is '0' at VMEnter. Specifically, the delay in injecting the VMExit causes the preemption timer KVM unit test to fail when run in a nested guest. Note: this approach is viable even on CPUs with a broken preemption timer, as broken in this context only means the timer counts at the wrong rate. There are no known errata affecting timer value of '0'. [1] I/O SMIs also have guarantees on when they arrive, but I have no idea if/how those are emulated in KVM. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> [Use a hook for SVM instead of leaving the default in x86.c - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
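The core of the mechanism, in sketch form (req_immediate_exit is this series' plumbing, and PIN_BASED_VMX_PREEMPTION_TIMER must also be armed; see the following patch): a zero-valued timer guarantees a VMExit before any guest instruction retires.

    if (vmx->req_immediate_exit)
        vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);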
2018-09-20  KVM: VMX: modify preemption timer bit only when arming timer  (Sean Christopherson)
Provide a singular location where the VMX preemption timer bit is set/cleared so that future usages of the preemption timer can ensure the VMCS bit is up-to-date without having to modify unrelated code paths. For example, the preemption timer can be used to force an immediate VMExit. Cache the status of the timer to avoid redundant VMREAD and VMWRITE, e.g. if the timer stays armed across multiple VMEnters/VMExits. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-20  KVM: VMX: immediately mark preemption timer expired only for zero value  (Sean Christopherson)
A VMX preemption timer value of '0' at the time of VMEnter is architecturally guaranteed to cause a VMExit prior to the CPU executing any instructions in the guest. This architectural definition is in place to ensure that a previously expired timer is correctly recognized by the CPU, as it is possible for the timer to reach zero and not trigger a VMExit due to a higher priority VMExit being signalled instead, e.g. a pending #DB that morphs into a VMExit. Whether by design or coincidence, commit f4124500c2c1 ("KVM: nVMX: Fully emulate preemption timer") special cased timer values of '0' and '1' to ensure prompt delivery of the VMExit. Unlike '0', a timer value of '1' has no architectural guarantees regarding when it is delivered. Modify the timer emulation to trigger immediate VMExit if and only if the timer value is '0', and document precisely why '0' is special. Do this even if calibration of the virtual TSC failed, i.e. VMExit will occur immediately regardless of the frequency of the timer. Making only '0' a special case gives KVM leeway to be more aggressive in ensuring the VMExit is injected prior to executing instructions in the nested guest, and also eliminates any ambiguity as to why '1' is a special case, e.g. why wasn't the threshold for a "short timeout" set to 10, 100, 1000, etc... Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-09-07  KVM: nVMX: Fix loss of pending IRQ/NMI before entering L2  (Liran Alon)
Consider the case where L1 had an IRQ/NMI event pending until it executed VMLAUNCH/VMRESUME, but the event wasn't delivered because it was disallowed (e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME, L0 needs to evaluate whether this pending event should cause an exit from L2 to L1 or be delivered directly to L2 (e.g. in case L1 doesn't intercept EXTERNAL_INTERRUPT). Usually this would be handled by L0 requesting an IRQ/NMI window by setting the VMCS accordingly. However, this setting was done on VMCS01 and now VMCS02 is active instead. Thus, when L1 executes VMLAUNCH/VMRESUME we force L0 to perform pending event evaluation by requesting KVM_REQ_EVENT, as sketched below. Note that the above scenario occurs when L1 KVM is about to enter L2 but requests an "immediate-exit", as in this case L1 will disable interrupts and then send a self-IPI before entering L2. Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
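Sketch of the fix in the nested-run path: after switching to vmcs02, re-evaluate pending events so an IRQ/NMI that was blocked while vmcs01 was active isn't lost.

    kvm_make_request(KVM_REQ_EVENT, vcpu);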
2018-08-30  KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction()  (Sean Christopherson)
Lack of the kvm_ prefix gives the impression that it's a VMX or SVM specific function, and there's no conflict that prevents adding the kvm_ prefix. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-30  KVM: VMX: Do not allow reexecute_instruction() when skipping MMIO instr  (Sean Christopherson)
Re-execution after an emulation decode failure is only intended to handle a case where two or more vCPUs race to write a shadowed page, i.e. we should never re-execute an instruction as part of MMIO emulation. As handle_ept_misconfig() is only used for MMIO emulation, it should pass EMULTYPE_NO_REEXECUTE when using the emulator to skip an instr in the fast-MMIO case where VM_EXIT_INSTRUCTION_LEN is invalid. And because the cr2 value passed to x86_emulate_instruction() is only destined for use when retrying or reexecuting, we can simply call emulate_instruction(). Fixes: d391f1207067 ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested") Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-30  KVM: nVMX: avoid redundant double assignment of nested_run_pending  (Vitaly Kuznetsov)
nested_run_pending is set 20 lines above and check_vmentry_prereqs()/check_vmentry_postreqs() don't seem to be resetting it (the latter, however, checks it). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Eduardo Valentin <eduval@amazon.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-08-26  Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 fixes from Thomas Gleixner:
 - Correct the L1TF fallout on 32bit and the off by one in the 'too much RAM for protection' calculation.
 - Add a helpful kernel message for the 'too much RAM' case
 - Unbreak the VDSO in case the compiler decides to use indirect jumps/calls and emits retpolines which cannot be resolved because the kernel uses its own thunks, which does not work for the VDSO. Make it use the builtin thunks.
 - Re-export start_thread() which was unexported when the 32/64bit implementation was unified. start_thread() is required by modular binfmt handlers.
 - Trivial cleanups

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation/l1tf: Suggest what to do on systems with too much RAM
  x86/speculation/l1tf: Fix off-by-one error when warning that system has too much RAM
  x86/kvm/vmx: Remove duplicate l1d flush definitions
  x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit
  x86/process: Re-export start_thread()
  x86/mce: Add notifier_block forward declaration
  x86/vdso: Fix vDSO build if a retpoline is emitted
2018-08-22  KVM: VMX: fixes for vmentry_l1d_flush module parameter  (Paolo Bonzini)
Two bug fixes:
 1) missing entries in the l1d_param array; this can cause a host crash if an access attempts to reach the missing entry. Future-proof the get function against any overflows as well. However, the two entries VMENTER_L1D_FLUSH_EPT_DISABLED and VMENTER_L1D_FLUSH_NOT_REQUIRED must not be accepted by the parse function, so disable them there.
 2) invalid values must be rejected even if the CPU does not have the bug, so test for them before checking boot_cpu_has(X86_BUG_L1TF)
... and a small refactoring, since the .cmd field is redundant with the index in the array.
Reported-by: Bandan Das <bsd@redhat.com> Cc: stable@vger.kernel.org Fixes: a7b9020b06ec6d7c3f3b0d4ef1a9eba12654f4f7 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-22  KVM: vmx: Inject #UD for SGX ENCLS instruction in guest  (Sean Christopherson)
Virtualization of Intel SGX depends on Enclave Page Cache (EPC) management that is not yet available in the kernel, i.e. KVM support for exposing SGX to a guest cannot be added until basic support for SGX is upstreamed, which is a WIP[1]. Until SGX is properly supported in KVM, ensure a guest sees expected behavior for ENCLS, i.e. all ENCLS #UD. Because SGX does not have a true software enable bit, e.g. there is no CR4.SGXE bit, the ENCLS instruction can be executed[2] by the guest if SGX is supported by the system. Intercept all ENCLS leafs (via the ENCLS-exiting control and field) and unconditionally inject #UD.
[1] https://www.spinics.net/lists/kvm/msg171333.html or https://lkml.org/lkml/2018/7/3/879
[2] A guest can execute ENCLS in the sense that ENCLS will not take an immediate #UD, but no ENCLS will ever succeed in a guest without explicit support from KVM (map EPC memory into the guest), unless KVM has a *very* egregious bug, e.g. accidentally mapped EPC memory into the guest SPTEs. In other words this patch is needed only to prevent the guest from seeing inconsistent behavior, e.g. #GP (SGX not enabled in Feature Control MSR) or #PF (leaf operand(s) does not point at EPC memory) instead of #UD on ENCLS. Intercepting ENCLS is not required to prevent the guest from truly utilizing SGX.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20180814163334.25724-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
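Sketch of the intercept and handler (ENCLS_EXITING_BITMAP is the VMCS field this series wires up; the handler name is as described above):

    vmcs_write64(ENCLS_EXITING_BITMAP, -1ull);

    static int handle_encls(struct kvm_vcpu *vcpu)
    {
        /* SGX is not exposed to the guest; all ENCLS leafs #UD. */
        kvm_queue_exception(vcpu, UD_VECTOR);
        return 1;
    }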
2018-08-22  x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush()  (Yi Wang)
Substitute spaces with tabs. No functional changes. Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Message-Id: <1534398159-48509-1-git-send-email-wang.yi59@zte.com.cn> Cc: stable@vger.kernel.org # L1TF Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-20  x86/kvm/vmx: Remove duplicate l1d flush definitions  (Josh Poimboeuf)
These are already defined higher up in the file. Fixes: 7db92e165ac8 ("x86/kvm: Move l1tf setup function") Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/d7ca03ae210d07173452aeed85ffe344301219a5.1534253536.git.jpoimboe@redhat.com
2018-08-19  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds)
Pull first set of KVM updates from Paolo Bonzini:
 "PPC:
   - minor code cleanups

  x86:
   - PCID emulation and CR3 caching for shadow page tables
   - nested VMX live migration
   - nested VMCS shadowing
   - optimized IPI hypercall
   - some optimizations

  ARM will come next week"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits)
  kvm: x86: Set highest physical address bits in non-present/reserved SPTEs
  KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c
  KVM: X86: Implement PV IPIs in linux guest
  KVM: X86: Add kvm hypervisor init time platform setup callback
  KVM: X86: Implement "send IPI" hypercall
  KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()
  KVM: x86: Skip pae_root shadow allocation if tdp enabled
  KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
  KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible
  KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible
  KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup
  KVM: vmx: move struct host_state usage to struct loaded_vmcs
  KVM: vmx: compute need to reload FS/GS/LDT on demand
  KVM: nVMX: remove a misleading comment regarding vmcs02 fields
  KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state()
  KVM: vmx: add dedicated utility to access guest's kernel_gs_base
  KVM: vmx: track host_state.loaded using a loaded_vmcs pointer
  KVM: vmx: refactor segmentation code in vmx_save_host_state()
  kvm: nVMX: Fix fault priority for VMX operations
  kvm: nVMX: Fix fault vector for VMX operation at CPL > 0
  ...
2018-08-14  Merge branch 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Merge L1 Terminal Fault fixes from Thomas Gleixner:
 "L1TF, aka L1 Terminal Fault, is yet another speculative hardware engineering trainwreck. It's a hardware vulnerability which allows unprivileged speculative access to data which is available in the Level 1 Data Cache when the page table entry controlling the virtual address, which is used for the access, has the Present bit cleared or other reserved bits set.
  If an instruction accesses a virtual address for which the relevant page table entry (PTE) has the Present bit cleared or other reserved bits set, then speculative execution ignores the invalid PTE and loads the referenced data if it is present in the Level 1 Data Cache, as if the page referenced by the address bits in the PTE was still present and accessible.
  While this is a purely speculative mechanism and the instruction will raise a page fault when it is retired eventually, the pure act of loading the data and making it available to other speculative instructions opens up the opportunity for side channel attacks to unprivileged malicious code, similar to the Meltdown attack.
  While Meltdown breaks the user space to kernel space protection, L1TF allows to attack any physical memory address in the system and the attack works across all protection domains. It allows an attack of SGX and also works from inside virtual machines because the speculation bypasses the extended page table (EPT) protection mechanism.
  The associated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646
  The mitigations provided by this pull request include:
   - Host side protection by inverting the upper address bits of a non present page table entry so the entry points to uncacheable memory.
   - Hypervisor protection by flushing L1 Data Cache on VMENTER.
   - SMT (HyperThreading) control knobs, which allow to 'turn off' SMT by offlining the sibling CPU threads. The knobs are available on the kernel command line and at runtime via sysfs
   - Control knobs for the hypervisor mitigation, related to L1D flush and SMT control. The knobs are available on the kernel command line and at runtime via sysfs
   - Extensive documentation about L1TF including various degrees of mitigations.
  Thanks to all people who have contributed to this in various ways - patches, review, testing, backporting - and the fruitful, sometimes heated, but at the end constructive discussions.
  There is work in progress to provide other forms of mitigations, which might be less horrible performance wise for a particular kind of workloads, but this is not yet ready for consumption due to their complexity and limitations"

* 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  x86/microcode: Allow late microcode loading with SMT disabled
  tools headers: Synchronise x86 cpufeatures.h for L1TF additions
  x86/mm/kmmio: Make the tracer robust against L1TF
  x86/mm/pat: Make set_memory_np() L1TF safe
  x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
  x86/speculation/l1tf: Invert all not present mappings
  cpu/hotplug: Fix SMT supported evaluation
  KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
  x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
  x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
  Documentation/l1tf: Remove Yonah processors from not vulnerable list
  x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
  x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
  x86: Don't include linux/irq.h from asm/hardirq.h
  x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
  x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
  x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
  x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
  x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
  cpu/hotplug: detect SMT disabled by BIOS
  ...
2018-08-06  KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c  (Uros Bizjak)
Remove open-coded uses of set instructions to use CC_SET()/CC_OUT() in arch/x86/kvm/vmx.c. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> [Mark error paths as unlikely while touching this. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06  KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible  (Sean Christopherson)
The host's FS.base and GS.base rarely change, e.g. ~0.1% of host/guest swaps on my system. Cache the last value written to the VMCS and skip the VMWRITE to the associated VMCS fields when loading host state if the value hasn't changed since the last VMWRITE. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
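Sketch of the caching pattern, assuming the per-VMCS host_state cache introduced later in this series:

    if (unlikely(fs_base != host_state->fs_base)) {
        vmcs_writel(HOST_FS_BASE, fs_base);
        host_state->fs_base = fs_base;  /* remember the last value written */
    }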
2018-08-06  KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible  (Sean Christopherson)
On a 64-bit host, FS.sel and GS.sel are all but guaranteed to be 0, which in turn means they'll rarely change. Skip the VMWRITE for the associated VMCS fields when loading host state if the selector hasn't changed since the last VMWRITE. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06  KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup  (Sean Christopherson)
The HOST_{FS,GS}_BASE fields are guaranteed to be written prior to VMENTER, by way of vmx_prepare_switch_to_guest(). Initialize the fields to zero for 64-bit kernels instead of pulling the base values from their respective MSRs. In addition to eliminating two RDMSRs, vmx_prepare_switch_to_guest() can safely assume the initial value of the fields is zero in all cases. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06  KVM: vmx: move struct host_state usage to struct loaded_vmcs  (Sean Christopherson)
Make host_state a property of a loaded_vmcs so that it can be used as a cache of the VMCS fields, e.g. to lazily VMWRITE the corresponding VMCS field. Treating host_state as a cache does not work if it's not VMCS specific as the cache would become incoherent when switching between vmcs01 and vmcs02. Move vmcs_host_cr3 and vmcs_host_cr4 into host_state. Explicitly zero out host_state when allocating a new VMCS for a loaded_vmcs. Unlike the pre-existing vmcs_host_cr{3,4} usage, the segment information is not guaranteed to be (re)initialized when running a new nested VMCS, e.g. HOST_FS_BASE is not written in vmx_set_constant_host_state(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>