summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/x86.c
AgeCommit message (Collapse)Author
2018-08-06KVM: x86: do not load vmcs12 pages while still in SMMPaolo Bonzini
If the vCPU enters system management mode while running a nested guest, RSM starts processing the vmentry while still in SMM. In that case, however, the pages pointed to by the vmcs12 might be incorrectly loaded from SMRAM. To avoid this, delay the handling of the pages until just before the next vmentry. This is done with a new request and a new entry in kvm_x86_ops, which we will be able to reuse for nested VMX state migration. Extracted from a patch by Jim Mattson and KarimAllah Ahmed. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06KVM: x86: ensure all MSRs can always be KVM_GET/SET_MSR'dPaolo Bonzini
Some of the MSRs returned by GET_MSR_INDEX_LIST currently cannot be sent back to KVM_GET_MSR and/or KVM_SET_MSR; either they can never be sent back, or you they are only accepted under special conditions. This makes the API a pain to use. To avoid this pain, this patch makes it so that the result of the get-list ioctl can always be used for host-initiated get and set. Since we don't have a separate way to check for read-only MSRs, this means some Hyper-V MSRs are ignored when written. Arguably they should not even be in the result of GET_MSR_INDEX_LIST, but I am leaving there in case userspace is using the outcome of GET_MSR_INDEX_LIST to derive the support for the corresponding Hyper-V feature. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-05KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentryPaolo Bonzini
When nested virtualization is in use, VMENTER operations from the nested hypervisor into the nested guest will always be processed by the bare metal hypervisor, and KVM's "conditional cache flushes" mode in particular does a flush on nested vmentry. Therefore, include the "skip L1D flush on vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting. Add the relevant Documentation. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-05Merge 4.18-rc7 into master to pick up the KVM dependcyThomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-18Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "Miscellaneous bugfixes, plus a small patchlet related to Spectre v2" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvmclock: fix TSC calibration for nested guests KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel. x86/kvmclock: set pvti_cpu0_va after enabling kvmclock x86/kvm/Kconfig: Ensure CRYPTO_DEV_CCP_DD state at minimum matches KVM_AMD kvm: nVMX: Restore exit qual for VM-entry failure due to MSR loading x86/kvm/vmx: don't read current->thread.{fs,gs}base of legacy tasks KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR
2018-07-15KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSRPaolo Bonzini
This lets userspace read the MSR_IA32_ARCH_CAPABILITIES and check that all requested features are available on the host. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-07-04x86/KVM/VMX: Add L1D flush logicPaolo Bonzini
Add the logic for flushing L1D on VMENTER. The flush depends on the static key being enabled and the new l1tf_flush_l1d flag being set. The flags is set: - Always, if the flush module parameter is 'always' - Conditionally at: - Entry to vcpu_run(), i.e. after executing user space - From the sched_in notifier, i.e. when switching to a vCPU thread. - From vmexit handlers which are considered unsafe, i.e. where sensitive data can be brought into L1D: - The emulator, which could be a good target for other speculative execution-based threats, - The MMU, which can bring host page tables in the L1 cache. - External interrupts - Nested operations that require the MMU (see above). That is vmptrld, vmptrst, vmclear,vmwrite,vmread. - When handling invept,invvpid [ tglx: Split out from combo patch and reduced to a single flag ] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-06-14KVM: x86: fix typo at kvm_arch_hardware_setup commentMarcelo Tosatti
Fix typo in sentence about min value calculation. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-06-12Merge tag 'overflow-v4.18-rc1-part2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull more overflow updates from Kees Cook: "The rest of the overflow changes for v4.18-rc1. This includes the explicit overflow fixes from Silvio, further struct_size() conversions from Matthew, and a bug fix from Dan. But the bulk of it is the treewide conversions to use either the 2-factor argument allocators (e.g. kmalloc(a * b, ...) into kmalloc_array(a, b, ...) or the array_size() macros (e.g. vmalloc(a * b) into vmalloc(array_size(a, b)). Coccinelle was fighting me on several fronts, so I've done a bunch of manual whitespace updates in the patches as well. Summary: - Error path bug fix for overflow tests (Dan) - Additional struct_size() conversions (Matthew, Kees) - Explicitly reported overflow fixes (Silvio, Kees) - Add missing kvcalloc() function (Kees) - Treewide conversions of allocators to use either 2-factor argument variant when available, or array_size() and array3_size() as needed (Kees)" * tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits) treewide: Use array_size in f2fs_kvzalloc() treewide: Use array_size() in f2fs_kzalloc() treewide: Use array_size() in f2fs_kmalloc() treewide: Use array_size() in sock_kmalloc() treewide: Use array_size() in kvzalloc_node() treewide: Use array_size() in vzalloc_node() treewide: Use array_size() in vzalloc() treewide: Use array_size() in vmalloc() treewide: devm_kzalloc() -> devm_kcalloc() treewide: devm_kmalloc() -> devm_kmalloc_array() treewide: kvzalloc() -> kvcalloc() treewide: kvmalloc() -> kvmalloc_array() treewide: kzalloc_node() -> kcalloc_node() treewide: kzalloc() -> kcalloc() treewide: kmalloc() -> kmalloc_array() mm: Introduce kvcalloc() video: uvesafb: Fix integer overflow in allocation UBIFS: Fix potential integer overflow in allocation leds: Use struct_size() in allocation Convert intel uncore to struct_size ...
2018-06-12treewide: kvzalloc() -> kvcalloc()Kees Cook
The kvzalloc() function has a 2-factor argument form, kvcalloc(). This patch replaces cases of: kvzalloc(a * b, gfp) with: kvcalloc(a * b, gfp) as well as handling cases of: kvzalloc(a * b * c, gfp) with: kvzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kvcalloc(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kvzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kvzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kvzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kvzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kvzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kvzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kvzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kvzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kvzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kvzalloc( - sizeof(char) * COUNT + COUNT , ...) | kvzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kvzalloc + kvcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kvzalloc + kvcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kvzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kvzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kvzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kvzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kvzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kvzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kvzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kvzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kvzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kvzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kvzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kvzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kvzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kvzalloc(C1 * C2 * C3, ...) | kvzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kvzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kvzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kvzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kvzalloc(sizeof(THING) * C2, ...) | kvzalloc(sizeof(TYPE) * C2, ...) | kvzalloc(C1 * C2 * C3, ...) | kvzalloc(C1 * C2, ...) | - kvzalloc + kvcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kvzalloc + kvcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kvzalloc + kvcalloc ( - (E1) * E2 + E1, E2 , ...) | - kvzalloc + kvcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kvzalloc + kvcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "Small update for KVM: ARM: - lazy context-switching of FPSIMD registers on arm64 - "split" regions for vGIC redistributor s390: - cleanups for nested - clock handling - crypto - storage keys - control register bits x86: - many bugfixes - implement more Hyper-V super powers - implement lapic_timer_advance_ns even when the LAPIC timer is emulated using the processor's VMX preemption timer. - two security-related bugfixes at the top of the branch" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (79 commits) kvm: fix typo in flag name kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system KVM: x86: introduce linear_{read,write}_system kvm: nVMX: Enforce cpl=0 for VMX instructions kvm: nVMX: Add support for "VMWRITE to any supported field" kvm: nVMX: Restrict VMX capability MSR changes KVM: VMX: Optimize tscdeadline timer latency KVM: docs: nVMX: Remove known limitations as they do not exist now KVM: docs: mmu: KVM support exposing SLAT to guests kvm: no need to check return value of debugfs_create functions kvm: Make VM ioctl do valloc for some archs kvm: Change return type to vm_fault_t KVM: docs: mmu: Fix link to NPT presentation from KVM Forum 2008 kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capability KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} implementation KVM: introduce kvm_make_vcpus_request_mask() API KVM: x86: hyperv: do rep check for each hypercall separately ...
2018-06-12kvm: fix typo in flag nameMichael S. Tsirkin
KVM_X86_DISABLE_EXITS_HTL really refers to exit on halt. Obviously a typo: should be named KVM_X86_DISABLE_EXITS_HLT. Fixes: caa057a2cad ("KVM: X86: Provide a capability to disable HLT intercepts") Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-06-12kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor accessPaolo Bonzini
The functions that were used in the emulation of fxrstor, fxsave, sgdt and sidt were originally meant for task switching, and as such they did not check privilege levels. This is very bad when the same functions are used in the emulation of unprivileged instructions. This is CVE-2018-10853. The obvious fix is to add a new argument to ops->read_std and ops->write_std, which decides whether the access is a "system" access or should use the processor's CPL. Fixes: 129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-06-12KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_systemPaolo Bonzini
Int the next patch the emulator's .read_std and .write_std callbacks will grow another argument, which is not needed in kvm_read_guest_virt and kvm_write_guest_virt_system's callers. Since we have to make separate functions, let's give the currently existing names a nicer interface, too. Fixes: 129a72a0d3c8 ("KVM: x86: Introduce segmented_write_std", 2017-01-12) Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-06-04KVM: VMX: Optimize tscdeadline timer latencyWanpeng Li
'Commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer expiration")' advances the tscdeadline (the timer is emulated by hrtimer) expiration in order that the latency which is incurred by hypervisor (apic_timer_fn -> vmentry) can be avoided. This patch adds the advance tscdeadline expiration support to which the tscdeadline timer is emulated by VMX preemption timer to reduce the hypervisor lantency (handle_preemption_timer -> vmentry). The guest can also set an expiration that is very small (for example in Linux if an hrtimer feeds a expiration in the past); in that case we set delta_tsc to 0, leading to an immediately vmexit when delta_tsc is not bigger than advance ns. This patch can reduce ~63% latency (~4450 cycles to ~1660 cycles on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing busy waits. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-06-01kvm: Change return type to vm_fault_tSouptick Joarder
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit 1c8f422059ae ("mm: change return type to vm_fault_t") Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-05-26Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Radim Krčmář: "PPC: - Close a hole which could possibly lead to the host timebase getting out of sync. - Three fixes relating to PTEs and TLB entries for radix guests. - Fix a bug which could lead to an interrupt never getting delivered to the guest, if it is pending for a guest vCPU when the vCPU gets offlined. s390: - Fix false negatives in VSIE validity check (Cc stable) x86: - Fix time drift of VMX preemption timer when a guest uses LAPIC timer in periodic mode (Cc stable) - Unconditionally expose CPUID.IA32_ARCH_CAPABILITIES to allow migration from hosts that don't need retpoline mitigation (Cc stable) - Fix guest crashes on reboot by properly coupling CR4.OSXSAVE and CPUID.OSXSAVE (Cc stable) - Report correct RIP after Hyper-V hypercall #UD (introduced in -rc6)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: fix #UD address of failed Hyper-V hypercalls kvm: x86: IA32_ARCH_CAPABILITIES is always supported KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changed x86/kvm: fix LAPIC timer drift when guest uses periodic mode KVM: s390: vsie: fix < 8k check for the itdba KVM: PPC: Book 3S HV: Do ptesync in radix guest exit path KVM: PPC: Book3S HV: XIVE: Resend re-routed interrupts on CPU priority change KVM: PPC: Book3S HV: Make radix clear pte when unmapping KVM: PPC: Book3S HV: Make radix use correct tlbie sequence in kvmppc_radix_tlbie_page KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry
2018-05-26KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capabilityVitaly Kuznetsov
We need a new capability to indicate support for the newly added HvFlushVirtualAddress{List,Space}{,Ex} hypercalls. Upon seeing this capability, userspace is supposed to announce PV TLB flush features by setting the appropriate CPUID bits (if needed). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-05-26Merge branch 'x86/hyperv' of ↵Radim Krčmář
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip To resolve conflicts with the PV TLB flush series.
2018-05-25KVM: x86: fix #UD address of failed Hyper-V hypercallsRadim Krčmář
If the hypercall was called from userspace or real mode, KVM injects #UD and then advances RIP, so it looks like #UD was caused by the following instruction. This probably won't cause more than confusion, but could give an unexpected access to guest OS' instruction emulator. Also, refactor the code to count hv hypercalls that were handled by the virt userspace. Fixes: 6356ee0c9602 ("x86: Delay skip of emulated hypercall instruction") Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-05-24KVM: x86: Update cpuid properly when CR4.OSXAVE or CR4.PKE is changedWei Huang
The CPUID bits of OSXSAVE (function=0x1) and OSPKE (func=0x7, leaf=0x0) allows user apps to detect if OS has set CR4.OSXSAVE or CR4.PKE. KVM is supposed to update these CPUID bits when CR4 is updated. Current KVM code doesn't handle some special cases when updates come from emulator. Here is one example: Step 1: guest boots Step 2: guest OS enables XSAVE ==> CR4.OSXSAVE=1 and CPUID.OSXSAVE=1 Step 3: guest hot reboot ==> QEMU reset CR4 to 0, but CPUID.OSXAVE==1 Step 4: guest os checks CPUID.OSXAVE, detects 1, then executes xgetbv Step 4 above will cause an #UD and guest crash because guest OS hasn't turned on OSXAVE yet. This patch solves the problem by comparing the the old_cr4 with cr4. If the related bits have been changed, kvm_update_cpuid() needs to be called. Signed-off-by: Wei Huang <wei@redhat.com> Reviewed-by: Bandan Das <bsd@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-05-23KVM: x86: use timespec64 for KVM_HC_CLOCK_PAIRINGArnd Bergmann
The hypercall was added using a struct timespec based implementation, but we should not use timespec in new code. This changes it to timespec64. There is no functional change here since the implementation is only used in 64-bit kernels that use the same definition for timespec and timespec64. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-05-21Merge branch 'speck-v20' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Merge speculative store buffer bypass fixes from Thomas Gleixner: - rework of the SPEC_CTRL MSR management to accomodate the new fancy SSBD (Speculative Store Bypass Disable) bit handling. - the CPU bug and sysfs infrastructure for the exciting new Speculative Store Bypass 'feature'. - support for disabling SSB via LS_CFG MSR on AMD CPUs including Hyperthread synchronization on ZEN. - PRCTL support for dynamic runtime control of SSB - SECCOMP integration to automatically disable SSB for sandboxed processes with a filter flag for opt-out. - KVM integration to allow guests fiddling with SSBD including the new software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on AMD. - BPF protection against SSB .. this is just the core and x86 side, other architecture support will come separately. * 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits) bpf: Prevent memory disambiguation attack x86/bugs: Rename SSBD_NO to SSB_NO KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG x86/bugs: Rework spec_ctrl base and mask logic x86/bugs: Remove x86_spec_ctrl_set() x86/bugs: Expose x86_spec_ctrl_base directly x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host} x86/speculation: Rework speculative_store_bypass_update() x86/speculation: Add virtualized speculative store bypass disable support x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL x86/speculation: Handle HT correctly on AMD x86/cpufeatures: Add FEATURE_ZEN x86/cpufeatures: Disentangle SSBD enumeration x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP KVM: SVM: Move spec control call after restore of GS x86/cpu: Make alternative_msr_write work for 32-bit code x86/bugs: Fix the parameters alignment and missing void x86/bugs: Make cpu_show_common() static ...
2018-05-17KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBDTom Lendacky
Expose the new virtualized architectural mechanism, VIRT_SSBD, for using speculative store bypass disable (SSBD) under SVM. This will allow guests to use SSBD on hardware that uses non-architectural mechanisms for enabling SSBD. [ tglx: Folded the migration fixup from Paolo Bonzini ] Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-15KVM: X86: Lower the default timer frequency limit to 200usWanpeng Li
Anthoine reported: The period used by Windows change over time but it can be 1 milliseconds or less. I saw the limit_periodic_timer_frequency print so 500 microseconds is sometimes reached. As suggested by Paolo, lower the default timer frequency limit to a smaller interval of 200 us (5000 Hz) to leave some headroom. This is required due to Windows 10 changing the scheduler tick limit from 1024 Hz to 2048 Hz. Reported-by: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com> Cc: Darren Kenny <darren.kenny@oracle.com> Cc: Jan Kiszka <jan.kiszka@web.de> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-05-14kvm: vmx: Introduce lapic_mode enumerationJim Mattson
The local APIC can be in one of three modes: disabled, xAPIC or x2APIC. (A fourth mode, "invalid," is included for completeness.) Using the new enumeration can make some of the APIC mode logic easier to read. In kvm_set_apic_base, for instance, it is clear that one cannot transition directly from x2APIC mode to xAPIC mode or directly from APIC disabled to x2APIC mode. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> [Check invalid bits even if msr_info->host_initiated. Reported by Wanpeng Li. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-05-14KVM: X86: Fix reserved bits check for MOV to CR3Wanpeng Li
MSB of CR3 is a reserved bit if the PCIDE bit is not set in CR4. It should be checked when PCIDE bit is not set, however commit 'd1cd3ce900441 ("KVM: MMU: check guest CR3 reserved bits based on its physical address width")' removes the bit 63 checking unconditionally. This patch fixes it by checking bit 63 of CR3 when PCIDE bit is not set in CR4. Fixes: d1cd3ce900441 (KVM: MMU: check guest CR3 reserved bits based on its physical address width) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Reviewed-by: Junaid Shahid <junaids@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-05-11kvm: x86: Suppress CR3_PCID_INVD bit only when PCIDs are enabledJunaid Shahid
If the PCIDE bit is not set in CR4, then the MSb of CR3 is a reserved bit. If the guest tries to set it, that should cause a #GP fault. So mask out the bit only when the PCIDE bit is set. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-05-11x86: Delay skip of emulated hypercall instructionMarian Rotariu
The IP increment should be done after the hypercall emulation, after calling the various handlers. In this way, these handlers can accurately identify the the IP of the VMCALL if they need it. This patch keeps the same functionality for the Hyper-V handler which does not use the return code of the standard kvm_skip_emulated_instruction() call. Signed-off-by: Marian Rotariu <mrotariu@bitdefender.com> [Hyper-V hypercalls also need kvm_skip_emulated_instruction() - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-16kvm: x86: move MSR_IA32_TSC handling to x86.cPaolo Bonzini
This is not specific to Intel/AMD anymore. The TSC offset is available in vcpu->arch.tsc_offset. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-16X86/KVM: Properly update 'tsc_offset' to represent the running guestKarimAllah Ahmed
Update 'tsc_offset' on vmentry/vmexit of L2 guests to ensure that it always captures the TSC_OFFSET of the running guest whether it is the L1 or L2 guest. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> [AMD changes, fix update_ia32_tsc_adjust_msr. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-11X86/KVM: Do not allow DISABLE_EXITS_MWAIT when LAPIC ARAT is not availableKarimAllah Ahmed
If the processor does not have an "Always Running APIC Timer" (aka ARAT), we should not give guests direct access to MWAIT. The LAPIC timer would stop ticking in deep C-states, so any host deadlines would not wakeup the host kernel. The host kernel intel_idle driver handles this by switching to broadcast mode when ARAT is not available and MWAIT is issued with a deep C-state that would stop the LAPIC timer. When MWAIT is passed through, we can not tell when MWAIT is issued. So just disable this capability when LAPIC ARAT is not available. I am not even sure if there are any CPUs with VMX support but no LAPIC ARAT or not. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Reported-by: Wanpeng Li <kernellwp@gmail.com> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-04kvm: x86: fix a compile warningPeng Hao
fix a "warning: no previous prototype". Cc: stable@vger.kernel.org Signed-off-by: Peng Hao <peng.hao2@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-04KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"Wanpeng Li
There is no easy way to force KVM to run an instruction through the emulator (by design as that will expose the x86 emulator as a significant attack-surface). However, we do wish to expose the x86 emulator in case we are testing it (e.g. via kvm-unit-tests). Therefore, this patch adds a "force emulation prefix" that is designed to raise #UD which KVM will trap and it's #UD exit-handler will match "force emulation prefix" to run instruction after prefix by the x86 emulator. To not expose the x86 emulator by default, we add a module parameter that should be off by default. A simple testcase here: #include <stdio.h> #include <string.h> #define HYPERVISOR_INFO 0x40000000 #define CPUID(idx, eax, ebx, ecx, edx) \ asm volatile (\ "ud2a; .ascii \"kvm\"; cpuid" \ :"=b" (*ebx), "=a" (*eax), "=c" (*ecx), "=d" (*edx) \ :"0"(idx) ); void main() { unsigned int eax, ebx, ecx, edx; char string[13]; CPUID(HYPERVISOR_INFO, &eax, &ebx, &ecx, &edx); *(unsigned int *)(string + 0) = ebx; *(unsigned int *)(string + 4) = ecx; *(unsigned int *)(string + 8) = edx; string[12] = 0; if (strncmp(string, "KVMKVMKVM\0\0\0", 12) == 0) printf("kvm guest\n"); else printf("bare hardware\n"); } Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> [Correctly handle usermode exits. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-04KVM: X86: Introduce handle_ud()Wanpeng Li
Introduce handle_ud() to handle invalid opcode, this function will be used by later patches. Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim KrÄmář <rkrcmar@redhat.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-28KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event ↵Liran Alon
pending In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with IDT-vectoring-info which vmx_complete_interrupts() makes sure to reinject before next resume of L2. While handling the VMExit in L0, an IPI could be sent by another L1 vCPU to the L1 vCPU which currently runs L2 and exited to L0. When L0 will reach vcpu_enter_guest() and call inject_pending_event(), it will note that a previous event was re-injected to L2 (by IDT-vectoring-info) and therefore won't check if there are pending L1 events which require exit from L2 to L1. Thus, L0 enters L2 without immediate VMExit even though there are pending L1 events! This commit fixes the issue by making sure to check for L1 pending events even if a previous event was reinjected to L2 and bailing out from inject_pending_event() before evaluating a new pending event in case an event was already reinjected. The bug was observed by the following setup: * L0 is a 64CPU machine which runs KVM. * L1 is a 16CPU machine which runs KVM. * L0 & L1 runs with APICv disabled. (Also reproduced with APICv enabled but easier to analyze below info with APICv disabled) * L1 runs a 16CPU L2 Windows Server 2012 R2 guest. During L2 boot, L1 hangs completely and analyzing the hang reveals that one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI that he has sent for another L1 vCPU. And all other L1 vCPUs are currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck forever (as L1 runs with kernel-preemption disabled). Observing /sys/kernel/debug/tracing/trace_pipe reveals the following series of events: (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000 (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f vec 252 (Fixed|edge) (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210 (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION rip 0xffffe00069202690 info 83 0 (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000 (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15 Which can be analyzed as follows: (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject a pending-interrupt of 0xd2. (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT. (3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which notes that interrupt 0xd2 was reinjected and therefore calls vmx_inject_irq() and returns. Without checking for pending L1 events! Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() before calling inject_pending_event(). (4) L0 resumes L2 without immediate-exit even though there is a pending L1 event (The IPI pending in LAPIC's IRR). We have already reached the buggy scenario but events could be furthered analyzed: (5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during event-delivery. (7) L0 decides to forward the VMExit to L1 for further handling. (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the LAPIC's IRR is not examined and therefore the IPI is still not delivered into L1! Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28KVM: x86: Fix misleading comments on handling pending exceptionsLiran Alon
The reason that exception.pending should block re-injection of NMI/interrupt is not described correctly in comment in code. Instead, it describes why a pending exception should be injected before a pending NMI/interrupt. Therefore, move currently present comment to code-block evaluating a new pending event which explains why exception.pending is evaluated first. In addition, create a new comment describing that exception.pending blocks re-injection of NMI/interrupt because the exception was queued by handling vmexit which was due to NMI/interrupt delivery. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@orcle.com> [Used a comment from Sean J <sean.j.christopherson@intel.com>. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28KVM: x86: Rename interrupt.pending to interrupt.injectedLiran Alon
For exceptions & NMIs events, KVM code use the following coding convention: *) "pending" represents an event that should be injected to guest at some point but it's side-effects have not yet occurred. *) "injected" represents an event that it's side-effects have already occurred. However, interrupts don't conform to this coding convention. All current code flows mark interrupt.pending when it's side-effects have already taken place (For example, bit moved from LAPIC IRR to ISR). Therefore, it makes sense to just rename interrupt.pending to interrupt.injected. This change follows logic of previous commit 664f8e26b00c ("KVM: X86: Fix loss of exception which has not yet been injected") which changed exception to follow this coding convention as well. It is important to note that in case !lapic_in_kernel(vcpu), interrupt.pending usage was and still incorrect. In this case, interrrupt.pending can only be set using one of the following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that QEMU uses them either to re-set an "interrupt.pending" state it has received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or via KVM_GET_SREGS interrupt_bitmap) or by dispatching a new interrupt from QEMU's emulated LAPIC which reset bit in IRR and set bit in ISR before sending ioctl to KVM. So it seems that indeed "interrupt.pending" in this case is also suppose to represent "interrupt.injected". However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr() is misusing (now named) interrupt.injected in order to return if there is a pending interrupt. This leads to nVMX/nSVM not be able to distinguish if it should exit from L2 to L1 on EXTERNAL_INTERRUPT on pending interrupt or should re-inject an injected interrupt. Therefore, add a FIXME at these functions for handling this issue. This patch introduce no semantics change. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interruptLiran Alon
kvm_inject_realmode_interrupt() is called from one of the injection functions which writes event-injection to VMCS: vmx_queue_exception(), vmx_inject_irq() and vmx_inject_nmi(). All these functions are called just to cause an event-injection to guest. They are not responsible of manipulating the event-pending flag. The only purpose of kvm_inject_realmode_interrupt() should be to emulate real-mode interrupt-injection. This was also incorrect when called from vmx_queue_exception(). Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28x86/kvm: rename HV_X64_MSR_APIC_ASSIST_PAGE to HV_X64_MSR_VP_ASSIST_PAGELadi Prosek
The assist page has been used only for the paravirtual EOI so far, hence the "APIC" in the MSR name. Renaming to match the Hyper-V TLFS where it's called "Virtual VP Assist MSR". Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28KVM: x86: Fix perf timer mode IP reportingAndi Kleen
KVM and perf have a special backdoor mechanism to report the IP for interrupts re-executed after vm exit. This works for the NMIs that perf normally uses. However when perf is in timer mode it doesn't work because the timer interrupt doesn't get this special treatment. This is common when KVM is running nested in another hypervisor which may not implement the PMU, so only timer mode is available. Call the functions to set up the backdoor IP also for non NMI interrupts. I renamed the functions to set up the backdoor IP reporting to be more appropiate for their new use. The SVM change is only compile tested. v2: Moved the functions inline. For the normal interrupt case the before/after functions are now called from x86.c, not arch specific code. For the NMI case we still need to call it in the architecture specific code, because it's already needed in the low level *_run functions. Signed-off-by: Andi Kleen <ak@linux.intel.com> [Removed unnecessary calls from arch handle_external_intr. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-23KVM: x86: move setting of ept_identity_map_addr to vmx.cSean Christopherson
Add kvm_x86_ops->set_identity_map_addr and set ept_identity_map_addr in VMX specific code so that ept_identity_map_addr can be moved out of 'struct kvm_arch' in a future patch. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-21KVM: nVMX: Do not load EOI-exitmap while running L2Liran Alon
When L1 IOAPIC redirection-table is written, a request of KVM_REQ_SCAN_IOAPIC is set on all vCPUs. This is done such that all vCPUs will now recalc their IOAPIC handled vectors and load it to their EOI-exitmap. However, it could be that one of the vCPUs is currently running L2. In this case, load_eoi_exitmap() will be called which would write to vmcs02->eoi_exit_bitmap, which is wrong because vmcs02->eoi_exit_bitmap should always be equal to vmcs12->eoi_exit_bitmap. Furthermore, at this point KVM_REQ_SCAN_IOAPIC was already consumed and therefore we will never update vmcs01->eoi_exit_bitmap. This could lead to remote_irr of some IOAPIC level-triggered entry to remain set forever. Fix this issue by delaying the load of EOI-exitmap to when vCPU is running L1. One may wonder why not just delay entire KVM_REQ_SCAN_IOAPIC processing to when vCPU is running L1. This is done in order to handle correctly the case where LAPIC & IO-APIC of L1 is pass-throughed into L2. In this case, vmcs12->virtual_interrupt_delivery should be 0. In current nVMX implementation, that results in vmcs02->virtual_interrupt_delivery to also be 0. Thus, vmcs02->eoi_exit_bitmap is not used. Therefore, every L2 EOI cause a #VMExit into L0 (either on MSR_WRITE to x2APIC MSR or APIC_ACCESS/APIC_WRITE/EPT_MISCONFIG to APIC MMIO page). In order for such L2 EOI to be broadcasted, if needed, from LAPIC to IO-APIC, vcpu->arch.ioapic_handled_vectors must be updated while L2 is running. Therefore, patch makes sure to delay only the loading of EOI-exitmap but not the update of vcpu->arch.ioapic_handled_vectors. Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: X86: Provide a capability to disable PAUSE interceptsWanpeng Li
Allow to disable pause loop exit/pause filtering on a per VM basis. If some VMs have dedicated host CPUs, they won't be negatively affected due to needlessly intercepted PAUSE instructions. Thanks to Jan H. Schönherr's initial patch. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: X86: Provide a capability to disable HLT interceptsWanpeng Li
If host CPUs are dedicated to a VM, we can avoid VM exits on HLT. This patch adds the per-VM capability to disable them. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: X86: Provide a capability to disable MWAIT interceptsWanpeng Li
Allowing a guest to execute MWAIT without interception enables a guest to put a (physical) CPU into a power saving state, where it takes longer to return from than what may be desired by the host. Don't give a guest that power over a host by default. (Especially, since nothing prevents a guest from using MWAIT even when it is not advertised via CPUID.) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: x86: Add support for VMware backdoor Pseudo-PMCsArbel Moshe
VMware exposes the following Pseudo PMCs: 0x10000: Physical host TSC 0x10001: Elapsed real time in ns 0x10002: Elapsed apparent time in ns For more info refer to: https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf VMware allows access to these Pseduo-PMCs even when read via RDPMC in Ring3 and CR4.PCE=0. Therefore, commit modifies x86 emulator to allow access to these PMCs in this situation. In addition, emulation of these PMCs were added to kvm_pmu_rdpmc(). Signed-off-by: Arbel Moshe <arbel.moshe@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: x86: Emulate only IN/OUT instructions when accessing VMware backdoorLiran Alon
Access to VMware backdoor ports is done by one of the IN/OUT/INS/OUTS instructions. These ports must be allowed access even if TSS I/O permission bitmap don't allow it. To handle this, VMX/SVM will be changed in future commits to intercept #GP which was raised by such access and handle it by calling x86 emulator to emulate instruction. If it was one of these instructions, the x86 emulator already handles it correctly (Since commit "KVM: x86: Always allow access to VMware backdoor I/O ports") by not checking these ports against TSS I/O permission bitmap. One may wonder why checking for specific instructions is necessary as we can just forward all #GPs to the x86 emulator. There are multiple reasons for doing so: 1. We don't want the x86 emulator to be reached easily by guest by just executing an instruction that raises #GP as that exposes the x86 emulator as a bigger attack surface. 2. The x86 emulator is incomplete and therefore certain instructions that can cause #GP cannot be emulated. Such an example is "INT x" (opcode 0xcd) which reaches emulate_int() which can only emulate the instruction if vCPU is in real-mode. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: x86: Add emulation_type to not raise #UD on emulation failureLiran Alon
Next commits are going introduce support for accessing VMware backdoor ports even though guest's TSS I/O permissions bitmap doesn't allow access. This mimic VMware hypervisor behavior. In order to support this, next commits will change VMX/SVM to intercept #GP which was raised by such access and handle it by calling the x86 emulator to emulate instruction. Since commit "KVM: x86: Always allow access to VMware backdoor I/O ports", the x86 emulator handles access to these I/O ports by not checking these ports against the TSS I/O permission bitmap. However, there could be cases that CPU rasies a #GP on instruction that fails to be disassembled by the x86 emulator (Because of incomplete implementation for example). In those cases, we would like the #GP intercept to just forward #GP as-is to guest as if there was no intercept to begin with. However, current emulator code always queues #UD exception in case emulator fails (including disassembly failures) which is not what is wanted in this flow. This commit addresses this issue by adding a new emulation_type flag that will allow the #GP intercept handler to specify that it wishes to be aware when instruction emulation fails and doesn't want #UD exception to be queued. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: x86: Add module parameter for supporting VMware backdoorLiran Alon
Support access to VMware backdoor requires KVM to intercept #GP exceptions from guest which introduce slight performance hit. Therefore, control this support by module parameter. Note that module parameter is exported as it should be consumed by kvm_intel & kvm_amd to determine if they should intercept #GP or not. This commit doesn't change semantics. It is done as a preparation for future commits. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>