summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/vmx/nested.c
AgeCommit message (Collapse)Author
2020-09-11KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL controlChenyi Qiang
A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field in exit_ctls_high. Fixes: 03a8871add95 ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control") Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200828085622.8365-5-chenyi.qiang@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-11KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detectedPeter Shier
When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call load_pdptrs to read the possibly updated PDPTEs from the guest physical address referenced by CR3. It loads them into vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty. At the subsequent assumed reentry into L2, the mmu will call vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works if the L2 CRn write intercept always resumes L2. The resume path calls vmx_check_nested_events which checks for exceptions, MTF, and expired VMX preemption timers. If vmx_check_nested_events finds any of these conditions pending it will reflect the corresponding exit into L1. Live migration at this point would also cause a missed immediate reentry into L2. After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty. When L2 next resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale values stored at last L2 exit. A repro of this bug showed L2 entering triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values. When L2 is in PAE paging mode add a call to ept_load_pdptrs before leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in vcpu->arch.walk_mmu->pdptrs[]. Tested: kvm-unit-tests with new directed test: vmx_mtf_pdpte_test. Verified that test fails without the fix. Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix would repro readily. Ran 14 simultaneous L2s for 140 iterations with no failures. Signed-off-by: Peter Shier <pshier@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200820230545.2411347-1-pshier@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-10objtool: Rename frame.h -> objtool.hJulien Thierry
Header frame.h is getting more code annotations to help objtool analyze object files. Rename the file to objtool.h. [ jpoimboe: add objtool.h to MAINTAINERS ] Signed-off-by: Julien Thierry <jthierry@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
2020-08-06Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "s390: - implement diag318 x86: - Report last CPU for debugging - Emulate smaller MAXPHYADDR in the guest than in the host - .noinstr and tracing fixes from Thomas - nested SVM page table switching optimization and fixes Generic: - Unify shadow MMU cache data structures across architectures" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) KVM: SVM: Fix sev_pin_memory() error handling KVM: LAPIC: Set the TDCR settable bits KVM: x86: Specify max TDP level via kvm_configure_mmu() KVM: x86/mmu: Rename max_page_level to max_huge_page_level KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch KVM: x86: Pull the PGD's level from the MMU instead of recalculating it KVM: VMX: Make vmx_load_mmu_pgd() static KVM: x86/mmu: Add separate helper for shadow NPT root page role calc KVM: VMX: Drop a duplicate declaration of construct_eptp() KVM: nSVM: Correctly set the shadow NPT root level in its MMU role KVM: Using macros instead of magic values MIPS: KVM: Fix build error caused by 'kvm_run' cleanup KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support KVM: VMX: optimize #PF injection when MAXPHYADDR does not match KVM: VMX: Add guest physical address check in EPT violation and misconfig KVM: VMX: introduce vmx_need_pf_intercept KVM: x86: update exception bitmap on CPUID changes KVM: x86: rename update_bp_intercept to update_exception_bitmap ...
2020-07-30KVM: x86: Pull the PGD's level from the MMU instead of recalculating itSean Christopherson
Use the shadow_root_level from the current MMU as the root level for the PGD, i.e. for VMX's EPTP. This eliminates the weird dependency between VMX and the MMU where both must independently calculate the same root level for things to work correctly. Temporarily keep VMX's calculation of the level and use it to WARN if the incoming level diverges. Opportunistically refactor kvm_mmu_load_pgd() to avoid indentation hell, and rename a 'cr3' param in the load_mmu_pgd prototype that managed to survive the cr3 purge. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-27KVM: nVMX: check for invalid hdr.vmx.flagsPaolo Bonzini
hdr.vmx.flags is meant for future extensions to the ABI, rejecting invalid flags is necessary to avoid broken half-loads of the nVMX state. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-27KVM: nVMX: check for required but missing VMCS12 in KVM_SET_NESTED_STATEPaolo Bonzini
A missing VMCS12 was not causing -EINVAL (it was just read with copy_from_user, so it is not a security issue, but it is still wrong). Test for VMCS12 validity and reject the nested state if a VMCS12 is required but not present. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-10KVM: VMX: introduce vmx_need_pf_interceptPaolo Bonzini
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-7-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-10KVM: nVMX: fixes for preemption timer migrationPaolo Bonzini
Commit 850448f35aaf ("KVM: nVMX: Fix VMX preemption timer migration", 2020-06-01) accidentally broke nVMX live migration from older version by changing the userspace ABI. Restore it and, while at it, ensure that vmx->nested.has_preemption_timer_deadline is always initialized according to the KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE flag. Cc: Makarand Sonare <makarandsonare@google.com> Fixes: 850448f35aaf ("KVM: nVMX: Fix VMX preemption timer migration") Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-09KVM: x86: Rename cpuid_update() callback to vcpu_after_set_cpuid()Xiaoyao Li
The name of callback cpuid_update() is misleading that it's not about updating CPUID settings of vcpu but updating the configurations of vcpu based on the CPUIDs. So rename it to vcpu_after_set_cpuid(). Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200709043426.92712-5-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-08KVM: nVMX: Wrap VM-Fail valid path in generic VM-Fail helperSean Christopherson
Add nested_vmx_fail() to wrap VM-Fail paths that _may_ result in VM-Fail Valid to make it clear at the call sites that the Valid flavor isn't guaranteed. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200609015607.6994-1-sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-08KVM: x86/mmu: Make .write_log_dirty a nested operationSean Christopherson
Move .write_log_dirty() into kvm_x86_nested_ops to help differentiate it from the non-nested dirty log hooks. And because it's a nested-only operation. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-03KVM: VMX: Use KVM_POSSIBLE_CR*_GUEST_BITS to initialize guest/host masksSean Christopherson
Use the "common" KVM_POSSIBLE_CR*_GUEST_BITS defines to initialize the CR0/CR4 guest host masks instead of duplicating most of the CR4 mask and open coding the CR0 mask. SVM doesn't utilize the masks, i.e. the masks are effectively VMX specific even if they're not named as such. This avoids duplicate code, better documents the guest owned CR0 bit, and eliminates the need for a build-time assertion to keep VMX and x86 synchronized. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703040422.31536-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-13Merge tag 'x86-entry-2020-06-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry updates from Thomas Gleixner: "The x86 entry, exception and interrupt code rework This all started about 6 month ago with the attempt to move the Posix CPU timer heavy lifting out of the timer interrupt code and just have lockless quick checks in that code path. Trivial 5 patches. This unearthed an inconsistency in the KVM handling of task work and the review requested to move all of this into generic code so other architectures can share. Valid request and solved with another 25 patches but those unearthed inconsistencies vs. RCU and instrumentation. Digging into this made it obvious that there are quite some inconsistencies vs. instrumentation in general. The int3 text poke handling in particular was completely unprotected and with the batched update of trace events even more likely to expose to endless int3 recursion. In parallel the RCU implications of instrumenting fragile entry code came up in several discussions. The conclusion of the x86 maintainer team was to go all the way and make the protection against any form of instrumentation of fragile and dangerous code pathes enforcable and verifiable by tooling. A first batch of preparatory work hit mainline with commit d5f744f9a2ac ("Pull x86 entry code updates from Thomas Gleixner") That (almost) full solution introduced a new code section '.noinstr.text' into which all code which needs to be protected from instrumentation of all sorts goes into. Any call into instrumentable code out of this section has to be annotated. objtool has support to validate this. Kprobes now excludes this section fully which also prevents BPF from fiddling with it and all 'noinstr' annotated functions also keep ftrace off. The section, kprobes and objtool changes are already merged. The major changes coming with this are: - Preparatory cleanups - Annotating of relevant functions to move them into the noinstr.text section or enforcing inlining by marking them __always_inline so the compiler cannot misplace or instrument them. - Splitting and simplifying the idtentry macro maze so that it is now clearly separated into simple exception entries and the more interesting ones which use interrupt stacks and have the paranoid handling vs. CR3 and GS. - Move quite some of the low level ASM functionality into C code: - enter_from and exit to user space handling. The ASM code now calls into C after doing the really necessary ASM handling and the return path goes back out without bells and whistels in ASM. - exception entry/exit got the equivivalent treatment - move all IRQ tracepoints from ASM to C so they can be placed as appropriate which is especially important for the int3 recursion issue. - Consolidate the declaration and definition of entry points between 32 and 64 bit. They share a common header and macros now. - Remove the extra device interrupt entry maze and just use the regular exception entry code. - All ASM entry points except NMI are now generated from the shared header file and the corresponding macros in the 32 and 64 bit entry ASM. - The C code entry points are consolidated as well with the help of DEFINE_IDTENTRY*() macros. This allows to ensure at one central point that all corresponding entry points share the same semantics. The actual function body for most entry points is in an instrumentable and sane state. There are special macros for the more sensitive entry points, e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF. They allow to put the whole entry instrumentation and RCU handling into safe places instead of the previous pray that it is correct approach. - The INT3 text poke handling is now completely isolated and the recursion issue banned. Aside of the entry rework this required other isolation work, e.g. the ability to force inline bsearch. - Prevent #DB on fragile entry code, entry relevant memory and disable it on NMI, #MC entry, which allowed to get rid of the nested #DB IST stack shifting hackery. - A few other cleanups and enhancements which have been made possible through this and already merged changes, e.g. consolidating and further restricting the IDT code so the IDT table becomes RO after init which removes yet another popular attack vector - About 680 lines of ASM maze are gone. There are a few open issues: - An escape out of the noinstr section in the MCE handler which needs some more thought but under the aspect that MCE is a complete trainwreck by design and the propability to survive it is low, this was not high on the priority list. - Paravirtualization When PV is enabled then objtool complains about a bunch of indirect calls out of the noinstr section. There are a few straight forward ways to fix this, but the other issues vs. general correctness were more pressing than parawitz. - KVM KVM is inconsistent as well. Patches have been posted, but they have not yet been commented on or picked up by the KVM folks. - IDLE Pretty much the same problems can be found in the low level idle code especially the parts where RCU stopped watching. This was beyond the scope of the more obvious and exposable problems and is on the todo list. The lesson learned from this brain melting exercise to morph the evolved code base into something which can be validated and understood is that once again the violation of the most important engineering principle "correctness first" has caused quite a few people to spend valuable time on problems which could have been avoided in the first place. The "features first" tinkering mindset really has to stop. With that I want to say thanks to everyone involved in contributing to this effort. Special thanks go to the following people (alphabetical order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai Jiangshan, Macro Elver, Paolo Bonzin,i Paul McKenney, Peter Zijlstra, Vitaly Kuznetsov, and Will Deacon" * tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits) x86/entry: Force rcu_irq_enter() when in idle task x86/entry: Make NMI use IDTENTRY_RAW x86/entry: Treat BUG/WARN as NMI-like entries x86/entry: Unbreak __irqentry_text_start/end magic x86/entry: __always_inline CR2 for noinstr lockdep: __always_inline more for noinstr x86/entry: Re-order #DB handler to avoid *SAN instrumentation x86/entry: __always_inline arch_atomic_* for noinstr x86/entry: __always_inline irqflags for noinstr x86/entry: __always_inline debugreg for noinstr x86/idt: Consolidate idt functionality x86/idt: Cleanup trap_init() x86/idt: Use proper constants for table size x86/idt: Add comments about early #PF handling x86/idt: Mark init only functions __init x86/entry: Rename trace_hardirqs_off_prepare() x86/entry: Clarify irq_{enter,exit}_rcu() x86/entry: Remove DBn stacks x86/entry: Remove debug IDT frobbing x86/entry: Optimize local_db_save() for virt ...
2020-06-11Merge branch 'kvm-basic-exit-reason' into HEADPaolo Bonzini
Using a topic branch so that stable branches can simply cherry-pick the patch. Reviewed-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11KVM: nVMX: Consult only the "basic" exit reason when routing nested exitSean Christopherson
Consult only the basic exit reason, i.e. bits 15:0 of vmcs.EXIT_REASON, when determining whether a nested VM-Exit should be reflected into L1 or handled by KVM in L0. For better or worse, the switch statement in nested_vmx_exit_reflected() currently defaults to "true", i.e. reflects any nested VM-Exit without dedicated logic. Because the case statements only contain the basic exit reason, any VM-Exit with modifier bits set will be reflected to L1, even if KVM intended to handle it in L0. Practically speaking, this only affects EXIT_REASON_MCE_DURING_VMENTRY, i.e. a #MC that occurs on nested VM-Enter would be incorrectly routed to L1, as "failed VM-Entry" is the only modifier that KVM can currently encounter. The SMM modifiers will never be generated as KVM doesn't support/employ a SMI Transfer Monitor. Ditto for "exit from enclave", as KVM doesn't yet support virtualizing SGX, i.e. it's impossible to enter an enclave in a KVM guest (L1 or L2). Fixes: 644d711aa0e1 ("KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit") Cc: Jim Mattson <jmattson@google.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200227174430.26371-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11x86/entry: Optimize local_db_save() for virtPeter Zijlstra
Because DRn access is 'difficult' with virt; but the DR7 read is cheaper than a cacheline miss on native, add a virt specific fast path to local_db_save(), such that when breakpoints are not in use to avoid touching DRn entirely. Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200529213321.187833200@infradead.org
2020-06-08KVM: VMX: Properly handle kvm_read/write_guest_virt*() resultVitaly Kuznetsov
Syzbot reports the following issue: WARNING: CPU: 0 PID: 6819 at arch/x86/kvm/x86.c:618 kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618 ... Call Trace: ... RIP: 0010:kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618 ... nested_vmx_get_vmptr+0x1f9/0x2a0 arch/x86/kvm/vmx/nested.c:4638 handle_vmon arch/x86/kvm/vmx/nested.c:4767 [inline] handle_vmon+0x168/0x3a0 arch/x86/kvm/vmx/nested.c:4728 vmx_handle_exit+0x29c/0x1260 arch/x86/kvm/vmx/vmx.c:6067 'exception' we're trying to inject with kvm_inject_emulated_page_fault() comes from: nested_vmx_get_vmptr() kvm_read_guest_virt() kvm_read_guest_virt_helper() vcpu->arch.walk_mmu->gva_to_gpa() but it is only set when GVA to GPA conversion fails. In case it doesn't but we still fail kvm_vcpu_read_guest_page(), X86EMUL_IO_NEEDED is returned and nested_vmx_get_vmptr() calls kvm_inject_emulated_page_fault() with zeroed 'exception'. This happen when the argument is MMIO. Paolo also noticed that nested_vmx_get_vmptr() is not the only place in KVM code where kvm_read/write_guest_virt*() return result is mishandled. VMX instructions along with INVPCID have the same issue. This was already noticed before, e.g. see commit 541ab2aeb282 ("KVM: x86: work around leak of uninitialized stack contents") but was never fully fixed. KVM could've handled the request correctly by going to userspace and performing I/O but there doesn't seem to be a good need for such requests in the first place. Introduce vmx_handle_memory_failure() as an interim solution. Note, nested_vmx_get_vmptr() now has three possible outcomes: OK, PF, KVM_EXIT_INTERNAL_ERROR and callers need to know if userspace exit is needed (for KVM_EXIT_INTERNAL_ERROR) in case of failure. We don't seem to have a good enum describing this tristate, just add "int *ret" to nested_vmx_get_vmptr() interface to pass the information. Reported-by: syzbot+2a7156e11dc199bdbd8a@syzkaller.appspotmail.com Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200605115906.532682-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: selftests: VMX preemption timer migration testMakarand Sonare
When a nested VM with a VMX-preemption timer is migrated, verify that the nested VM and its parent VM observe the VMX-preemption timer exit close to the original expiration deadline. Signed-off-by: Makarand Sonare <makarandsonare@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200526215107.205814-3-makarandsonare@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nVMX: Fix VMX preemption timer migrationPeter Shier
Add new field to hold preemption timer expiration deadline appended to struct kvm_vmx_nested_state_hdr. This is to prevent the first VM-Enter after migration from incorrectly restarting the timer with the full timer value instead of partially decayed timer value. KVM_SET_NESTED_STATE restarts timer using migrated state regardless of whether L1 sets VM_EXIT_SAVE_VMX_PREEMPTION_TIMER. Fixes: cf8b84f48a593 ("kvm: nVMX: Prepare for checkpointing L2 state") Signed-off-by: Peter Shier <pshier@google.com> Signed-off-by: Makarand Sonare <makarandsonare@google.com> Message-Id: <20200526215107.205814-2-makarandsonare@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86: extend struct kvm_vcpu_pv_apf_data with token infoVitaly Kuznetsov
Currently, APF mechanism relies on the #PF abuse where the token is being passed through CR2. If we switch to using interrupts to deliver page-ready notifications we need a different way to pass the data. Extent the existing 'struct kvm_vcpu_pv_apf_data' with token information for page-ready notifications. While on it, rename 'reason' to 'flags'. This doesn't change the semantics as we only have reasons '1' and '2' and these can be treated as bit flags but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery making 'reason' name misleading. The newly introduced apf_put_user_ready() temporary puts both flags and token information, this will be changed to put token only when we switch to interrupt based notifications. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATEPaolo Bonzini
Similar to VMX, the state that is captured through the currently available IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was running at the moment when the process was interrupted to save its state. In particular, the SVM-specific state for nested virtualization includes the L1 saved state (including the interrupt flag), the cached L2 controls, and the GIF. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Change emulated VMX-preemption timer hrtimer to absoluteJim Mattson
Prepare for migration of this hrtimer, by changing it from relative to absolute. (I couldn't get migration to work with a relative timer.) Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-3-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Really make emulated nested preemption timer pinnedJim Mattson
The PINNED bit is ignored by hrtimer_init. It is only considered when starting the timer. When the hrtimer isn't pinned to the same logical processor as the vCPU thread to be interrupted, the emulated VMX-preemption timer often fails to adhere to the architectural specification. Fixes: f15a75eedc18e ("KVM: nVMX: make emulated nested preemption timer pinned") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-2-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Remove unused 'ops' param from nested_vmx_hardware_setup()Sean Christopherson
Remove a 'struct kvm_x86_ops' param that got left behind when the nested ops were moved to their own struct. Fixes: 33b22172452f0 ("KVM: x86: move nested-related kvm_x86_ops to a separate struct") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200506204653.14683-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Drop superfluous VMREAD of vmcs02.GUEST_SYSENTER_*Sean Christopherson
Don't propagate GUEST_SYSENTER_* from vmcs02 to vmcs12 on nested VM-Exit as the vmcs12 fields are updated in vmx_set_msr(), and writes to the corresponding MSRs are always intercepted by KVM when running L2. Dropping the propagation was intended to be done in the same commit that added vmcs12 writes in vmx_set_msr()[1], but for reasons unknown was only shuffled around[2][3]. [1] https://patchwork.kernel.org/patch/10933215 [2] https://patchwork.kernel.org/patch/10933215/#22682289 [3] https://lore.kernel.org/patchwork/patch/1088643 Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428231025.12766-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15KVM: nVMX: Tweak handling of failure code for nested VM-Enter failureSean Christopherson
Use an enum for passing around the failure code for a failed VM-Enter that results in VM-Exit to provide a level of indirection from the final resting place of the failure code, vmcs.EXIT_QUALIFICATION. The exit qualification field is an unsigned long, e.g. passing around 'u32 exit_qual' throws up red flags as it suggests KVM may be dropping bits when reporting errors to L1. This is a red herring because the only defined failure codes are 0, 2, 3, and 4, i.e. don't come remotely close to overflowing a u32. Setting vmcs.EXIT_QUALIFICATION on entry failure is further complicated by the MSR load list, which returns the (1-based) entry that failed, and the number of MSRs to load is a 32-bit VMCS field. At first blush, it would appear that overflowing a u32 is possible, but the number of MSRs that can be loaded is hardcapped at 4096 (limited by MSR_IA32_VMX_MISC). In other words, there are two completely disparate types of data that eventually get stuffed into vmcs.EXIT_QUALIFICATION, neither of which is an 'unsigned long' in nature. This was presumably the reasoning for switching to 'u32' when the related code was refactored in commit ca0bde28f2ed6 ("kvm: nVMX: Split VMCS checks from nested_vmx_run()"). Using an enum for the failure code addresses the technically-possible- but-will-never-happen scenario where Intel defines a failure code that doesn't fit in a 32-bit integer. The enum variables and values will either be automatically sized (gcc 5.4 behavior) or be subjected to some combination of truncation. The former case will simply work, while the latter will trigger a compile-time warning unless the compiler is being particularly unhelpful. Separating the failure code from the failed MSR entry allows for disassociating both from vmcs.EXIT_QUALIFICATION, which avoids the conundrum where KVM has to choose between 'u32 exit_qual' and tracking values as 'unsigned long' that have no business being tracked as such. To cement the split, set vmcs12->exit_qualification directly from the entry error code or failed MSR index instead of bouncing through a local variable. Opportunistically rename the variables in load_vmcs12_host_state() and vmx_set_nested_state() to call out that they're ignored, set exit_reason on demand on nested VM-Enter failure, and add a comment in nested_vmx_load_msr() to call out that returning 'i + 1' can't wrap. No functional change intended. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200511220529.11402-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Unconditionally validate CR3 during nested transitionsSean Christopherson
Unconditionally check the validity of the incoming CR3 during nested VM-Enter/VM-Exit to avoid invoking kvm_read_cr3() in the common case where the guest isn't using PAE paging. If vmcs.GUEST_CR3 hasn't yet been cached (common case), kvm_read_cr3() will trigger a VMREAD. The VMREAD (~30 cycles) alone is likely slower than nested_cr3_valid() (~5 cycles if vcpu->arch.maxphyaddr gets a cache hit), and the poor exchange only gets worse when retpolines are enabled as the call to kvm_x86_ops.cache_reg() will incur a retpoline (60+ cycles). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Skip IBPB when temporarily switching between vmcs01 and vmcs02Sean Christopherson
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS switch when temporarily loading vmcs02 to synchronize it to vmcs12, i.e. give copy_vmcs02_to_vmcs12_rare() the same treatment as vmx_switch_vmcs(). Make vmx_vcpu_load() static now that it's only referenced within vmx.c. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200506235850.22600-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02Sean Christopherson
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS switch when running with spectre_v2_user=on/auto if the switch is between two VMCSes in the same guest, i.e. between vmcs01 and vmcs02. The IBPB is intended to prevent one guest from attacking another, which is unnecessary in the nested case as it's the same guest from KVM's perspective. This all but eliminates the overhead observed for nested VMX transitions when running with CONFIG_RETPOLINE=y and spectre_v2_user=on/auto, which can be significant, e.g. roughly 3x on current systems. Reported-by: Alexander Graf <graf@amazon.com> Cc: KarimAllah Raslan <karahmed@amazon.de> Cc: stable@vger.kernel.org Fixes: 15d45071523d ("KVM/x86: Add IBPB support") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200501163117.4655-1-sean.j.christopherson@intel.com> [Invert direction of bool argument. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Prioritize SMI over nested IRQ/NMISean Christopherson
Check for an unblocked SMI in vmx_check_nested_events() so that pending SMIs are correctly prioritized over IRQs and NMIs when the latter events will trigger VM-Exit. This also fixes an issue where an SMI that was marked pending while processing a nested VM-Enter wouldn't trigger an immediate exit, i.e. would be incorrectly delayed until L2 happened to take a VM-Exit. Fixes: 64d6067057d96 ("KVM: x86: stubs for SMM support") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Preserve IRQ/NMI priority irrespective of exiting behaviorSean Christopherson
Short circuit vmx_check_nested_events() if an unblocked IRQ/NMI is pending and needs to be injected into L2, priority between coincident events is not dependent on exiting behavior. Fixes: b6b8a1451fc4 ("KVM: nVMX: Rework interception of IRQs and NMIs") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-9-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Report NMIs as allowed when in L2 and Exit-on-NMI is setSean Christopherson
Report NMIs as allowed when the vCPU is in L2 and L2 is being run with Exit-on-NMI enabled, as NMIs are always unblocked from L1's perspective in this case. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Open a window for pending nested VMX preemption timerSean Christopherson
Add a kvm_x86_ops hook to detect a nested pending "hypervisor timer" and use it to effectively open a window for servicing the expired timer. Like pending SMIs on VMX, opening a window simply means requesting an immediate exit. This fixes a bug where an expired VMX preemption timer (for L2) will be delayed and/or lost if a pending exception is injected into L2. The pending exception is rightly prioritized by vmx_check_nested_events() and injected into L2, with the preemption timer left pending. Because no window opened, L2 is free to run uninterrupted. Fixes: f4124500c2c13 ("KVM: nVMX: Fully emulate preemption timer") Reported-by: Jim Mattson <jmattson@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-3-sean.j.christopherson@intel.com> [Check it in kvm_vcpu_has_events too, to ensure that the preemption timer is serviced promptly even if the vCPU is halted and L1 is not intercepting HLT. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Preserve exception priority irrespective of exiting behaviorSean Christopherson
Short circuit vmx_check_nested_events() if an exception is pending and needs to be injected into L2, priority between coincident events is not dependent on exiting behavior. This fixes a bug where a single-step #DB that is not intercepted by L1 is incorrectly dropped due to servicing a VMX Preemption Timer VM-Exit. Injected exceptions also need to be blocked if nested VM-Enter is pending or an exception was already injected, otherwise injecting the exception could overwrite an existing event injection from L1. Technically, this scenario should be impossible, i.e. KVM shouldn't inject its own exception during nested VM-Enter. This will be addressed in a future patch. Note, event priority between SMI, NMI and INTR is incorrect for L2, e.g. SMI should take priority over VM-Exit on NMI/INTR, and NMI that is injected into L2 should take priority over VM-Exit INTR. This will also be addressed in a future patch. Fixes: b6b8a1451fc4 ("KVM: nVMX: Rework interception of IRQs and NMIs") Reported-by: Jim Mattson <jmattson@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13Merge branch 'kvm-amd-fixes' into HEADPaolo Bonzini
2020-05-04KVM: nVMX: Replace a BUG_ON(1) with BUG() to squash clang warningSean Christopherson
Use BUG() in the impossible-to-hit default case when switching on the scope of INVEPT to squash a warning with clang 11 due to clang treating the BUG_ON() as conditional. >> arch/x86/kvm/vmx/nested.c:5246:3: warning: variable 'roots_to_free' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] BUG_ON(1); Reported-by: kbuild test robot <lkp@intel.com> Fixes: ce8fe7b77bd8 ("KVM: nVMX: Free only the affected contexts when emulating INVEPT") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200504153506.28898-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-24KVM: nVMX: Store vmcs.EXIT_QUALIFICATION as an unsigned long, not u32Sean Christopherson
Use an unsigned long for 'exit_qual' in nested_vmx_reflect_vmexit(), the EXIT_QUALIFICATION field is naturally sized, not a 32-bit field. The bug is most easily observed by doing VMXON (or any VMX instruction) in L2 with a negative displacement, in which case dropping the upper bits on nested VM-Exit results in L1 calculating the wrong virtual address for the memory operand, e.g. "vmxon -0x8(%rbp)" yields: Unhandled cpu exception 14 #PF at ip 0000000000400553 rbp=0000000000537000 cr2=0000000100536ff8 Fixes: fbdd50250396d ("KVM: nVMX: Move VM-Fail check out of nested_vmx_exit_reflected()") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423001127.13490-1-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23KVM: nVMX: Drop a redundant call to vmx_get_intr_info()Sean Christopherson
Drop nested_vmx_l1_wants_exit()'s initialization of intr_info from vmx_get_intr_info() that was inadvertantly introduced along with the caching mechanism. EXIT_REASON_EXCEPTION_NMI, the only consumer of intr_info, populates the variable before using it. Fixes: bb53120d67cd ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200421075328.14458-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23KVM: x86: move nested-related kvm_x86_ops to a separate structPaolo Bonzini
Clean up some of the patching of kvm_x86_ops, by moving kvm_x86_ops related to nested virtualization into a separate struct. As a result, these ops will always be non-NULL on VMX. This is not a problem: * check_nested_events is only called if is_guest_mode(vcpu) returns true * get_nested_state treats VMXOFF state the same as nested being disabled * set_nested_state fails if you attempt to set nested state while nesting is disabled * nested_enable_evmcs could already be called on a CPU without VMX enabled in CPUID. * nested_get_evmcs_version was fixed in the previous patch Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Remove non-functional "support" for CR3 target valuesSean Christopherson
Remove all references to cr3_target_value[0-3] and replace the fields in vmcs12 with "dead_space" to preserve the vmcs12 layout. KVM doesn't support emulating CR3-target values, despite a variety of code that implies otherwise, as KVM unconditionally reports '0' for the number of supported CR3-target values. This technically fixes a bug where KVM would incorrectly allow VMREAD and VMWRITE to nonexistent fields, i.e. cr3_target_value[0-3]. Per Intel's SDM, the number of supported CR3-target values reported in VMX_MISC also enumerates the existence of the associated VMCS fields: If a future implementation supports more than 4 CR3-target values, they will be encoded consecutively following the 4 encodings given here. Alternatively, the "bug" could be fixed by actually advertisting support for 4 CR3-target values, but that'd likely just enable kvm-unit-tests given that no one has complained about lack of support for going on ten years, e.g. KVM, Xen and HyperV don't use CR3-target values. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200416000739.9012-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flagsSean Christopherson
Introduce a new "extended register" type, EXIT_INFO_2 (to pair with the nomenclature in .get_exit_info()), and use it to cache VMX's vmcs.EXIT_INTR_INFO. Drop a comment in vmx_recover_nmi_blocking() that is obsoleted by the generic caching mechanism. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flagsSean Christopherson
Introduce a new "extended register" type, EXIT_INFO_1 (to pair with the nomenclature in .get_exit_info()), and use it to cache VMX's vmcs.EXIT_QUALIFICATION. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Drop manual clearing of segment cache on nested VMCS switchSean Christopherson
Drop the call to vmx_segment_cache_clear() in vmx_switch_vmcs() now that the entire register cache is reset when switching the active VMCS, e.g. vmx_segment_cache_test_set() will reset the segment cache due to VCPU_EXREG_SEGMENTS being unavailable. Move vmx_segment_cache_clear() to vmx.c now that it's no longer invoked by the nested code. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Reset register cache (available and dirty masks) on VMCS switchSean Christopherson
Reset the per-vCPU available and dirty register masks when switching between vmcs01 and vmcs02, as the masks track state relative to the current VMCS. The stale masks don't cause problems in the current code base because the registers are either unconditionally written on nested transitions or, in the case of segment registers, have an additional tracker that is manually reset. Note, by dropping (previously implicitly, now explicitly) the dirty mask when switching the active VMCS, KVM is technically losing writes to the associated fields. But, the only regs that can be dirtied (RIP, RSP and PDPTRs) are unconditionally written on nested transitions, e.g. explicit writeback is a waste of cycles, and a WARN_ON would be rather pointless. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Invoke ept_save_pdptrs() if and only if PAE paging is enabledSean Christopherson
Invoke ept_save_pdptrs() when restoring L1's host state on a "late" VM-Fail if and only if PAE paging is enabled. This saves a CALL in the common case where L1 is a 64-bit host, and avoids incorrectly marking the PDPTRs as dirty. WARN if ept_save_pdptrs() is called with PAE disabled now that the nested usage pre-checks is_pae_paging(). Barring a bug in KVM's MMU, attempting to read the PDPTRs with PAE disabled is now impossible. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Rename exit_reason to vm_exit_reason for nested VM-ExitSean Christopherson
Use "vm_exit_reason" for code related to injecting a nested VM-Exit to VM-Exits to make it clear that nested_vmx_vmexit() expects the full exit eason, not just the basic exit reason. The basic exit reason (bits 15:0 of vmcs.VM_EXIT_REASON) is colloquially referred to as simply "exit reason". Note, other flows, e.g. vmx_handle_exit(), are intentionally left as is. A future patch will convert vmx->exit_reason to a union + bit-field, and the exempted flows will interact with the unionized of "exit_reason". Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Cast exit_reason to u16 to check for nested EXTERNAL_INTERRUPTSean Christopherson
Explicitly check only the basic exit reason when emulating an external interrupt VM-Exit in nested_vmx_vmexit(). Checking the full exit reason doesn't currently cause problems, but only because the only exit reason modifier support by KVM is FAILED_VMENTRY, which is mutually exclusive with EXTERNAL_INTERRUPT. Future modifiers, e.g. ENCLAVE_MODE, will coexist with EXTERNAL_INTERRUPT. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-9-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Pull exit_reason from vcpu_vmx in nested_vmx_reflect_vmexit()Sean Christopherson
Grab the exit reason from the vcpu struct in nested_vmx_reflect_vmexit() instead of having the exit reason explicitly passed from the caller. This fixes a discrepancy between VM-Fail and VM-Exit handling, as the VM-Fail case is already handled by checking vcpu_vmx, e.g. the exit reason previously passed on the stack is bogus if vmx->fail is set. Not taking the exit reason on the stack also avoids having to document that nested_vmx_reflect_vmexit() requires the full exit reason, as opposed to just the basic exit reason, which is not at all obvious since the only usages of the full exit reason are for tracing and way down in prepare_vmcs12() where it's propagated to vmcs12. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21KVM: nVMX: Drop a superfluous WARN on reflecting EXTERNAL_INTERRUPTSean Christopherson
Drop the WARN in nested_vmx_reflect_vmexit() that fires if KVM attempts to reflect an external interrupt. The WARN is blatantly impossible to hit now that nested_vmx_l0_wants_exit() is called from nested_vmx_reflect_vmexit() unconditionally returns true for EXTERNAL_INTERRUPT. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>