Age Commit message Author
2020-05-13KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit setSean Christopherson
Re-request KVM_REQ_EVENT if vcpu_enter_guest() bails after processing pending requests and an immediate exit was requested. This fixes a bug where a pending event, e.g. VMX preemption timer, is delayed and/or lost if the exit was deferred due to something other than a higher priority _injected_ event, e.g. due to a pending nested VM-Enter. This bug only affects the !injected case as kvm_x86_ops.cancel_injection() sets KVM_REQ_EVENT to redo the injection, but that's purely serendipitous behavior with respect to the deferred event. Note, emulated preemption timer isn't the only event that can be affected, it simply happens to be the only event where not re-requesting KVM_REQ_EVENT is blatantly visible to the guest. Fixes: f4124500c2c13 ("KVM: nVMX: Fully emulate preemption timer") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
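The control flow described in this message can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel code: `struct toy_vcpu`, `TOY_REQ_EVENT`, `toy_process_requests()` and `toy_enter_guest()` are hypothetical stand-ins for the vCPU request bitmap and vcpu_enter_guest().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for KVM_REQ_EVENT. */
#define TOY_REQ_EVENT (1u << 0)

struct toy_vcpu {
    uint32_t requests;   /* pending request bits */
    bool event_pending;  /* e.g. an expired preemption timer that must wait */
};

/* Consume TOY_REQ_EVENT; report whether an immediate exit is wanted. */
static bool toy_process_requests(struct toy_vcpu *v)
{
    bool had_event = (v->requests & TOY_REQ_EVENT) != 0;

    v->requests &= ~TOY_REQ_EVENT;
    return had_event && v->event_pending;
}

/* Returns true if the guest was entered, false if the run was canceled. */
static bool toy_enter_guest(struct toy_vcpu *v, bool cancel_run)
{
    bool req_immediate_exit;

    assert(v != NULL);
    req_immediate_exit = toy_process_requests(v);

    if (cancel_run) {
        /* The request bit was already consumed above; set it again so
         * the deferred event is re-evaluated on the next attempt. */
        if (req_immediate_exit)
            v->requests |= TOY_REQ_EVENT;
        return false;
    }
    return true;
}
```

Without the re-request in the cancel path, the next call would find no request bit set and the deferred event would not be looked at again.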
2020-05-13KVM: nVMX: Open a window for pending nested VMX preemption timerSean Christopherson
Add a kvm_x86_ops hook to detect a nested pending "hypervisor timer" and use it to effectively open a window for servicing the expired timer. Like pending SMIs on VMX, opening a window simply means requesting an immediate exit. This fixes a bug where an expired VMX preemption timer (for L2) will be delayed and/or lost if a pending exception is injected into L2. The pending exception is rightly prioritized by vmx_check_nested_events() and injected into L2, with the preemption timer left pending. Because no window opened, L2 is free to run uninterrupted. Fixes: f4124500c2c13 ("KVM: nVMX: Fully emulate preemption timer") Reported-by: Jim Mattson <jmattson@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-3-sean.j.christopherson@intel.com> [Check it in kvm_vcpu_has_events too, to ensure that the preemption timer is serviced promptly even if the vCPU is halted and L1 is not intercepting HLT. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: nVMX: Preserve exception priority irrespective of exiting behaviorSean Christopherson
Short circuit vmx_check_nested_events() if an exception is pending and needs to be injected into L2, priority between coincident events is not dependent on exiting behavior. This fixes a bug where a single-step #DB that is not intercepted by L1 is incorrectly dropped due to servicing a VMX Preemption Timer VM-Exit. Injected exceptions also need to be blocked if nested VM-Enter is pending or an exception was already injected, otherwise injecting the exception could overwrite an existing event injection from L1. Technically, this scenario should be impossible, i.e. KVM shouldn't inject its own exception during nested VM-Enter. This will be addressed in a future patch. Note, event priority between SMI, NMI and INTR is incorrect for L2, e.g. SMI should take priority over VM-Exit on NMI/INTR, and NMI that is injected into L2 should take priority over VM-Exit INTR. This will also be addressed in a future patch. Fixes: b6b8a1451fc4 ("KVM: nVMX: Rework interception of IRQs and NMIs") Reported-by: Jim Mattson <jmattson@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: SVM: Implement check_nested_events for NMICathy Avery
Migrate nested guest NMI intercept processing to new check_nested_events. Signed-off-by: Cathy Avery <cavery@redhat.com> Message-Id: <20200414201107.22952-2-cavery@redhat.com> [Reorder clauses as NMIs have higher priority than IRQs; inject immediate vmexit as is now done for IRQ vmexits. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: SVM: immediately inject INTR vmexitPaolo Bonzini
We can immediately leave SVM guest mode in svm_check_nested_events now that we have the nested_run_pending mechanism. This makes things easier because we can run the rest of inject_pending_event with GIF=0, and KVM will naturally end up requesting the next interrupt window. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: SVM: leave halted state on vmexitPaolo Bonzini
Similar to VMX, we need to leave the halted state when performing a vmexit. Failure to do so will cause a hang after vmexit. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13KVM: SVM: introduce nested_run_pendingPaolo Bonzini
We want to inject vmexits immediately from svm_check_nested_events, so that the interrupt/NMI window requests happen in inject_pending_event right after it returns. This however has the same issue as in vmx_check_nested_events, so introduce a nested_run_pending flag with the exact same purpose of delaying vmexit injection after the vmentry. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13Merge branch 'kvm-amd-fixes' into HEADPaolo Bonzini
2020-05-13KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.cBabu Moger
Though rdpkru and wrpkru are contingent upon CR4.PKE, the PKRU resource isn't. It can be read with XSAVE and written with XRSTOR. So, if we don't set the guest PKRU value here(kvm_load_guest_xsave_state), the guest can read the host value. In case of kvm_load_host_xsave_state, guest with CR4.PKE clear could potentially use XRSTOR to change the host PKRU value. While at it, move pkru state save/restore to common code and the host_pkru field to kvm_vcpu_arch. This will let SVM support protection keys. Cc: stable@vger.kernel.org Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Message-Id: <158932794619.44260.14508381096663848853.stgit@naples-babu.amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08KVM: SVM: Disable AVIC before setting V_IRQSuravee Suthikulpanit
The commit 64b5bd270426 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1") introduced a WARN_ON, which checks if AVIC is enabled when trying to set V_IRQ in the VMCB for enabling irq window. The following warning is triggered because the requesting vcpu (to deactivate AVIC) does not get to process APICv update request for itself until the next #vmexit. WARNING: CPU: 0 PID: 118232 at arch/x86/kvm/svm/svm.c:1372 enable_irq_window+0x6a/0xa0 [kvm_amd] RIP: 0010:enable_irq_window+0x6a/0xa0 [kvm_amd] Call Trace: kvm_arch_vcpu_ioctl_run+0x6e3/0x1b50 [kvm] ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm] ? _copy_to_user+0x26/0x30 ? kvm_vm_ioctl+0xb3e/0xd90 [kvm] ? set_next_entity+0x78/0xc0 kvm_vcpu_ioctl+0x236/0x610 [kvm] ksys_ioctl+0x8a/0xc0 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x58/0x210 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes by sending APICV update request to all other vcpus, and immediately update APIC for itself. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lkml.org/lkml/2020/5/2/167 Fixes: 64b5bd270426 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1") Message-Id: <1588818939-54264-1-git-send-email-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08KVM: Introduce kvm_make_all_cpus_request_except()Suravee Suthikulpanit
This allows making a request to all vcpus except the one specified in the parameter. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <1588771076-73790-2-git-send-email-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
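The idea of "request on everyone except one vCPU" can be sketched in userspace C as follows. The names here (`struct toy_vcpu`, `toy_make_all_cpus_request_except()`) are hypothetical and the signature is an assumption for illustration; the real helper works on KVM's own vCPU list and also kicks the target CPUs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_vcpu {
    uint32_t requests;  /* pending request bits */
};

/*
 * Set `req` on every vCPU in the array except `except` (which may be NULL
 * to mean "no exception"). Returns how many vCPUs received the request.
 */
static size_t toy_make_all_cpus_request_except(struct toy_vcpu *vcpus,
                                               size_t n, uint32_t req,
                                               const struct toy_vcpu *except)
{
    size_t count = 0;

    assert(vcpus != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (&vcpus[i] == except)
            continue;  /* the caller handles this one itself */
        vcpus[i].requests |= req;
        count++;
    }
    return count;
}
```

A caller that wants to act on itself immediately passes its own vCPU as `except` and then performs the update directly, which is the pattern the AVIC fix above relies on.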
2020-05-08KVM: VMX: pass correct DR6 for GD userspace exitPaolo Bonzini
When KVM_EXIT_DEBUG is raised for the disabled-breakpoints case (DR7.GD), DR6 was incorrectly copied from the value in the VM. Instead, DR6.BD should be set in order to catch this case. On AMD this does not need any special code because the processor triggers a #DB exception that is intercepted. However, the testcase would fail without the previous patch because both DR6.BS and DR6.BD would be set. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-08KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6Paolo Bonzini
There are two issues with KVM_EXIT_DEBUG on AMD, whose root cause is the different handling of DR6 on intercepted #DB exceptions on Intel and AMD. On Intel, #DB exceptions transmit the DR6 value via the exit qualification field of the VMCS, and the exit qualification only contains the description of the precise event that caused a vmexit. On AMD, instead the DR6 field of the VMCB is filled in as if the #DB exception was to be injected into the guest. This has two effects when guest debugging is in use: * the guest DR6 is clobbered * the kvm_run->debug.arch.dr6 field can accumulate more debug events, rather than just the last one that happened (the testcase in the next patch covers this issue). This patch fixes both issues by emulating, so to speak, the Intel behavior on AMD processors. The important observation is that (after the previous patches) the VMCB value of DR6 is only ever observable from the guest if KVM_DEBUGREG_WONT_EXIT is set. Therefore we can actually set vmcb->save.dr6 to any value we want as long as KVM_DEBUGREG_WONT_EXIT is clear, which it will be if guest debugging is enabled. Therefore it is possible to enter the guest with an all-zero DR6, reconstruct the #DB payload from the DR6 we get at exit time, and let kvm_deliver_exception_payload move the newly set bits into vcpu->arch.dr6. Some extra bits may be included in the payload if KVM_DEBUGREG_WONT_EXIT is set, but this is harmless. This may not be the most optimized way to deal with this, but it is simple and, being confined within SVM code, it gets rid of the set_dr6 callback and kvm_update_dr6. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
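The "enter with an all-zero DR6, rebuild the payload at exit" idea can be illustrated with a simplified userspace model. This is a sketch under assumptions: `struct toy_vcpu` and the `toy_*` functions are hypothetical, and the model deliberately ignores the RTM bit and the fixed-1 bits that real DR6 handling has to deal with. Only the bit positions of DR6.BD (bit 13) and DR6.BS (bit 14) are architectural.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_DR6_BD (1ul << 13)  /* debug register access detected */
#define TOY_DR6_BS (1ul << 14)  /* single step */

struct toy_vcpu {
    unsigned long arch_dr6;  /* models vcpu->arch.dr6 */
    unsigned long vmcb_dr6;  /* models vmcb->save.dr6 */
};

/* Before entering the guest with debugging on, start from a clean DR6. */
static void toy_prepare_entry(struct toy_vcpu *v)
{
    assert(v != NULL);
    v->vmcb_dr6 = 0;
}

/*
 * On an intercepted #DB the hardware has set bits in the VMCB copy.
 * Because the copy was zero at entry, whatever is set now describes
 * only the event that just happened; use that as the payload.
 */
static unsigned long toy_handle_db_exit(struct toy_vcpu *v)
{
    unsigned long payload;

    assert(v != NULL);
    payload = v->vmcb_dr6;
    v->arch_dr6 |= payload;  /* move the newly set bits into arch state */
    return payload;
}
```

With this scheme, a single-step exit followed by a DR7.GD exit reports BS and then BD separately, instead of the second report containing both bits.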
2020-05-08KVM: SVM: keep DR6 synchronized with vcpu->arch.dr6Paolo Bonzini
kvm_x86_ops.set_dr6 is only ever called with vcpu->arch.dr6 as the second argument. Ensure that the VMCB value is synchronized to vcpu->arch.dr6 on #DB (both "normal" and nested) and nested vmentry, so that the current value of DR6 is always available in vcpu->arch.dr6. The get_dr6 callback can just access vcpu->arch.dr6 and becomes redundant. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07KVM: nSVM: trap #DB and #BP to userspace if guest debugging is onPaolo Bonzini
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07KVM: selftests: Add KVM_SET_GUEST_DEBUG testPeter Xu
Covers fundamental tests for KVM_SET_GUEST_DEBUG. It is very close to the debug test in kvm-unit-test, but doing it from outside the guest. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200505205000.188252-4-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07KVM: X86: Fix single-step with KVM_SET_GUEST_DEBUGPeter Xu
When single-step is triggered with KVM_SET_GUEST_DEBUG, we should fill in the pc value with the current linear RIP rather than the cached singlestep address. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200505205000.188252-3-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07KVM: X86: Set RTM for DB_VECTOR too for KVM_EXIT_DEBUGPeter Xu
RTM should always be set, even with KVM_EXIT_DEBUG on #DB. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200505205000.188252-2-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07KVM: x86: fix DR6 delivery for various cases of #DB injectionPaolo Bonzini
Go through kvm_queue_exception_p so that the payload is correctly delivered through the exit qualification, and add a kvm_update_dr6 call to kvm_deliver_exception_payload that is needed on AMD. Reported-by: Peter Xu <peterx@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-07KVM: X86: Declare KVM_CAP_SET_GUEST_DEBUG properlyPeter Xu
KVM_CAP_SET_GUEST_DEBUG should be supported for x86 however it's not declared as supported. My wild guess is that userspaces like QEMU are using "#ifdef KVM_CAP_SET_GUEST_DEBUG" to check for the capability instead, but that could be wrong because the compilation host may not be the runtime host. The userspace might still want to keep the old "#ifdef" though to not break the guest debug on old kernels. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200505154750.126300-1-peterx@redhat.com> [Do the same for PPC and s390. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
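A runtime probe, as opposed to the compile-time "#ifdef", might look like the sketch below. KVM_CHECK_EXTENSION and KVM_CAP_SET_GUEST_DEBUG come from <linux/kvm.h>; the function name `has_set_guest_debug()` is hypothetical, and the caller is assumed to pass a descriptor obtained by opening /dev/kvm.

```c
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Ask the running kernel whether it supports KVM_SET_GUEST_DEBUG.
 * kvm_fd is assumed to be an open descriptor for /dev/kvm.
 * KVM_CHECK_EXTENSION returns 0 for "unsupported", a positive value
 * for "supported", and -1 on error (for example a bad descriptor).
 */
static bool has_set_guest_debug(int kvm_fd)
{
    int ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SET_GUEST_DEBUG);

    return ret > 0;
}
```

On kernels that predate this change the probe returns false even though the ioctl works, which is why the message suggests userspace may want to keep its old compile-time check as a fallback.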
2020-05-06KVM: selftests: Fix build for evmcs.hPeter Xu
I got this error when building kvm selftests: /usr/bin/ld: /home/xz/git/linux/tools/testing/selftests/kvm/libkvm.a(vmx.o):/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:222: multiple definition of `current_evmcs'; /tmp/cco1G48P.o:/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:222: first defined here /usr/bin/ld: /home/xz/git/linux/tools/testing/selftests/kvm/libkvm.a(vmx.o):/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:223: multiple definition of `current_vp_assist'; /tmp/cco1G48P.o:/home/xz/git/linux/tools/testing/selftests/kvm/include/evmcs.h:223: first defined here I think it's because evmcs.h is included both in a test file and a lib file so the structs have multiple declarations when linking. After all it's not a good habit to declare structs in the header files. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200504220607.99627-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
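The usual C remedy for this class of link error is to declare the object `extern` in the header and define it in exactly one source file. The single-file sketch below marks which part would live where; `struct toy_evmcs`, `toy_current_evmcs` and `toy_load_evmcs()` are hypothetical names, and this is not necessarily how the patch itself is written.

```c
#include <assert.h>
#include <stddef.h>

/* ---- would live in the header, included by many .c files ---- */
struct toy_evmcs {
    unsigned int revision_id;
};

/* Declaration only: reserves no storage, so repeated inclusion is safe. */
extern struct toy_evmcs *toy_current_evmcs;

/* ---- would live in exactly one .c file ---- */

/* The single definition that the linker resolves every reference to. */
struct toy_evmcs *toy_current_evmcs;

static void toy_load_evmcs(struct toy_evmcs *e)
{
    assert(e != NULL);
    toy_current_evmcs = e;
}
```

Every translation unit then refers to the same pointer, and the linker sees one definition instead of one per including file.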
2020-05-06kvm: x86: Use KVM CPU capabilities to determine CR4 reserved bitsPaolo Bonzini
Using CPUID data can be useful for the processor compatibility check, but that's it. Using it to compute guest-reserved bits can have both false positives (such as LA57 and UMIP which we are already handling) and false negatives: in particular, with this patch we no longer allow a KVM guest to set CR4.PKE when CR4.PKE is clear on the host. Fixes: b9dd21e104bc ("KVM: x86: simplify handling of PKRU") Reported-by: Jim Mattson <jmattson@google.com> Tested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-06KVM: VMX: Explicitly clear RFLAGS.CF and RFLAGS.ZF in VM-Exit RSB pathSean Christopherson
Clear CF and ZF in the VM-Exit path after doing __FILL_RETURN_BUFFER so that KVM doesn't interpret clobbered RFLAGS as a VM-Fail. Filling the RSB has always clobbered RFLAGS, its current incarnation just happens to clear CF and ZF in the process. Relying on the macro to clear CF and ZF is extremely fragile, e.g. commit 089dd8e53126e ("x86/speculation: Change FILL_RETURN_BUFFER to work with objtool") tweaks the loop such that the ZF flag is always set. Reported-by: Qian Cai <cai@lca.pw> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: stable@vger.kernel.org Fixes: f2fde6a5bcfcf ("KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200506035355.2242-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-06docs/virt/kvm: Document configuring and running nested guestsKashyap Chamarthy
This is a rewrite of this[1] Wiki page with further enhancements. The doc also includes a section on debugging problems in nested environments, among other improvements. [1] https://www.linux-kvm.org/page/Nested_Guests Signed-off-by: Kashyap Chamarthy <kchamart@redhat.com> Message-Id: <20200505112839.30534-1-kchamart@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-04kvm: ioapic: Restrict lazy EOI update to edge-triggered interruptsPaolo Bonzini
Commit f458d039db7e ("kvm: ioapic: Lazy update IOAPIC EOI") introduces the following infinite loop: BUG: stack guard page was hit at 000000008f595917 \ (stack is 00000000bdefe5a4..00000000ae2b06f5) kernel stack overflow (double-fault): 0000 [#1] SMP NOPTI RIP: 0010:kvm_set_irq+0x51/0x160 [kvm] Call Trace: irqfd_resampler_ack+0x32/0x90 [kvm] kvm_notify_acked_irq+0x62/0xd0 [kvm] kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm] ioapic_set_irq+0x20e/0x240 [kvm] kvm_ioapic_set_irq+0x5c/0x80 [kvm] kvm_set_irq+0xbb/0x160 [kvm] ? kvm_hv_set_sint+0x20/0x20 [kvm] irqfd_resampler_ack+0x32/0x90 [kvm] kvm_notify_acked_irq+0x62/0xd0 [kvm] kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm] ioapic_set_irq+0x20e/0x240 [kvm] kvm_ioapic_set_irq+0x5c/0x80 [kvm] kvm_set_irq+0xbb/0x160 [kvm] ? kvm_hv_set_sint+0x20/0x20 [kvm] .... The re-entrancy happens because the irq state is the OR of the interrupt state and the resamplefd state. That is, we don't want to show the state as 0 until we've had a chance to set the resamplefd. But if the interrupt has _not_ gone low then ioapic_set_irq is invoked again, causing an infinite loop. This can only happen for a level-triggered interrupt, otherwise irqfd_inject would immediately set the KVM_USERSPACE_IRQ_SOURCE_ID high and then low. Fortunately, in the case of level-triggered interrupts the VMEXIT already happens because TMR is set. Thus, fix the bug by restricting the lazy invocation of the ack notifier to edge-triggered interrupts, the only ones that need it. Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reported-by: borisvk@bstnet.org Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://www.spinics.net/lists/kvm/msg213512.html Fixes: f458d039db7e ("kvm: ioapic: Lazy update IOAPIC EOI") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207489 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-04KVM: x86: Fixes posted interrupt check for IRQs delivery modesSuravee Suthikulpanit
The current logic incorrectly uses the enum ioapic_irq_destination_types to check the posted interrupt destination types. However, the value was set using APIC_DM_XXX macros, which are left-shifted by 8 bits. Fix this by using APIC_DM_FIXED and APIC_DM_LOWEST instead. Fixes: (fdcf75621375 'KVM: x86: Disable posted interrupts for non-standard IRQs delivery modes') Cc: Alexander Graf <graf@amazon.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <1586239989-58305-1-git-send-email-suravee.suthikulpanit@amd.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
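The mismatch is easy to reproduce in isolation. In the sketch below the two APIC_DM_* values match the kernel's definitions, while the `toy_*` enum and functions are hypothetical stand-ins for the enum and the check the message talks about.

```c
#include <stdbool.h>
#include <stdint.h>

/* Delivery modes as encoded in the register: the mode sits in bits 8-10. */
#define APIC_DM_FIXED  0x00000u
#define APIC_DM_LOWEST 0x00100u

/* Unshifted enumeration, a stand-in for ioapic_irq_destination_types. */
enum toy_ioapic_dest {
    toy_dest_Fixed = 0,
    toy_dest_LowestPrio = 1,
};

/* Buggy check: compares a shifted value against unshifted enumerators. */
static bool toy_postable_buggy(uint32_t delivery_mode)
{
    return delivery_mode == toy_dest_Fixed ||
           delivery_mode == toy_dest_LowestPrio;
}

/* Fixed check: compares like with like. */
static bool toy_postable_fixed(uint32_t delivery_mode)
{
    return delivery_mode == APIC_DM_FIXED ||
           delivery_mode == APIC_DM_LOWEST;
}
```

Fixed mode happens to be zero in both encodings, so the buggy check still accepts it; only lowest-priority delivery (0x100 versus 1) is misclassified.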
2020-05-04Merge tag 'kvmarm-fixes-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-masterPaolo Bonzini
KVM/arm fixes for Linux 5.7, take #2 - Fix compilation with Clang - Correctly initialize GICv4.1 in the absence of a virtual ITS - Move SP_EL0 save/restore to the guest entry/exit code - Handle PC wrap around on 32bit guests, and narrow all 32bit registers on userspace access
2020-05-04Merge tag 'kvmarm-fixes-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-masterPaolo Bonzini
KVM/arm fixes for Linux 5.7, take #1 - Prevent the userspace API from interacting directly with the HW stage of the virtual GIC - Fix a couple of vGIC memory leaks - Tighten the rules around the use of the 32bit PSCI functions for 64bit guest, as well as the opposite situation (matches the specification)
2020-05-04KVM: SVM: fill in kvm_run->debug.arch.dr[67]Paolo Bonzini
The corresponding code was added for VMX in commit 42dbaa5a057 ("KVM: x86: Virtualize debug registers", 2008-12-15) but never for AMD. Fix this. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-04KVM: nVMX: Replace a BUG_ON(1) with BUG() to squash clang warningSean Christopherson
Use BUG() in the impossible-to-hit default case when switching on the scope of INVEPT to squash a warning with clang 11 due to clang treating the BUG_ON() as conditional. >> arch/x86/kvm/vmx/nested.c:5246:3: warning: variable 'roots_to_free' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] BUG_ON(1); Reported-by: kbuild test robot <lkp@intel.com> Fixes: ce8fe7b77bd8 ("KVM: nVMX: Free only the affected contexts when emulating INVEPT") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200504153506.28898-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-01KVM: arm64: Fix 32bit PC wrap-aroundMarc Zyngier
In the unlikely event that a 32bit vcpu traps into the hypervisor on an instruction that is located right at the end of the 32bit range, the emulation of that instruction is going to increment PC past the 32bit range. This isn't great, as userspace can then observe this value and get a bit confused. Conversely, userspace can do things like (in the context of a 64bit guest that is capable of 32bit EL0) setting PSTATE to AArch64-EL0, set PC to a 64bit value, change PSTATE to AArch32-USR, and observe that PC hasn't been truncated. More confusion. Fix both by: - truncating PC increments for 32bit guests - sanitizing all 32bit regs every time a core reg is changed by userspace, and that PSTATE indicates a 32bit mode. Cc: stable@vger.kernel.org Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org>
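The first half of the fix, truncating PC increments for 32bit guests, amounts to the following. `toy_skip_instr()` is a hypothetical helper written for illustration; it keeps the PC in a 64-bit variable, as a hypervisor would, and narrows it only when the vCPU is in a 32bit mode.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Advance the program counter past an emulated instruction.
 * For a 32bit guest the result is truncated so that an instruction at
 * the very end of the 32bit range wraps to 0 instead of spilling into
 * bit 32 of the 64-bit register that holds the PC.
 */
static uint64_t toy_skip_instr(uint64_t pc, unsigned int instr_len,
                               bool is_32bit)
{
    pc += instr_len;
    if (is_32bit)
        pc = (uint32_t)pc;
    return pc;
}
```

For example, a 4-byte instruction at 0xfffffffc leaves a 32bit guest at PC 0, whereas the untruncated arithmetic would expose 0x100000000 to userspace.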
2020-04-30KVM: arm64: vgic-v4: Initialize GICv4.1 even in the absence of a virtual ITSMarc Zyngier
KVM now expects to be able to use HW-accelerated delivery of vSGIs as soon as the guest has enabled them. Unfortunately, we only initialize the GICv4 context if we have a virtual ITS exposed to the guest. Fix it by always initializing the GICv4.1 context if it is available on the host. Fixes: 2291ff2f2a56 ("KVM: arm64: GICv4.1: Plumb SGI implementation selection in the distributor") Reviewed-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-04-30KVM: arm64: Save/restore sp_el0 as part of __guest_enterMarc Zyngier
We currently save/restore sp_el0 in C code. This is a bit unsafe, as a lot of the C code expects 'current' to be accessible from there (and the opportunity to run kernel code in HYP is especially great with VHE). Instead, let's move the save/restore of sp_el0 to the assembly code (in __guest_enter), making sure that sp_el0 is correct very early on when we exit the guest, and is preserved as long as possible to its host value when we enter the guest. Reviewed-by: Andrew Jones <drjones@redhat.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-04-30KVM: arm64: Delete duplicated label in invalid_vectorFangrui Song
SYM_CODE_START defines \label , so it is redundant to define \label again. A redefinition at the same place is accepted by GNU as (https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=159fbb6088f17a341bcaaac960623cab881b4981) but rejected by the clang integrated assembler. Fixes: 617a2f392c92 ("arm64: kvm: Annotate assembly using modern annoations") Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://github.com/ClangBuiltLinux/linux/issues/988 Link: https://lore.kernel.org/r/20200413231016.250737-1-maskray@google.com
2020-04-24KVM: SVM: do not allow VMRUN inside SMMPaolo Bonzini
VMRUN is not supported inside the SMM handler and the behavior is undefined. Just raise a #UD. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-24kvm: add capability for halt pollingDavid Matlack
KVM_CAP_HALT_POLL is a per-VM capability that lets userspace control the halt-polling time, allowing halt-polling to be tuned or disabled on particular VMs. With dynamic halt-polling, a VM's VCPUs can poll from anywhere from [0, halt_poll_ns] on each halt. KVM_CAP_HALT_POLL sets the upper limit on the poll time. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Jon Cargille <jcargill@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200417221446.108733-1-jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
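The effect of a per-VM upper limit on dynamic halt-polling can be sketched as a clamp. Everything in this block is an assumption made for illustration: the `toy_*` names, the starting value and the growth factor are hypothetical and are not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical tuning values, chosen only for this illustration. */
#define TOY_GROW_START_NS 10000u
#define TOY_GROW_FACTOR   2u

/*
 * Compute the next poll window for a vCPU after a successful poll.
 * The window grows geometrically but never exceeds the per-VM maximum,
 * so a maximum of 0 disables polling for that VM entirely.
 */
static uint64_t toy_grow_halt_poll_ns(uint64_t cur_ns, uint64_t vm_max_ns)
{
    uint64_t next = cur_ns ? cur_ns * TOY_GROW_FACTOR : TOY_GROW_START_NS;

    return next > vm_max_ns ? vm_max_ns : next;
}
```

Whatever the growth policy, the result stays inside [0, vm_max_ns], which is the range the message describes.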
2020-04-24KVM: nVMX: Store vmcs.EXIT_QUALIFICATION as an unsigned long, not u32Sean Christopherson
Use an unsigned long for 'exit_qual' in nested_vmx_reflect_vmexit(), the EXIT_QUALIFICATION field is naturally sized, not a 32-bit field. The bug is most easily observed by doing VMXON (or any VMX instruction) in L2 with a negative displacement, in which case dropping the upper bits on nested VM-Exit results in L1 calculating the wrong virtual address for the memory operand, e.g. "vmxon -0x8(%rbp)" yields: Unhandled cpu exception 14 #PF at ip 0000000000400553 rbp=0000000000537000 cr2=0000000100536ff8 Fixes: fbdd50250396d ("KVM: nVMX: Move VM-Fail check out of nested_vmx_exit_reflected()") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423001127.13490-1-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
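The numbers in the fault report can be reproduced with plain integer arithmetic. The sketch assumes, as the message implies, that the exit qualification carries the instruction's displacement and that the operand address is the base register plus that displacement; the `toy_*` function names are hypothetical.

```c
#include <stdint.h>

/* Correct: keep the full 64-bit exit qualification. */
static uint64_t toy_operand_gva(uint64_t base, uint64_t exit_qual)
{
    return base + exit_qual;
}

/* Buggy: the exit qualification passes through a 32-bit variable. */
static uint64_t toy_operand_gva_truncated(uint64_t base, uint64_t exit_qual)
{
    uint32_t narrowed = (uint32_t)exit_qual;  /* upper 32 bits are lost */

    return base + narrowed;
}
```

With rbp = 0x537000 and a displacement of -8, the correct address is 0x536ff8. After truncation the displacement becomes 0xfffffff8, and the sum is 0x100536ff8, which is exactly the cr2 value in the report.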
2020-04-23KVM: nVMX: Drop a redundant call to vmx_get_intr_info()Sean Christopherson
Drop nested_vmx_l1_wants_exit()'s initialization of intr_info from vmx_get_intr_info() that was inadvertently introduced along with the caching mechanism. EXIT_REASON_EXCEPTION_NMI, the only consumer of intr_info, populates the variable before using it. Fixes: bb53120d67cd ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200421075328.14458-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23Merge branch 'kvm-arm64/vgic-fixes-5.7' into kvmarm-master/masterMarc Zyngier
2020-04-23Merge branch 'kvm-arm64/psci-fixes-5.7' into kvmarm-master/masterMarc Zyngier
2020-04-23KVM: arm64: vgic-its: Fix memory leak on the error path of vgic_add_lpi()Zenghui Yu
If we're going to fail out of vgic_add_lpi(), let's make sure the allocated vgic_irq memory is also freed. Though it seems that both cases are unlikely to fail. Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200414030349.625-3-yuzenghui@huawei.com
2020-04-23KVM: arm64: vgic-v3: Retire all pending LPIs on vcpu destroyZenghui Yu
It's likely that the vcpu fails to handle all virtual interrupts if userspace decides to destroy it, leaving the pending ones in the ap_list. If the un-handled one is an LPI, its vgic_irq structure will be eventually leaked because of an extra refcount increment in vgic_queue_irq_unlock(). This was detected by kmemleak on almost every guest destroy, the backtrace is as follows: unreferenced object 0xffff80725aed5500 (size 128): comm "CPU 5/KVM", pid 40711, jiffies 4298024754 (age 166366.512s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 08 01 a9 73 6d 80 ff ff ...........sm... c8 61 ee a9 00 20 ff ff 28 1e 55 81 6c 80 ff ff .a... ..(.U.l... backtrace: [<000000004bcaa122>] kmem_cache_alloc_trace+0x2dc/0x418 [<0000000069c7dabb>] vgic_add_lpi+0x88/0x418 [<00000000bfefd5c5>] vgic_its_cmd_handle_mapi+0x4dc/0x588 [<00000000cf993975>] vgic_its_process_commands.part.5+0x484/0x1198 [<000000004bd3f8e3>] vgic_its_process_commands+0x50/0x80 [<00000000b9a65b2b>] vgic_mmio_write_its_cwriter+0xac/0x108 [<0000000009641ebb>] dispatch_mmio_write+0xd0/0x188 [<000000008f79d288>] __kvm_io_bus_write+0x134/0x240 [<00000000882f39ac>] kvm_io_bus_write+0xe0/0x150 [<0000000078197602>] io_mem_abort+0x484/0x7b8 [<0000000060954e3c>] kvm_handle_guest_abort+0x4cc/0xa58 [<00000000e0d0cd65>] handle_exit+0x24c/0x770 [<00000000b44a7fad>] kvm_arch_vcpu_ioctl_run+0x460/0x1988 [<0000000025fb897c>] kvm_vcpu_ioctl+0x4f8/0xee0 [<000000003271e317>] do_vfs_ioctl+0x160/0xcd8 [<00000000e7f39607>] ksys_ioctl+0x98/0xd8 Fix it by retiring all pending LPIs in the ap_list on the destroy path. P.S. I can also reproduce it on a normal guest shutdown. This is because userspace still sends LPIs to the vcpu (through the KVM_SIGNAL_MSI ioctl) while the guest is being shut down and is unable to handle them. A little strange though, and I haven't dug further...
Reviewed-by: James Morse <james.morse@arm.com> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> [maz: moved the distributor deallocation down to avoid an UAF splat] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200414030349.625-2-yuzenghui@huawei.com
2020-04-23KVM: arm: vgic-v2: Only use the virtual state when userspace accesses pending bitsMarc Zyngier
There is no point in accessing the HW when writing to any of the ISPENDR/ICPENDR registers from userspace, as only the guest should be allowed to change the HW state. Introduce new userspace-specific accessors that deal solely with the virtual state. Note that the API differs from that of GICv3, where userspace exclusively uses ISPENDR to set the state. Too bad we can't reuse it. Fixes: 82e40f558de56 ("KVM: arm/arm64: vgic-v2: Handle SGI bits in GICD_I{S,C}PENDR0 as WI") Reviewed-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-04-23KVM: x86: move nested-related kvm_x86_ops to a separate structPaolo Bonzini
Clean up some of the patching of kvm_x86_ops, by moving kvm_x86_ops related to nested virtualization into a separate struct. As a result, these ops will always be non-NULL on VMX. This is not a problem: * check_nested_events is only called if is_guest_mode(vcpu) returns true * get_nested_state treats VMXOFF state the same as nested being disabled * set_nested_state fails if you attempt to set nested state while nesting is disabled * nested_enable_evmcs could already be called on a CPU without VMX enabled in CPUID. * nested_get_evmcs_version was fixed in the previous patch Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23KVM: eVMCS: check if nesting is enabledPaolo Bonzini
In the next patch nested_get_evmcs_version will be always set in kvm_x86_ops for VMX, even if nesting is disabled. Therefore, check whether VMX (aka nesting) is available in the function, the caller will not do the check anymore. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23KVM: x86: check_nested_events is never NULLPaolo Bonzini
Both Intel and AMD now implement it, so there is no need to check if the callback is implemented. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-22KVM: arm: vgic: Only use the virtual state when userspace accesses enable bitsMarc Zyngier
There is no point in accessing the HW when writing to any of the ISENABLER/ICENABLER registers from userspace, as only the guest should be allowed to change the HW state. Introduce new userspace-specific accessors that deal solely with the virtual state. Reported-by: James Morse <james.morse@arm.com> Tested-by: James Morse <james.morse@arm.com> Reviewed-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-04-22KVM: arm: vgic: Synchronize the whole guest on GIC{D,R}_I{S,C}ACTIVER readMarc Zyngier
When a guest tries to read the active state of its interrupts, we currently just return whatever state we have in memory. This means that if such an interrupt lives in a List Register on another CPU, we fail to observe the latest active state for this interrupt. In order to remedy this, stop all the other vcpus so that they exit and we can observe the most recent value for the state. This is similar to what we are doing for the write side of the same registers, and results in new MMIO handlers for userspace (which do not need to stop the guest, as it is supposed to be stopped already). Reported-by: Julien Grall <julien@xen.org> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2020-04-21Merge tag 'kvm-ppc-fixes-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-masterPaolo Bonzini
PPC KVM fix for 5.7 - Fix a regression introduced in the last merge window, which results in guests in HPT mode dying randomly.
2020-04-21Merge tag 'kvm-s390-master-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-masterPaolo Bonzini
KVM: s390: Fix for 5.7 and maintainer update - Silence false positive lockdep warning - add Claudio as reviewer