path: root/arch/x86
Age  Commit message  Author
2017-08-25  Merge branch 'linus' into sched/core, to pick up fixes  Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25  Merge branch 'linus' into locking/core, to pick up fixes  Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25  perf/x86: Export some PMU attributes in caps/ directory  Andi Kleen
It can be difficult for user programs to figure out what features the x86 CPU PMU driver actually supports. Currently it requires grepping in dmesg, but dmesg is not always available.

This adds a caps directory to /sys/bus/event_source/devices/cpu/, similar to the caps already used on intel_pt, which can be used to discover the available capabilities cleanly.

Three capabilities are defined:

- pmu_name: Underlying CPU name known to the driver
- max_precise: Max precise level supported
- branches: Known depth of LBR.

Example:

  % grep . /sys/bus/event_source/devices/cpu/caps/*
  /sys/bus/event_source/devices/cpu/caps/branches:32
  /sys/bus/event_source/devices/cpu/caps/max_precise:3
  /sys/bus/event_source/devices/cpu/caps/pmu_name:skylake

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170822185201.9261-3-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
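As an illustration of how a user program might consume the caps directory described above, here is a minimal sketch in C. The helper names pmu_cap_path() and pmu_read_cap() are hypothetical and not part of any kernel or perf API; only the directory path and the capability file names come from the commit message. The read helper is expected to fail on kernels without this change or on non-x86 hosts.

```c
#include <stdio.h>
#include <string.h>

/* Directory named in the commit message above. */
#define PMU_CAPS_DIR "/sys/bus/event_source/devices/cpu/caps"

/* Hypothetical helper: build the path of one capability file.
 * Returns 0 on success, -1 if the path does not fit in the buffer. */
static int pmu_cap_path(char *out, size_t len, const char *cap)
{
	int n = snprintf(out, len, "%s/%s", PMU_CAPS_DIR, cap);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: read one capability as a string.
 * Returns 0 on success, -1 if the file is missing or unreadable. */
static int pmu_read_cap(const char *cap, char *out, size_t len)
{
	char path[256];
	FILE *f;

	if (len == 0 || pmu_cap_path(path, sizeof(path), cap) != 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(out, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	out[strcspn(out, "\n")] = '\0';	/* strip the trailing newline */
	return 0;
}
```

A caller would pass "pmu_name", "max_precise" or "branches" and treat a -1 return as "capability not advertised", instead of grepping dmesg.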
2017-08-25  perf/x86: Only show format attributes when supported  Andi Kleen
Only show the Intel format attributes in sysfs when the feature is actually supported with the current model numbers. This allows programs to probe what format attributes are available, and give a sensible error message to users if they are not.

This handles nearly all cases for Intel attributes since Nehalem, except the (obscure) case when the model number is known, but PEBS is disabled in PERF_CAPABILITIES.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170822185201.9261-2-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25  perf/x86: Fix data source decoding for Skylake  Andi Kleen
Skylake changed the encoding of the PEBS data source field. Some combinations are not available anymore, but some new cases, e.g. for L4 cache hit, are added. Fix up the conversion table for Skylake, similar to what had been done for Nehalem.

On Skylake server the encoding for L4 actually means persistent memory. Handle this case too.

To properly describe it in the abstracted perf format I had to add some new fields. Since a hit can have only one level, add a new field that is an enumeration, not a bit field, to describe the level. It can describe any level. Some numbers are also used to describe PMEM and LFB.

Also add a new generic remote flag that can be combined with the generic level to signify a remote cache.

And there is an extension field for the snoop indication to handle the Forward state.

I didn't add a generic flag for hops because it's not needed for Skylake.

I changed the existing encodings for older CPUs to also fill in the new level and remote fields.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: jolsa@kernel.org
Link: http://lkml.kernel.org/r/20170816222156.19953-3-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
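The design described above (one enumerated level plus an independent remote flag, instead of one bit per level) can be sketched as follows. All type, field and constant names here are hypothetical and chosen for illustration only; the real definitions live in the perf UAPI header and are not reproduced in this message.

```c
#include <stdio.h>

/* Hypothetical level enumeration: a hit has exactly one level. */
enum mem_level { LVL_L1 = 1, LVL_L2, LVL_L3, LVL_L4, LVL_RAM, LVL_PMEM, LVL_LFB };

/* Hypothetical record: enumerated level plus a combinable remote flag. */
struct mem_src {
	unsigned level  : 4;	/* one value from enum mem_level */
	unsigned remote : 1;	/* set when the level was hit on a remote node */
};

static const char *level_name(unsigned level)
{
	switch (level) {
	case LVL_L1:   return "L1";
	case LVL_L2:   return "L2";
	case LVL_L3:   return "L3";
	case LVL_L4:   return "L4";
	case LVL_RAM:  return "RAM";
	case LVL_PMEM: return "PMEM";
	case LVL_LFB:  return "LFB";
	default:       return "N/A";
	}
}

/* Per the commit text: on Skylake server the "L4" encoding means
 * persistent memory; elsewhere it is an L4 cache hit. */
static enum mem_level decode_l4_encoding(int is_skylake_server)
{
	return is_skylake_server ? LVL_PMEM : LVL_L4;
}

/* Render a record such as "remote L3" into out. */
static int describe(struct mem_src s, char *out, size_t len)
{
	return snprintf(out, len, "%s%s", s.remote ? "remote " : "",
			level_name(s.level));
}
```

With an enumeration, adding a new level costs one number rather than one bit, and "remote" composes with any level without needing a separate remote variant per cache.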
2017-08-25  perf/x86: Move Nehalem PEBS code to flag  Andi Kleen
Minor cleanup: use an explicit x86_pmu flag to handle the missing Lock / TLB information on Nehalem, instead of always checking the model number for each PEBS sample. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: jolsa@kernel.org Link: http://lkml.kernel.org/r/20170816222156.19953-2-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25  x86/mm: Fix use-after-free of ldt_struct  Eric Biggers
The following commit:

  39a0526fb3f7 ("x86/mm: Factor out LDT init from context init")

renamed init_new_context() to init_new_context_ldt() and added a new init_new_context() which calls init_new_context_ldt(). However, the error code of init_new_context_ldt() was ignored. Consequently, if a memory allocation in alloc_ldt_struct() failed during a fork(), the ->context.ldt of the new task remained the same as that of the old task (due to the memcpy() in dup_mm()). ldt_struct's are not intended to be shared, so a use-after-free occurred after one task exited.

Fix the bug by making init_new_context() pass through the error code of init_new_context_ldt().

This bug was found by syzkaller, which encountered the following splat:

  BUG: KASAN: use-after-free in free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
  Read of size 4 at addr ffff88006d2cb7c8 by task kworker/u9:0/3710

  CPU: 1 PID: 3710 Comm: kworker/u9:0 Not tainted 4.13.0-rc4-next-20170811 #2
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:16 [inline]
   dump_stack+0x194/0x257 lib/dump_stack.c:52
   print_address_description+0x73/0x250 mm/kasan/report.c:252
   kasan_report_error mm/kasan/report.c:351 [inline]
   kasan_report+0x24e/0x340 mm/kasan/report.c:409
   __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
   free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
   free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
   destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
   destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
   __mmdrop+0xe9/0x530 kernel/fork.c:889
   mmdrop include/linux/sched/mm.h:42 [inline]
   exec_mmap fs/exec.c:1061 [inline]
   flush_old_exec+0x173c/0x1ff0 fs/exec.c:1291
   load_elf_binary+0x81f/0x4ba0 fs/binfmt_elf.c:855
   search_binary_handler+0x142/0x6b0 fs/exec.c:1652
   exec_binprm fs/exec.c:1694 [inline]
   do_execveat_common.isra.33+0x1746/0x22e0 fs/exec.c:1816
   do_execve+0x31/0x40 fs/exec.c:1860
   call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100
   ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

  Allocated by task 3700:
   save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
   save_stack+0x43/0xd0 mm/kasan/kasan.c:447
   set_track mm/kasan/kasan.c:459 [inline]
   kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
   kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3627
   kmalloc include/linux/slab.h:493 [inline]
   alloc_ldt_struct+0x52/0x140 arch/x86/kernel/ldt.c:67
   write_ldt+0x7b7/0xab0 arch/x86/kernel/ldt.c:277
   sys_modify_ldt+0x1ef/0x240 arch/x86/kernel/ldt.c:307
   entry_SYSCALL_64_fastpath+0x1f/0xbe

  Freed by task 3700:
   save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
   save_stack+0x43/0xd0 mm/kasan/kasan.c:447
   set_track mm/kasan/kasan.c:459 [inline]
   kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
   __cache_free mm/slab.c:3503 [inline]
   kfree+0xca/0x250 mm/slab.c:3820
   free_ldt_struct.part.2+0xdd/0x150 arch/x86/kernel/ldt.c:121
   free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
   destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
   destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
   __mmdrop+0xe9/0x530 kernel/fork.c:889
   mmdrop include/linux/sched/mm.h:42 [inline]
   __mmput kernel/fork.c:916 [inline]
   mmput+0x541/0x6e0 kernel/fork.c:927
   copy_process.part.36+0x22e1/0x4af0 kernel/fork.c:1931
   copy_process kernel/fork.c:1546 [inline]
   _do_fork+0x1ef/0xfb0 kernel/fork.c:2025
   SYSC_clone kernel/fork.c:2135 [inline]
   SyS_clone+0x37/0x50 kernel/fork.c:2129
   do_syscall_64+0x26c/0x8c0 arch/x86/entry/common.c:287
   return_from_SYSCALL_64+0x0/0x7a

Here is a C reproducer:

  #include <asm/ldt.h>
  #include <pthread.h>
  #include <signal.h>
  #include <stdlib.h>
  #include <sys/syscall.h>
  #include <sys/wait.h>
  #include <unistd.h>

  /* Thread body: fork once more while the parent is about to exit. */
  static void *fork_thread(void *_arg)
  {
  	fork();
  	return NULL;
  }

  int main(void)
  {
  	struct user_desc desc = { .entry_number = 8191 };

  	/* Install a large LDT so that later copies need a big allocation. */
  	syscall(__NR_modify_ldt, 1, &desc, sizeof(desc));

  	for (;;) {
  		if (fork() == 0) {
  			pthread_t t;

  			srand(getpid());
  			pthread_create(&t, NULL, fork_thread, NULL);
  			usleep(rand() % 10000);
  			/* Kill the whole group while the thread may be forking. */
  			syscall(__NR_exit_group, 0);
  		}
  		wait(NULL);
  	}
  }

Note: the reproducer takes advantage of the fact that alloc_ldt_struct() may use vmalloc() to allocate a large ->entries array, and after commit:

  5d17a73a2ebe ("vmalloc: back off when the current task is killed")

it is possible for userspace to fail a task's vmalloc() by sending a fatal signal, e.g. via exit_group(). It would be more difficult to reproduce this bug on kernels without that commit.

This bug only affected kernels with CONFIG_MODIFY_LDT_SYSCALL=y.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org> [v4.6+]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: 39a0526fb3f7 ("x86/mm: Factor out LDT init from context init")
Link: http://lkml.kernel.org/r/20170824175029.76040-1-ebiggers3@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
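The shape of the bug and of the fix described in this commit can be modelled in userspace. The sketch below is not the kernel code: struct ldt, struct mm, the fail_alloc knob and the init_ctx_*() functions are hypothetical stand-ins, assuming only what the message states (the child starts as a byte copy of the parent, and the inner function leaves the copied pointer untouched when allocation fails).

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct ldt { int nr_entries; };
struct mm  { struct ldt *ldt; };

static int fail_alloc;	/* knob for the sketch: force allocation failure */

/* Models the inner init: give the child its own copy of the parent's LDT. */
static int init_ctx_ldt(struct mm *new_mm, const struct mm *old_mm)
{
	struct ldt *copy;

	if (!old_mm->ldt) {
		new_mm->ldt = NULL;
		return 0;
	}
	copy = fail_alloc ? NULL : malloc(sizeof(*copy));
	if (!copy)
		return -ENOMEM;	/* new_mm->ldt still aliases old_mm->ldt */
	*copy = *old_mm->ldt;
	new_mm->ldt = copy;
	return 0;
}

/* Buggy shape: the error is dropped, so the caller proceeds with a shared LDT. */
static int init_ctx_buggy(struct mm *new_mm, const struct mm *old_mm)
{
	(void)init_ctx_ldt(new_mm, old_mm);
	return 0;
}

/* Fixed shape: pass the error through so that the fork can fail cleanly. */
static int init_ctx_fixed(struct mm *new_mm, const struct mm *old_mm)
{
	return init_ctx_ldt(new_mm, old_mm);
}
```

In the buggy variant both tasks end up owning the same pointer, so whichever exits second frees or reads memory that is already gone; the fixed variant reports -ENOMEM before that state can be observed.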
2017-08-25  KVM, pkeys: do not use PKRU value in vcpu->arch.guest_fpu.state  Paolo Bonzini
The host pkru is restored right after vcpu exit (commit 1be0e61), so KVM_GET_XSAVE will return the host PKRU value instead. Fix this by using the guest PKRU explicitly in fill_xsave and load_xsave. This part is based on a patch by Junkang Fu. The host PKRU data may also not match the value in vcpu->arch.guest_fpu.state, because it could have been changed by userspace since the last time it was saved, so skip loading it in kvm_load_guest_fpu. Reported-by: Junkang Fu <junkang.fjk@alibaba-inc.com> Cc: Yang Zhang <zy107165@alibaba-inc.com> Fixes: 1be0e61c1f255faaeab04a390e00c8b9b9042870 Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-25  KVM: x86: simplify handling of PKRU  Paolo Bonzini
Move it to struct kvm_arch_vcpu, replacing guest_pkru_valid with a simple comparison against the host value of the register. The write of PKRU in addition can be skipped if the guest has not enabled the feature. Once we do this, we need not test OSPKE in the host anymore, because guest_CR4.PKE=1 implies host_CR4.PKE=1. The static PKU test is kept to elide the code on older CPUs. Suggested-by: Yang Zhang <zy107165@alibaba-inc.com> Fixes: 1be0e61c1f255faaeab04a390e00c8b9b9042870 Cc: stable@vger.kernel.org Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
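The condition described above can be written as a single predicate. This is a sketch under stated assumptions, not the KVM code: pkru_write_needed() and its parameters are hypothetical names, and the three inputs correspond to the static PKU test, the guest's CR4.PKE setting, and the comparison against the host value mentioned in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate: must PKRU be written when switching between
 * the host and guest values of the register? */
static bool pkru_write_needed(bool cpu_has_pku, bool guest_cr4_pke,
			      uint32_t guest_pkru, uint32_t host_pkru)
{
	/* No PKU on this CPU, or the guest never enabled it: nothing to do.
	 * Otherwise a write is only needed when the values actually differ. */
	return cpu_has_pku && guest_cr4_pke && guest_pkru != host_pkru;
}
```

Comparing against the host value removes the need for a separate "valid" flag: equal values mean the register already holds what the guest expects.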
2017-08-25  KVM: x86: block guest protection keys unless the host has them enabled  Paolo Bonzini
If the host has protection keys disabled, we cannot read and write the guest PKRU---RDPKRU and WRPKRU fail with #GP(0) if CR4.PKE=0. Block the PKU cpuid bit in that case. This ensures that guest_CR4.PKE=1 implies host_CR4.PKE=1. Fixes: 1be0e61c1f255faaeab04a390e00c8b9b9042870 Cc: stable@vger.kernel.org Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-25  Merge branch 'x86/asm' into x86/apic  Thomas Gleixner
Pick up dependent changes to avoid merge conflicts
2017-08-24  um: Fix check for _xstate for older hosts  Florian Fainelli
Commit 0a987645672e ("um: Allow building and running on older hosts") attempted to check for PTRACE_{GET,SET}REGSET under the premise that these ptrace(2) parameters were directly linked with the presence of the _xstate structure.

After Richard's commit 61e8d462457f ("um: Correctly check for PTRACE_GETRESET/SETREGSET") which properly included linux/ptrace.h instead of asm/ptrace.h, we could get into the original build failure that I reported:

  arch/x86/um/user-offsets.c: In function 'foo':
  arch/x86/um/user-offsets.c:54: error: invalid application of 'sizeof' to incomplete type 'struct _xstate'

On this particular host, we do have PTRACE_GETREGSET and PTRACE_SETREGSET defined in linux/ptrace.h, but not the structure _xstate that should be pulled from the following include chain: signal.h -> bits/sigcontext.h.

Correctly fix this by checking for FP_XSTATE_MAGIC1 which is the correct way to see if struct _xstate is available or not on the host.

Fixes: 61e8d462457f ("um: Correctly check for PTRACE_GETRESET/SETREGSET")
Fixes: 0a987645672e ("um: Allow building and running on older hosts")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
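The check described above is a plain preprocessor test. A minimal sketch follows; host_xstate_size() is a hypothetical helper, not the code in user-offsets.c, and the fallback of returning 0 is an assumption made so the snippet also builds on hosts or C libraries that expose neither the macro nor the structure.

```c
#include <signal.h>
#include <stddef.h>

/* Hypothetical helper: size of struct _xstate on hosts whose headers
 * provide it, 0 otherwise. FP_XSTATE_MAGIC1 is used as the indicator,
 * as the commit message recommends. */
static size_t host_xstate_size(void)
{
#ifdef FP_XSTATE_MAGIC1
	return sizeof(struct _xstate);
#else
	return 0;
#endif
}
```

Keying the test on a macro defined alongside the structure avoids the earlier mistake of inferring the structure's presence from unrelated ptrace constants.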
2017-08-24  KVM: nVMX: Fix trying to cancel vmlauch/vmresume  Wanpeng Li
  ------------[ cut here ]------------
  WARNING: CPU: 7 PID: 3861 at /home/kernel/ssd/kvm/arch/x86/kvm//vmx.c:11299 nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
  CPU: 7 PID: 3861 Comm: qemu-system-x86 Tainted: G W OE 4.13.0-rc4+ #11
  RIP: 0010:nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
  Call Trace:
   ? kvm_multiple_exception+0x149/0x170 [kvm]
   ? handle_emulation_failure+0x79/0x230 [kvm]
   ? load_vmcs12_host_state+0xa80/0xa80 [kvm_intel]
   ? check_chain_key+0x137/0x1e0
   ? reexecute_instruction.part.168+0x130/0x130 [kvm]
   nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
   ? nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
   vmx_queue_exception+0x197/0x300 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0x1b0c/0x2c90 [kvm]
   ? kvm_arch_vcpu_runnable+0x220/0x220 [kvm]
   ? preempt_count_sub+0x18/0xc0
   ? restart_apic_timer+0x17d/0x300 [kvm]
   ? kvm_lapic_restart_hv_timer+0x37/0x50 [kvm]
   ? kvm_arch_vcpu_load+0x1d8/0x350 [kvm]
   kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
   ? kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
   ? kvm_dev_ioctl+0xbe0/0xbe0 [kvm]

The flag "nested_run_pending" can override the decision of which should run next, L1 or L2. nested_run_pending=1 means that we *must* run L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending is especially intended to avoid switching to L1 in the injection decision-point.

This can be handled just like the other cases in vmx_check_nested_events, instead of having a special case in vmx_queue_exception.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24  KVM: X86: Fix loss of exception which has not yet been injected  Wanpeng Li
vmx_complete_interrupts() assumes that the exception is always injected, so it can be dropped by kvm_clear_exception_queue(). However, an exception cannot be injected immediately if it is: 1) originally destined to a nested guest; 2) trapped to cause a vmexit; 3) happening right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true. This patch applies to exceptions the same algorithm that is used for NMIs, replacing exception.reinject with "exception.injected" (equivalent to nmi_injected). exception.pending now represents an exception that is queued and whose side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet. If exception.pending is true, the exception might result in a nested vmexit instead, too (in which case the side effects must not be applied). exception.injected instead represents an exception that is going to be injected into the guest at the next vmentry. Reported-by: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24  KVM: VMX: use kvm_event_needs_reinjection  Wanpeng Li
Use kvm_event_needs_reinjection() encapsulation. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24  KVM: MMU: speedup update_permission_bitmask  Paolo Bonzini
update_permission_bitmask currently does a 128-iteration loop to, essentially, compute a constant array. Computing the 8 bits in parallel reduces it to 16 iterations, and is enough to speed it up substantially because many boolean operations in the inner loop become constants or simplify noticeably.

Because update_permission_bitmask is actually the top item in the profile for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand clock cycles, or up to 30%:

                                before   after
cpuid                            35173   25954
vmcall                           35122   27079
inl_from_pmtimer                 52635   42675
inl_from_qemu                    53604   44599
inl_from_kernel                  38498   30798
outl_to_kernel                   34508   28816
wr_tsc_adjust_msr                34185   26818
rd_tsc_adjust_msr                37409   27049
mmio-no-eventfd:pci-mem          50563   45276
mmio-wildcard-eventfd:pci-mem    34495   30823
mmio-datamatch-eventfd:pci-mem   35612   31071
portio-no-eventfd:pci-io         44925   40661
portio-wildcard-eventfd:pci-io   29708   27269
portio-datamatch-eventfd:pci-io  31135   27164

(I wrote a small C program to compare the tables for all values of CR0.WP, CR4.SMAP and CR4.SMEP, and they match).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
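To illustrate what "computing the 8 bits in parallel" means, the sketch below evaluates a deliberately simplified fault rule for all eight combinations of three access bits, once with a per-combination loop and once with byte-wide logic. The rule, the constants and the function names are assumptions for illustration; KVM's real permission rules (CR0.WP, SMEP, SMAP and so on) are more involved and are not reproduced here.

```c
#include <stdint.h>

/* Hypothetical access bits; combination i (0..7) is a set of these. */
#define ACC_WRITE 2u
#define ACC_USER  4u

/* Bit i is set when combination i includes the given access bit. */
#define W_MASK 0xCCu	/* combinations 2, 3, 6, 7 allow writes */
#define U_MASK 0xF0u	/* combinations 4, 5, 6, 7 allow user access */

/* Reference version: decide one combination per iteration. */
static uint8_t faults_loop(int want_write, int want_user)
{
	uint8_t out = 0;

	for (unsigned i = 0; i < 8; i++) {
		int fault = (want_write && !(i & ACC_WRITE)) ||
			    (want_user && !(i & ACC_USER));
		if (fault)
			out |= (uint8_t)(1u << i);
	}
	return out;
}

/* Parallel version: the same decision for all 8 combinations at once. */
static uint8_t faults_parallel(int want_write, int want_user)
{
	uint8_t wf = want_write ? (uint8_t)~W_MASK : 0;	/* write to non-writable */
	uint8_t uf = want_user ? (uint8_t)~U_MASK : 0;	/* user to non-user */

	return (uint8_t)(wf | uf);
}
```

Because the masks are compile-time constants, each conditional collapses to a constant byte, which is the kind of simplification the message credits for the speedup.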
2017-08-24  KVM: MMU: Expose the LA57 feature to VM.  Yu Zhang
This patch exposes the 5-level page table feature to the VM. At the same time, the canonical virtual address check is extended to support both 48-bit and 57-bit address widths.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
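A canonical-address check parameterized by address width can be sketched as below. The function name is hypothetical and this is not the KVM helper; it only encodes the architectural rule that every bit above the top implemented address bit must equal that bit, for a width of 48 or 57.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: is addr canonical for the given virtual address
 * width? Assumes 1 <= vaddr_bits <= 64. */
static bool is_canonical(uint64_t addr, unsigned vaddr_bits)
{
	/* Bits [63:vaddr_bits-1] must be all zeros or all ones. */
	uint64_t upper = addr >> (vaddr_bits - 1);
	uint64_t all_ones = UINT64_MAX >> (vaddr_bits - 1);

	return upper == 0 || upper == all_ones;
}
```

For example, 0x0000800000000000 is non-canonical with 48-bit addresses but canonical with 57-bit addresses, which is why the check has to take the width as an input once LA57 is exposed.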
2017-08-24  KVM: MMU: Add 5 level EPT & Shadow page table support.  Yu Zhang
Extends the shadow paging code, so that 5 level shadow page table can be constructed if VM is running in 5 level paging mode. Also extends the ept code, so that 5 level ept table can be constructed if maxphysaddr of VM exceeds 48 bits. Unlike the shadow logic, KVM should still use 4 level ept table for a VM whose physical address width is less than 48 bits, even when the VM is running in 5 level paging mode. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> [Unconditionally reset the MMU context in kvm_cpuid_update. Changing MAXPHYADDR invalidates the reserved bit bitmasks. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24  KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL.  Yu Zhang
Now that 64-bit long mode has both 4-level and 5-level page tables, rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL, so that PT64_ROOT_5LEVEL can be used for the 5-level page table. This makes the code clearer.

Also, PT64_ROOT_MAX_LEVEL is defined as 4, so that it can simply be redefined to 5 whenever a replacement is needed for 5-level paging.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24  KVM: MMU: check guest CR3 reserved bits based on its physical address width.  Yu Zhang
Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the reserved bits in CR3. Yet the length of reserved bits in guest CR3 should be based on the physical address width exposed to the VM. This patch changes CR3 check logic to calculate the reserved bits at runtime. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
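Computing the reserved-bit mask at runtime from the guest's physical address width could look like the sketch below. The function names are hypothetical, and treating every bit from the physical address width up to bit 63 as reserved is an assumption: the message above does not state the exact upper bound KVM checks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: mask of bits [63:maxphyaddr].
 * Assumes 0 < maxphyaddr < 64. */
static uint64_t high_reserved_mask(unsigned maxphyaddr)
{
	return ~UINT64_C(0) << maxphyaddr;
}

/* Hypothetical check: does this CR3 value set any bit that is reserved
 * for a guest with the given physical address width? */
static bool cr3_has_reserved_bits(uint64_t cr3, unsigned maxphyaddr)
{
	return (cr3 & high_reserved_mask(maxphyaddr)) != 0;
}
```

A fixed compile-time mask can only be right for one width; with the width as a parameter, a CR3 value using bit 50 is rejected for a 48-bit guest and accepted for a 52-bit one.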
2017-08-24  KVM: x86: Add return value to kvm_cpuid().  Yu Zhang
Return false in kvm_cpuid() when it fails to find the cpuid entry. Also, this routine (and its caller) is optimized with a new argument, check_limit, so that the check_cpuid_limit() fallback can be avoided.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24  kvm: vmx: Raise #UD on unsupported XSAVES/XRSTORS  Paolo Bonzini
A guest may not be configured to support XSAVES/XRSTORS, even when the host does. If the guest does not support XSAVES/XRSTORS, clear the secondary execution control so that the processor will raise #UD. Also clear the "allowed-1" bit for XSAVES/XRSTORS exiting in the IA32_VMX_PROCBASED_CTLS2 MSR, and pass through VMCS12's control in the VMCS02. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24  kvm: vmx: Raise #UD on unsupported RDSEED  Jim Mattson
A guest may not be configured to support RDSEED, even when the host does. If the guest does not support RDSEED, intercept the instruction and synthesize #UD. Also clear the "allowed-1" bit for RDSEED exiting in the IA32_VMX_PROCBASED_CTLS2 MSR. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24  kvm: vmx: Raise #UD on unsupported RDRAND  Jim Mattson
A guest may not be configured to support RDRAND, even when the host does. If the guest does not support RDRAND, intercept the instruction and synthesize #UD. Also clear the "allowed-1" bit for RDRAND exiting in the IA32_VMX_PROCBASED_CTLS2 MSR. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24  KVM: VMX: cache secondary exec controls  Paolo Bonzini
Currently, secondary execution controls are divided in three groups:

- static, depending mostly on the module arguments or the processor (vmx_secondary_exec_control)
- static, depending on CPUID (vmx_cpuid_update)
- dynamic, depending on nested VMX or local APIC state

Because walking CPUID is expensive, prepare_vmcs02 is using only the first group. This however is unnecessarily complicated. Just cache the static secondary execution controls, and then prepare_vmcs02 does not need to compute them every time. Computation of all static secondary execution controls is now kept in a single function, vmx_compute_secondary_exec_control.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24  Merge branch 'linus' into perf/core, to pick up fixes  Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-24  x86/lguest: Remove lguest support  Juergen Gross
Lguest seems to be rather unused these days. For the last two years it has only seen patches ensuring it still builds, and its official state is "Odd Fixes".

Remove it in order to be able to clean up the paravirt code.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: boris.ostrovsky@oracle.com
Cc: lguest@lists.ozlabs.org
Cc: rusty@rustcorp.com.au
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20170816173157.8633-3-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-24  x86/paravirt/xen: Remove xen_patch()  Juergen Gross
Xen's paravirt patch function xen_patch() does some special casing for irq_ops functions to apply relocations when those functions can be patched inline instead of calls. Unfortunately none of the special case function replacements is small enough to be patched inline, so the special case never applies. As xen_patch() will call paravirt_patch_default() in all cases it can be just dropped. xen-asm.h doesn't seem necessary without xen_patch() as the only thing left in it would be the definition of XEN_EFLAGS_NMI used only once. So move that definition and remove xen-asm.h. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boris.ostrovsky@oracle.com Cc: lguest@lists.ozlabs.org Cc: rusty@rustcorp.com.au Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/20170816173157.8633-2-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-23  KVM: SVM: Enable Virtual GIF feature  Janakarajan Natarajan
Enable the Virtual GIF feature. This is done by setting bit 25 at position 60h in the vmcb. With this feature enabled, the processor uses bit 9 at position 60h as the virtual GIF when executing STGI/CLGI instructions. Since the execution of STGI by the L1 hypervisor does not cause a return to the outermost (L0) hypervisor, the enable_irq_window and enable_nmi_window are modified. The IRQ window will be opened even if GIF is not set, under the assumption that on resuming the L1 hypervisor the IRQ will be held pending until the processor executes the STGI instruction. For the NMI window, the STGI intercept is set. This will assist in opening the window only when GIF=1. Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
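The two bit positions named above can be manipulated as in the following sketch. The macro and function names are hypothetical, and modelling the VMCB field at offset 60h as a 32-bit control word is an assumption; only the bit numbers (25 for enable, 9 for the virtual GIF value) come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

#define VGIF_ENABLE_BIT 25	/* per the message: bit 25 enables the feature */
#define VGIF_VALUE_BIT   9	/* per the message: bit 9 holds the virtual GIF */

/* Hypothetical helper: turn the feature on in the control word. */
static uint32_t vgif_enable(uint32_t ctl)
{
	return ctl | (UINT32_C(1) << VGIF_ENABLE_BIT);
}

/* Hypothetical helper: set or clear the virtual GIF, as STGI/CLGI would. */
static uint32_t vgif_set(uint32_t ctl, bool gif)
{
	uint32_t mask = UINT32_C(1) << VGIF_VALUE_BIT;

	return gif ? (ctl | mask) : (ctl & ~mask);
}

/* Hypothetical helper: read back the current virtual GIF value. */
static bool vgif_get(uint32_t ctl)
{
	return (ctl >> VGIF_VALUE_BIT) & 1u;
}
```

With the feature enabled, the hardware updates the value bit itself when the guest executes STGI or CLGI, so the helpers above only model what software sees in the control word.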
2017-08-23  KVM: SVM: Add Virtual GIF feature definition  Janakarajan Natarajan
Add a new cpufeature definition for Virtual GIF. Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-23  x86/ioapic: Print the IRTE's index field correctly when enabling INTR  raymond pang
When enabling interrupt remap, IOAPIC's RTE contains the interrupt_index field of IRTE. This field is composed of the ->index and the ->index2 members of 'struct IR_IO_APIC_route_entry' - but what we print out currently only uses ->index. Fix it. Signed-off-by: Raymond Pang <raymondpangxd@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: joro@8bytes.org Cc: linux-arch@vger.kernel.org Link: http://lkml.kernel.org/r/CAHG4imNDzpDyOVi7MByVrLQ%3DQFuOVqpzJ5F-Xs5z6OZphubj-Q@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-22  Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6  Herbert Xu
Merge the crypto tree to resolve the conflict between the temporary and long-term fixes in algif_skcipher.
2017-08-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  David S. Miller
2017-08-21  x86/CPU: Align CR3 defines  Borislav Petkov
Align them vertically for better readability and use BIT_ULL() macro. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Link: http://lkml.kernel.org/r/20170821080651.4527-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-21  x86/build: Use cc-option to validate stack alignment parameter  Matthias Kaehlcke
With the following commit: 8f91869766c0 ("x86/build: Fix stack alignment for CLang") cc-option is only used to determine the name of the stack alignment option supported by the compiler, but not to verify that the actual parameter <option>=N is valid in combination with the other CFLAGS. This causes problems (as reported by the kbuild robot) with older GCC versions which only support stack alignment on a boundary of 16 bytes or higher. Also use (__)cc_option to add the stack alignment option to CFLAGS to make sure only valid options are added. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bernhard.Rosenkranzer@linaro.org Cc: Greg Hackmann <ghackmann@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Michael Davidson <md@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hines <srhines@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dianders@chromium.org Fixes: 8f91869766c0 ("x86/build: Fix stack alignment for CLang") Link: http://lkml.kernel.org/r/20170817182047.176752-1-mka@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-20  Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds
Pull x86 fixes from Thomas Gleixner:

"Another pile of small fixes and updates for x86:

- Plug a hole in the SMAP implementation which misses to clear AC on NMI entry

- Fix the norandmaps/ADDR_NO_RANDOMIZE logic so the command line parameter works correctly again

- Use the proper accessor in the startup64 code for next_early_pgt to prevent accessing of invalid addresses and faulting in the early boot code.

- Prevent CPU hotplug lock recursion in the MTRR code

- Unbreak CPU0 hotplugging

- Rename overly long CPUID bits which got introduced in this cycle

- Two commits which mark data 'const' and restrict the scope of data and functions to file scope by making them 'static'"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Constify attribute_group structures
  x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt'
  x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks
  x86: Fix norandmaps/ADDR_NO_RANDOMIZE
  x86/mtrr: Prevent CPU hotplug lock recursion
  x86: Mark various structures and functions as 'static'
  x86/cpufeature, kvm/svm: Rename (shorten) the new "virtualized VMSAVE/VMLOAD" CPUID flag
  x86/smpboot: Unbreak CPU0 hotplug
  x86/asm/64: Clear AC on NMI entries
2017-08-20  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds
Pull perf fixes from Thomas Gleixner:

"Two fixes for the perf subsystem:

- Fix an inconsistency of RDPMC mm struct tagging across exec() which causes RDPMC to fault.

- Correct the timestamp mechanics across IOC_DISABLE/ENABLE which causes incorrect timestamps and total time calculations"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix time on IOC_ENABLE
  perf/x86: Fix RDPMC vs. mm_struct tracking
2017-08-20  Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds
Pull watchdog fix from Thomas Gleixner:

"A fix for the hardlockup watchdog to prevent false positives with extreme Turbo-Modes which make the perf/NMI watchdog fire faster than the hrtimer which is used to verify.

Slightly larger than the minimal fix, which just would increase the hrtimer frequency, but comes with extra overhead of more watchdog timer interrupts and thread wakeups for all users.

With this change we restrict the overhead to the extreme Turbo-Mode systems"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kernel/watchdog: Prevent false positives with turbo modes
2017-08-18  mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes  Kees Cook
Moving the x86_64 and arm64 PIE base from 0x555555554000 to 0x000100000000 broke AddressSanitizer. This is a partial revert of: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE") 02445990a96e ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB") The AddressSanitizer tool has hard-coded expectations about where executable mappings are loaded. The motivation for changing the PIE base in the above commits was to avoid the Stack-Clash CVEs that allowed executable mappings to get too close to heap and stack. This was mainly a problem on 32-bit, but the 64-bit bases were moved too, in an effort to proactively protect those systems (proofs of concept do exist that show 64-bit collisions, but other recent changes to fix stack accounting and setuid behaviors will minimize the impact). The new 32-bit PIE base is fine for ASan (since it matches the ET_EXEC base), so only the 64-bit PIE base needs to be reverted to let x86 and arm64 ASan binaries run again. Future changes to the 64-bit PIE base on these architectures can be made optional once a more dynamic method for dealing with AddressSanitizer is found. (e.g. always loading PIE into the mmap region for marked binaries.) Link: http://lkml.kernel.org/r/20170807201542.GA21271@beast Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE") Fixes: 02445990a96e ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Kostya Serebryany <kcc@google.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18kernel/watchdog: fix Kconfig constraints for perf hardlockup watchdogNicholas Piggin
Commit 05a4a9527931 ("kernel/watchdog: split up config options") lost the perf-based hardlockup detector's dependency on PERF_EVENTS, which can result in broken builds with some powerpc configurations.

Restore the dependency. Add it for x86 too: although x86 always selects PERF_EVENTS, it seems reasonable to make the dependency explicit.

Link: http://lkml.kernel.org/r/20170810114452.6673-1-npiggin@gmail.com
Fixes: 05a4a9527931 ("kernel/watchdog: split up config options")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18KVM: VMX: always require WB memory type for EPTDavid Hildenbrand
We already always set that type but don't check if it is supported. Also for nVMX, we only support WB for now. Let's just require it. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18KVM: VMX: cleanup EPTP definitionsDavid Hildenbrand
Don't use shifts, tag them correctly as EPTP and use better matching names (PWL vs. GAW). Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18KVM: SVM: delete avic_vm_id_bitmap (2 megabyte static array)Denys Vlasenko
With lightly tweaked defconfig:

    text    data     bss      dec     hex filename
11259661 5109408 2981888 19350957 12745ad vmlinux.before
11259661 5109408  884736 17253805 10745ad vmlinux.after

Only compile-tested.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: pbonzini@redhat.com
Cc: rkrcmar@redhat.com
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
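The size figures in the message above can be cross-checked with a little arithmetic. This is only an illustrative sketch: the numbers are copied from the commit message, the variable names are made up here, and the bit count assumes the removed array was a plain one-bit-per-entry bitmap, as its name suggests.

```python
# bss sizes in bytes, copied from the "vmlinux.before" / "vmlinux.after" rows.
bss_before = 2981888
bss_after = 884736

# Bytes of static storage that disappear with the array.
saved_bytes = bss_before - bss_after

# Assumption: a plain bitmap, i.e. eight entries tracked per byte.
saved_bits = saved_bytes * 8

print("bss reduction in bytes:", saved_bytes)
print("bss reduction in MiB:", saved_bytes / (1024 * 1024))
print("bits covered by the bitmap (assumed layout):", saved_bits)
```

The reduction matches the "2 megabyte static array" in the subject line, and the text and data columns are unchanged between the two builds.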
2017-08-18KVM: x86: fix use of L1 MMIO areas in nested guestsPaolo Bonzini
There is currently some confusion between nested and L1 GPAs. The assignment to "direct" in kvm_mmu_page_fault tries to fix that, but it is not enough. What this patch does is fence off the MMIO cache completely when using shadow nested page tables, since we have neither a GVA nor an L1 GPA to put in the cache. This also allows some simplifications in kvm_mmu_page_fault and FNAME(page_fault). The EPT misconfig likewise does not have an L1 GPA to pass to kvm_io_bus_write, so that must be skipped for guest mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> [Changed comment to say "GPAs" instead of "L1's physical addresses", as per David's review. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18KVM: x86: Avoid guest page table walk when gpa_available is setBrijesh Singh
When a guest causes a page fault which requires emulation, the vcpu->arch.gpa_available flag is set to indicate that cr2 contains a valid GPA.

Currently, emulator_read_write_onepage() makes use of the gpa_available flag to avoid a guest page walk for known MMIO regions. Let's not limit the gpa_available optimization to just MMIO regions. The patch extends the check to avoid the page walk whenever the gpa_available flag is set.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
[Fix EPT=0 according to Wanpeng Li's fix, plus ensure VMX also uses the new code. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Moved "ret < 0" to the else branch, as per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18KVM: x86: simplify ept_misconfigPaolo Bonzini
Calling handle_mmio_page_fault() has been unnecessary since commit e9ee956e311d ("KVM: x86: MMU: Move handle_mmio_page_fault() call to kvm_mmu_page_fault()", 2016-02-22). handle_mmio_page_fault() can now be made static. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18kernel/watchdog: Prevent false positives with turbo modesThomas Gleixner
The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumes a hard lockup.

The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives.

A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired.

Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI.

That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups.

Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector")
Reported-and-tested-by: Kan Liang <Kan.liang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: dzickus@redhat.com
Cc: prarit@redhat.com
Cc: ak@linux.intel.com
Cc: babu.moger@oracle.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: acme@redhat.com
Cc: stable@vger.kernel.org
Cc: atomlin@redhat.com
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
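The timing argument in this message can be modelled with a toy simulation. The sketch below is not the kernel code: the function name, the integer time units and the choice of 15000 units for the nominal watchdog period are assumptions made for illustration. It models a healthy CPU whose hrtimer always fires on schedule, and counts how many NMIs would see "no hrtimer tick since the last check".

```python
def count_false_positives(perf_period, hrtimer_period, min_gap, duration):
    """Count NMIs that would wrongly report a lockup on a healthy CPU.

    perf_period    -- wall-clock time between perf NMIs (shrinks under turbo)
    hrtimer_period -- time between hrtimer ticks (fixed, based on nominal freq)
    min_gap        -- low-pass filter threshold; 0 disables the filter
    duration       -- total simulated time
    All values are integers in the same arbitrary time unit.
    """
    false_positives = 0
    last_seen = None   # hrtimer tick count observed at the last checked NMI
    last_checked = 0   # time of the last NMI that was actually checked
    for now in range(perf_period, duration + 1, perf_period):
        if now - last_checked < min_gap:
            continue   # NMI came too early: ignore it, wait for the next one
        last_checked = now
        ticks = now // hrtimer_period  # hrtimer ticks that have fired so far
        if ticks == last_seen:
            false_positives += 1       # no tick since last check: "lockup"
        last_seen = ticks
    return false_positives


NOMINAL = 15000              # hypothetical watchdog period at nominal frequency
HRTIMER = NOMINAL * 2 // 5   # hrtimer runs at about 2/5 of that period
TURBO = NOMINAL // 3         # perf period when the CPU runs at 3x nominal
FILTER = NOMINAL * 4 // 5    # ignore NMIs arriving before 4/5 of the period

print("nominal, no filter:", count_false_positives(NOMINAL, HRTIMER, 0, 600000))
print("3x turbo, no filter:", count_false_positives(TURBO, HRTIMER, 0, 600000))
print("3x turbo, filtered:", count_false_positives(TURBO, HRTIMER, FILTER, 600000))
```

In this model the unfiltered turbo case reports lockups because two consecutive NMIs can land inside one hrtimer interval, while the filtered case only checks NMIs that are at least 4/5 of the nominal period apart, so every check sees at least one new hrtimer tick.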
2017-08-18x86: Constify attribute_group structuresArvind Yadav
attribute_groups are not supposed to change at runtime and none of the groups is modified. Mark the non-const structs as const. [ tglx: Folded into one big patch ] Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: tony.luck@intel.com Cc: bp@alien8.de Link: http://lkml.kernel.org/r/1500550238-15655-2-git-send-email-arvind.yadav.cs@gmail.com
2017-08-18Merge branch 'x86/asm' into locking/coreIngo Molnar
We need the ASM_UNREACHABLE() macro for a dependent patch. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17Merge tag 'pm-4.13-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix two issues related to exposing the current CPU frequency to user space on x86.

  Specifics:

   - Disable interrupts around reading IA32_APERF and IA32_MPERF in aperfmperf_snapshot_khz() (introduced recently) to avoid excessive delays between the reads that may result from interrupt handling (Doug Smythies).

   - Fix the computation of the CPU frequency to be reported through the pstate_sample tracepoint in intel_pstate (Doug Smythies)"

* tag 'pm-4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: x86: Disable interrupts during MSRs reading
  cpufreq: intel_pstate: report correct CPU frequencies during trace
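A rough illustration of why a delay between the two MSR reads matters. It assumes the usual relation that the average frequency over an interval is approximately the base frequency scaled by the ratio of the APERF delta to the MPERF delta. The function name and all numbers are hypothetical and are not taken from the kernel sources.

```python
def effective_khz(base_khz, aperf_delta, mperf_delta):
    """Estimate average frequency as base_khz * (APERF delta / MPERF delta)."""
    return base_khz * aperf_delta // mperf_delta


BASE_KHZ = 2_000_000     # hypothetical 2 GHz base frequency

# Both counters sampled back to back: the ratio reflects the real speed.
aperf_delta = 3_000_000
mperf_delta = 2_000_000
clean = effective_khz(BASE_KHZ, aperf_delta, mperf_delta)

# Suppose an interrupt is handled after APERF was read but before MPERF is
# read: MPERF keeps counting during the handler, so its delta is inflated.
extra_mperf_ticks = 1_000_000
skewed = effective_khz(BASE_KHZ, aperf_delta, mperf_delta + extra_mperf_ticks)

print("back-to-back reads:", clean, "kHz")
print("interrupted reads: ", skewed, "kHz")
```

Keeping the two reads adjacent, for example by disabling interrupts around them as the first fix does, keeps both deltas tied to the same interval.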