summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2016-08-10x86/mm/pkeys: Fix compact mode by removing protection keys' XSAVE buffer ↵Dave Hansen
manipulation The Memory Protection Keys "rights register" (PKRU) is XSAVE-managed, and is saved/restored along with the FPU state. When kernel code accesses FPU regsisters, it does a delicate dance with preempt. Otherwise, the context switching code can get confused as to whether the most up-to-date state is in the registers themselves or in the XSAVE buffer. But, PKRU is not a normal FPU register. Using it does not generate the normal device-not-available (#NM) exceptions which means we can not manage it lazily, and the kernel completley disallows using lazy mode when it is enabled. The dance with preempt *only* occurs when managing the FPU lazily. Since we never manage PKRU lazily, we do not have to do the dance with preempt; we can access it directly. Doing it this way saves a ton of complicated code (and is faster too). Further, the XSAVES reenabling failed to patch a bit of code in fpu__xfeature_set_state() the checked for compacted buffers. That check caused fpu__xfeature_set_state() to silently refuse to work when the kernel is using compacted XSAVE buffers. This broke execute-only and future pkey_mprotect() support when using compact XSAVE buffers. But, removing fpu__xfeature_set_state() gets rid of this issue, in addition to the nice cleanup and speedup. This fixes the same thing as a fix that Sai posted: https://lkml.org/lkml/2016/7/25/637 The fix that he posted is a much more obviously correct, but I think we should just do this instead. Reported-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Ravi Shankar <ravi.v.shankar@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-Cheng Yu <yu-cheng.yu@intel.com> Link: http://lkml.kernel.org/r/20160727232040.7D060DAD@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/build: Reduce the W=1 warnings noise when compiling x86 syscall tablesValdis Kletnieks
Building an X86_64 kernel with W=1 throws a total of 9,948 lines of warnings of this form for both 32-bit and 64-bit syscall tables. Given that the entire rest of the build for my config only generates 8,375 lines of output, this is a big reduction in the warnings generated. The warnings follow this pattern: ./arch/x86/include/generated/asm/syscalls_32.h:885:21: warning: initialized field overwritten [-Woverride-init] __SYSCALL_I386(379, compat_sys_pwritev2, ) ^ arch/x86/entry/syscall_32.c:13:46: note: in definition of macro '__SYSCALL_I386' #define __SYSCALL_I386(nr, sym, qual) [nr] = sym, ^~~ ./arch/x86/include/generated/asm/syscalls_32.h:885:21: note: (near initialization for 'ia32_sys_call_table[379]') __SYSCALL_I386(379, compat_sys_pwritev2, ) ^ arch/x86/entry/syscall_32.c:13:46: note: in definition of macro '__SYSCALL_I386' #define __SYSCALL_I386(nr, sym, qual) [nr] = sym, Since we intentionally build the syscall tables this way, ignore that one warning in the two files. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/7464.1470021890@turing-police.cc.vt.edu Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/platform/UV: Fix kernel panic running RHEL kdump kernel on UV systemsMike Travis
The latest UV kernel support panics when RHEL7 kexec's the kdump kernel to make a dumpfile. This patch fixes the problem by turning off all UV support if NUMA is off. Tested-by: Frank Ramsay <framsay@sgi.com> Tested-by: John Estabrook <estabrook@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Reviewed-by: Dimitri Sivanich <sivanich@sgi.com> Reviewed-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160801184050.577755634@asylum.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/platform/UV: Fix problem with UV4 BIOS providing incorrect PXM valuesMike Travis
There are some circumstances where the UV4 BIOS cannot provide the correct Proximity Node values to associate with specific Sockets and Physical Nodes. The decision was made to remove these values from BIOS and for the kernel to get these values from the standard ACPI tables. Tested-by: Frank Ramsay <framsay@sgi.com> Tested-by: John Estabrook <estabrook@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Reviewed-by: Dimitri Sivanich <sivanich@sgi.com> Reviewed-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160801184050.414210079@asylum.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/platform/UV: Fix bug with iounmap() of the UV4 EFI System Table causing ↵Mike Travis
a crash Save the uv_systab::size field before doing the iounmap() of the struct pointer, to avoid a NULL dereference crash. Tested-by: Frank Ramsay <framsay@sgi.com> Tested-by: John Estabrook <estabrook@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Reviewed-by: Dimitri Sivanich <sivanich@sgi.com> Reviewed-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160801184050.250424783@asylum.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/platform/UV: Fix problem with UV4 Socket IDs not being contiguousMike Travis
The UV4 Socket IDs are not guaranteed to equate to Node values which can cause the GAM (Global Addressable Memory) table lookups to fail. Fix this by using an independent index into the GAM table instead of the Socket ID to reference the base address. Tested-by: Frank Ramsay <framsay@sgi.com> Tested-by: John Estabrook <estabrook@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Reviewed-by: Dimitri Sivanich <sivanich@sgi.com> Reviewed-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160801184050.048755337@asylum.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/entry: Clarify the RF saving/restoring situation with SYSCALL/SYSRETBorislav Petkov
Clarify why exactly RF cannot be restored properly by SYSRET to avoid confusion. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160803171429.GA2590@nazgul.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/mm: Disable preemption during CR3 read+writeSebastian Andrzej Siewior
There's a subtle preemption race on UP kernels: Usually current->mm (and therefore mm->pgd) stays the same during the lifetime of a task so it does not matter if a task gets preempted during the read and write of the CR3. But then, there is this scenario on x86-UP: TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by: -> mmput() -> exit_mmap() -> tlb_finish_mmu() -> tlb_flush_mmu() -> tlb_flush_mmu_tlbonly() -> tlb_flush() -> flush_tlb_mm_range() -> __flush_tlb_up() -> __flush_tlb() -> __native_flush_tlb() At this point current->mm is NULL but current->active_mm still points to the "old" mm. Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its own mm so CR3 has changed. Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's mm and so CR3 remains unchanged. Once taskA gets active it continues where it was interrupted and that means it writes its old CR3 value back. Everything is fine because userland won't need its memory anymore. Now the fun part: Let's preempt taskA one more time and get back to taskB. This time switch_mm() won't do a thing because oldmm (->active_mm) is the same as mm (as per context_switch()). So we remain with a bad CR3 / PGD and return to userland. The next thing that happens is handle_mm_fault() with an address for the execution of its code in userland. handle_mm_fault() realizes that it has a PTE with proper rights so it returns doing nothing. But the CPU looks at the wrong PGD and insists that something is wrong and faults again. And again. And one more time… This pagefault circle continues until the scheduler gets tired of it and puts another task on the CPU. It gets little difficult if the task is a RT task with a high priority. The system will either freeze or it gets fixed by the software watchdog thread which usually runs at RT-max prio. But waiting for the watchdog will increase the latency of the RT task which is no good. Fix this by disabling preemption across the critical code section. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de [ Prettified the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10selftests/powerpc: Specify we expect to build with std=gnu99Michael Ellerman
We have some tests that assume we're using std=gnu99, which is fine on most compilers, but some old compilers use a different default. So make it explicit that we want to use std=gnu99. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10ALSA: hda - Manage power well properly for resumeTakashi Iwai
For SKL and later Intel chips, we control the power well per codec basis via link_power callback since the commit [03b135cebc47: ALSA: hda - remove dependency on i915 power well for SKL]. However, there are a few exceptional cases where the gfx registers are accessed from the audio driver: namely the wakeup override bit toggling at (both system and runtime) resume. This seems causing a kernel warning when accessed during the power well down (and likely resulting in the bogus register accesses). This patch puts the proper power up / down sequence around the resume code so that the wakeup bit is fiddled properly while the power is up. (The other callback, sync_audio_rate, is used only in the PCM callback, so it's guaranteed in the power-on.) Also, by this proper power up/down, the instantaneous flip of wakeup bit in the resume callback that was introduced by the commit [033ea349a7cd: ALSA: hda - Fix Skylake codec timeout] becomes superfluous, as snd_hdac_display_power() already does it. So we can clean it up together. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96214 Fixes: 03b135cebc47 ('ALSA: hda - remove dependency on i915 power well for SKL') Cc: <stable@vger.kernel.org> # v4.2+ Tested-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2016-08-10powerpc/vdso: Fix build rules to rebuild vdsos correctlyNicholas Piggin
When using if_changed, we need to add FORCE as a dependency (see Documentation/kbuild/makefiles.txt) otherwise we don't get command line change checking amongst other things. This has resulted in vdsos not being rebuilt when switching between big and little endian. The vdso64/32ld commands have to be changed around to avoid pulling FORCE into the linker command line (code copied from x86). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10powerpc/Makefile: Use cflags-y/aflags-y for setting endian optionsMichael Ellerman
When we introduced the little endian support, we added the endian flags to CC directly using override. I don't know the history of why we did that, I suspect no one does. Although this mostly works, it has one bug, which is that CROSS32CC doesn't get -mbig-endian. That means when the compiler is little endian by default and the user is building big endian, vdso32 is incorrectly compiled as little endian and the kernel fails to build. Instead we can add the endian flags to cflags-y/aflags-y, and then append those to KBUILD_CFLAGS/KBUILD_AFLAGS. This has the advantage of being 1) less ugly, 2) the documented way of adding flags in the arch Makefile and 3) it fixes building vdso32 with a LE toolchain. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10x86/mm/KASLR: Increase BRK pages for KASLR memory randomizationThomas Garnier
Default implementation expects 6 pages maximum are needed for low page allocations. If KASLR memory randomization is enabled, the worse case of e820 layout would require 12 pages (no large pages). It is due to the PUD level randomization and the variable e820 memory layout. This bug was found while doing extensive testing of KASLR memory randomization on different type of hardware. Signed-off-by: Thomas Garnier <thgarnie@google.com> Cc: Aleksey Makarov <aleksey.makarov@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lv Zheng <lv.zheng@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: kernel-hardening@lists.openwall.com Fixes: 021182e52fe0 ("Enable KASLR for physical mapping memory regions") Link: http://lkml.kernel.org/r/1470762665-88032-2-git-send-email-thgarnie@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/mm/KASLR: Fix physical memory calculation on KASLR memory randomizationThomas Garnier
Initialize KASLR memory randomization after max_pfn is initialized. Also ensure the size is rounded up. It could create problems on machines with more than 1Tb of memory on certain random addresses. Signed-off-by: Thomas Garnier <thgarnie@google.com> Cc: Aleksey Makarov <aleksey.makarov@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lv Zheng <lv.zheng@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: kernel-hardening@lists.openwall.com Fixes: 021182e52fe0 ("Enable KASLR for physical mapping memory regions") Link: http://lkml.kernel.org/r/1470762665-88032-1-git-send-email-thgarnie@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/hpet: Fix /dev/rtc breakage caused by RTC cleanupArnd Bergmann
Ville Syrjälä reports "The first time I run hwclock after rebooting I get this: open("/dev/rtc", O_RDONLY) = 3 ioctl(3, PHN_SET_REGS or RTC_UIE_ON, 0) = 0 select(4, [3], NULL, NULL, {10, 0}) = 0 (Timeout) ioctl(3, PHN_NOT_OH or RTC_UIE_OFF, 0) = 0 close(3) = 0 On all subsequent runs I get this: open("/dev/rtc", O_RDONLY) = 3 ioctl(3, PHN_SET_REGS or RTC_UIE_ON, 0) = -1 EINVAL (Invalid argument) ioctl(3, RTC_RD_TIME, 0x7ffd76b3ae70) = -1 EINVAL (Invalid argument) close(3) = 0" This was caused by a stupid typo in a patch that should have been a simple rename to move around contents of a header file, but accidentally wrote zeroes into the rtc rather than reading from it: 463a86304cae ("char/genrtc: x86: remove remnants of asm/rtc.h") Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com> Cc: Borislav Petkov <bp@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: rtc-linux@googlegroups.com Fixes: 463a86304cae ("char/genrtc: x86: remove remnants of asm/rtc.h") Link: http://lkml.kernel.org/r/20160809195528.1604312-1-arnd@arndb.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10Merge branch 'linus' into timers/urgent, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86, kasan, ftrace: Put APIC interrupt handlers into .irqentry.textAlexander Potapenko
Dmitry Vyukov has reported unexpected KASAN stackdepot growth: https://github.com/google/kasan/issues/36 ... which is caused by the APIC handlers not being present in .irqentry.text: When building with CONFIG_FUNCTION_GRAPH_TRACER=y or CONFIG_KASAN=y, put the APIC interrupt handlers into the .irqentry.text section. This is needed because both KASAN and function graph tracer use __irqentry_text_start and __irqentry_text_end to determine whether a function is an IRQ entry point. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: aryabinin@virtuozzo.com Cc: kasan-dev@googlegroups.com Cc: kcc@google.com Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/1468575763-144889-1-git-send-email-glider@google.com [ Minor edits. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10locking/pvqspinlock: Fix a bug in qstat_read()Pan Xinhui
It's obviously wrong to set stat to NULL. So lets remove it. Otherwise it is always zero when we check the latency of kick/wake. Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <Waiman.Long@hpe.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1468405414-3700-1-git-send-email-xinhui.pan@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10locking/pvqspinlock: Fix double hash raceWanpeng Li
When the lock holder vCPU is racing with the queue head: CPU 0 (lock holder) CPU1 (queue head) =================== ================= spin_lock(); spin_lock(); pv_kick_node(): pv_wait_head_or_lock(): if (!lp) { lp = pv_hash(lock, pn); xchg(&l->locked, _Q_SLOW_VAL); } WRITE_ONCE(pn->state, vcpu_halted); cmpxchg(&pn->state, vcpu_halted, vcpu_hashed); WRITE_ONCE(l->locked, _Q_SLOW_VAL); (void)pv_hash(lock, pn); In this case, lock holder inserts the pv_node of queue head into the hash table and set _Q_SLOW_VAL unnecessary. This patch avoids it by restoring/setting vcpu_hashed state after failing adaptive locking spinning. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <Waiman.Long@hpe.com> Link: http://lkml.kernel.org/r/1468484156-4521-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10locking/qrwlock: Fix write unlock bug on big endian systemspan xinhui
This patch aims to get rid of endianness in queued_write_unlock(). We want to set __qrwlock->wmode to NULL, however the address is not &lock->cnts in big endian machine. That causes queued_write_unlock() write NULL to the wrong field of __qrwlock. So implement __qrwlock_write_byte() which returns the correct __qrwlock->wmode address. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman.Long@hpe.com Cc: arnd@arndb.de Cc: boqun.feng@gmail.com Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1468835259-4486-1-git-send-email-xinhui.pan@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10Merge branch 'linus' into locking/urgent, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10sched/deadline: Fix lock pinning warning during CPU hotplugWanpeng Li
The following warning can be triggered by hot-unplugging the CPU on which an active SCHED_DEADLINE task is running on: WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3531 lock_release+0x690/0x6a0 releasing a pinned lock Call Trace: dump_stack+0x99/0xd0 __warn+0xd1/0xf0 ? dl_task_timer+0x1a1/0x2b0 warn_slowpath_fmt+0x4f/0x60 ? sched_clock+0x13/0x20 lock_release+0x690/0x6a0 ? enqueue_pushable_dl_task+0x9b/0xa0 ? enqueue_task_dl+0x1ca/0x480 _raw_spin_unlock+0x1f/0x40 dl_task_timer+0x1a1/0x2b0 ? push_dl_task.part.31+0x190/0x190 WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3649 lock_unpin_lock+0x181/0x1a0 unpinning an unpinned lock Call Trace: dump_stack+0x99/0xd0 __warn+0xd1/0xf0 warn_slowpath_fmt+0x4f/0x60 lock_unpin_lock+0x181/0x1a0 dl_task_timer+0x127/0x2b0 ? push_dl_task.part.31+0x190/0x190 As per the comment before this code, its safe to drop the RQ lock here, and since we (potentially) change rq, unpin and repin to avoid the splat. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [ Rewrote changelog. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@unitn.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1470274940-17976-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10sched/cputime: Mitigate performance regression in times()/clock_gettime()Giovanni Gherdovich
Commit: 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") fixed a problem whereby clock_nanosleep() followed by clock_gettime() could allow a task to wake early. It addressed the problem by calling the scheduling classes update_curr() when the cputimer starts. Said change induced a considerable performance regression on the syscalls times() and clock_gettimes(CLOCK_PROCESS_CPUTIME_ID). There are some debuggers and applications that monitor their own performance that accidentally depend on the performance of these specific calls. This patch mitigates the performace loss by prefetching data in the CPU cache, as stalls due to cache misses appear to be where most time is spent in our benchmarks. Here are the performance gain of this patch over v4.7-rc7 on a Sandy Bridge box with 32 logical cores and 2 NUMA nodes. The test is repeated with a variable number of threads, from 2 to 4*num_cpus; the results are in seconds and correspond to the average of 10 runs; the percentage gain is computed with (before-after)/before so a positive value is an improvement (it's faster). The improvement varies between a few percents for 5-20 threads and more than 10% for 2 or >20 threads. pound_clock_gettime: threads 4.7-rc7 patched 4.7-rc7 [num] [secs] [secs (percent)] 2 3.48 3.06 ( 11.83%) 5 3.33 3.25 ( 2.40%) 8 3.37 3.26 ( 3.30%) 12 3.32 3.37 ( -1.60%) 21 4.01 3.90 ( 2.74%) 30 3.63 3.36 ( 7.41%) 48 3.71 3.11 ( 16.27%) 79 3.75 3.16 ( 15.74%) 110 3.81 3.25 ( 14.80%) 128 3.88 3.31 ( 14.76%) pound_times: threads 4.7-rc7 patched 4.7-rc7 [num] [secs] [secs (percent)] 2 3.65 3.25 ( 11.03%) 5 3.45 3.17 ( 7.92%) 8 3.52 3.22 ( 8.69%) 12 3.29 3.36 ( -2.04%) 21 4.07 3.92 ( 3.78%) 30 3.87 3.40 ( 12.17%) 48 3.79 3.16 ( 16.61%) 79 3.88 3.28 ( 15.42%) 110 3.90 3.38 ( 13.35%) 128 4.00 3.38 ( 15.45%) pound_clock_gettime and pound_clock_gettime are two benchmarks included in the MMTests framework. They launch a given number of threads which repeatedly call times() or clock_gettimes(). The results above can be reproduced with cloning MMTests from github.com and running the "poundtime" workload: $ git clone https://github.com/gormanm/mmtests.git $ cd mmtests $ cp configs/config-global-dhp__workload_poundtime config $ ./run-mmtests.sh --run-monitor $(uname -r) The above will run "poundtime" measuring the kernel currently running on the machine; Once a new kernel is installed and the machine rebooted, running again $ cd mmtests $ ./run-mmtests.sh --run-monitor $(uname -r) will produce results to compare with. A comparison table will be output with: $ cd mmtests/work/log $ ../../compare-kernels.sh the table will contain a lot of entries; grepping for "Amean" (as in "arithmetic mean") will give the tables presented above. The source code for the two benchmarks is reported at the end of this changelog for clairity. The cache misses addressed by this patch were found using a combination of `perf top`, `perf record` and `perf annotate`. The incriminated lines were found to be struct sched_entity *curr = cfs_rq->curr; and delta_exec = now - curr->exec_start; in the function update_curr() from kernel/sched/fair.c. This patch prefetches the data from memory just before update_curr is called in the interested execution path. A comparison of the total number of cycles before and after the patch follows; the data is obtained using `perf stat -r 10 -ddd <program>` running over the same sequence of number of threads used above (a positive gain is an improvement): threads cycles before cycles after gain 2 19,699,563,964 +-1.19% 17,358,917,517 +-1.85% 11.88% 5 47,401,089,566 +-2.96% 45,103,730,829 +-0.97% 4.85% 8 80,923,501,004 +-3.01% 71,419,385,977 +-0.77% 11.74% 12 112,326,485,473 +-0.47% 110,371,524,403 +-0.47% 1.74% 21 193,455,574,299 +-0.72% 180,120,667,904 +-0.36% 6.89% 30 315,073,519,013 +-1.64% 271,222,225,950 +-1.29% 13.92% 48 321,969,515,332 +-1.48% 273,353,977,321 +-1.16% 15.10% 79 337,866,003,422 +-0.97% 289,462,481,538 +-1.05% 14.33% 110 338,712,691,920 +-0.78% 290,574,233,170 +-0.77% 14.21% 128 348,384,794,006 +-0.50% 292,691,648,206 +-0.66% 15.99% A comparison of cache miss vs total cache loads ratios, before and after the patch (again from the `perf stat -r 10 -ddd <program>` tables): threads L1 misses/total*100 L1 misses/total*100 gain before after 2 7.43 +-4.90% 7.36 +-4.70% 0.94% 5 13.09 +-4.74% 13.52 +-3.73% -3.28% 8 13.79 +-5.61% 12.90 +-3.27% 6.45% 12 11.57 +-2.44% 8.71 +-1.40% 24.72% 21 12.39 +-3.92% 9.97 +-1.84% 19.53% 30 13.91 +-2.53% 11.73 +-2.28% 15.67% 48 13.71 +-1.59% 12.32 +-1.97% 10.14% 79 14.44 +-0.66% 13.40 +-1.06% 7.20% 110 15.86 +-0.50% 14.46 +-0.59% 8.83% 128 16.51 +-0.32% 15.06 +-0.78% 8.78% As a final note, the following shows the evolution of performance figures in the "poundtime" benchmark and pinpoints commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a major source of degradation, mostly unaddressed to this day (figures expressed in seconds). pound_clock_gettime: threads parent of 6e998916dfe3 4.7-rc7 6e998916dfe3 itself 2 2.23 3.68 ( -64.56%) 3.48 (-55.48%) 5 2.83 3.78 ( -33.42%) 3.33 (-17.43%) 8 2.84 4.31 ( -52.12%) 3.37 (-18.76%) 12 3.09 3.61 ( -16.74%) 3.32 ( -7.17%) 21 3.14 4.63 ( -47.36%) 4.01 (-27.71%) 30 3.28 5.75 ( -75.37%) 3.63 (-10.80%) 48 3.02 6.05 (-100.56%) 3.71 (-22.99%) 79 2.88 6.30 (-118.90%) 3.75 (-30.26%) 110 2.95 6.46 (-119.00%) 3.81 (-29.24%) 128 3.05 6.42 (-110.08%) 3.88 (-27.04%) pound_times: threads parent of 6e998916dfe3 4.7-rc7 6e998916dfe3 itself 2 2.27 3.73 ( -64.71%) 3.65 (-61.14%) 5 2.78 3.77 ( -35.56%) 3.45 (-23.98%) 8 2.79 4.41 ( -57.71%) 3.52 (-26.05%) 12 3.02 3.56 ( -17.94%) 3.29 ( -9.08%) 21 3.10 4.61 ( -48.74%) 4.07 (-31.34%) 30 3.33 5.75 ( -72.53%) 3.87 (-16.01%) 48 2.96 6.06 (-105.04%) 3.79 (-28.10%) 79 2.88 6.24 (-116.83%) 3.88 (-34.81%) 110 2.98 6.37 (-114.08%) 3.90 (-31.12%) 128 3.10 6.35 (-104.61%) 4.00 (-28.87%) The source code of the two benchmarks follows. To compile the two: NR_THREADS=42 for FILE in pound_times pound_clock_gettime; do gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE done ==== BEGIN pound_times.c ==== struct tms start; void *pound (void *threadid) { struct tms end; int oldutime = 0; int utime; int i; for (i = 0; i < 5000000 / NUM_THREADS; i++) { times(&end); utime = ((int)end.tms_utime - (int)start.tms_utime); if (oldutime > utime) { printf("utime decreased, was %d, now %d!\n", oldutime, utime); } oldutime = utime; } pthread_exit(NULL); } int main() { pthread_t th[NUM_THREADS]; long i; times(&start); for (i = 0; i < NUM_THREADS; i++) { pthread_create (&th[i], NULL, pound, (void *)i); } pthread_exit(NULL); return 0; } ==== END pound_times.c ==== ==== BEGIN pound_clock_gettime.c ==== void *pound (void *threadid) { struct timespec ts; int rc, i; unsigned long prev = 0, this = 0; for (i = 0; i < 5000000 / NUM_THREADS; i++) { rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts); if (rc < 0) perror("clock_gettime"); this = (ts.tv_sec * 1000000000) + ts.tv_nsec; if (0 && this < prev) printf("%lu ns timewarp at iteration %d\n", prev - this, i); prev = this; } pthread_exit(NULL); } int main() { pthread_t th[NUM_THREADS]; long rc, i; pid_t pgid; for (i = 0; i < NUM_THREADS; i++) { rc = pthread_create(&th[i], NULL, pound, (void *)i); if (rc < 0) perror("pthread_create"); } pthread_exit(NULL); return 0; } ==== END pound_clock_gettime.c ==== Suggested-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10sched/fair: Fix typo in sync_throttle()Xunlei Pang
We should update cfs_rq->throttled_clock_task, not pcfs_rq->throttle_clock_task. The effects of this bug was probably occasionally erratic group scheduling, particularly in cgroups-intense workloads. Signed-off-by: Xunlei Pang <xlpang@redhat.com> [ Added changelog. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 55e16d30bd99 ("sched/fair: Rework throttle_count sync") Link: http://lkml.kernel.org/r/1468050862-18864-1-git-send-email-xlpang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10sched/deadline: Fix wrap-around in DL heapTommaso Cucinotta
Current code in cpudeadline.c has a bug in re-heapifying when adding a new element at the end of the heap, because a deadline value of 0 is temporarily set in the new elem, then cpudl_change_key() is called with the actual elem deadline as param. However, the function compares the new deadline to set with the one previously in the elem, which is 0. So, if current absolute deadlines grew so much to have negative values as s64, the comparison in cpudl_change_key() makes the wrong decision. Instead, as from dl_time_before(), the kernel should handle correctly abs deadlines wrap-arounds. This patch fixes the problem with a minimally invasive change that forces cpudl_change_key() to heapify up in this case. Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Luca Abeni <luca.abeni@unitn.it> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1468921493-10054-2-git-send-email-tommaso.cucinotta@sssup.it Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10perf/core: Set cgroup in CPU contexts for new cgroup eventsDavid Carrillo-Cisneros
There's a perf stat bug easy to observer on a machine with only one cgroup: $ perf stat -e cycles -I 1000 -C 0 -G / # time counts unit events 1.000161699 <not counted> cycles / 2.000355591 <not counted> cycles / 3.000565154 <not counted> cycles / 4.000951350 <not counted> cycles / We'd expect some output there. The underlying problem is that there is an optimization in perf_cgroup_sched_{in,out}() that skips the switch of cgroup events if the old and new cgroups in a task switch are the same. This optimization interacts with the current code in two ways that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a cgroup event matches the current task. These are: 1. On creation of the first cgroup event in a CPU: In current code, cpuctx->cpu is only set in perf_cgroup_sched_in, but due to the aforesaid optimization, perf_cgroup_sched_in will run until the next cgroup switches in that CPU. This may happen late or never happen, depending on system's number of cgroups, CPU load, etc. 2. On deletion of the last cgroup event in a cpuctx: In list_del_event, cpuctx->cgrp is set NULL. Any new cgroup event will not be sched in because cpuctx->cgrp == NULL until a cgroup switch occurs and perf_cgroup_sched_in is executed (updating cpuctx->cgrp). This patch fixes both problems by setting cpuctx->cgrp in list_add_event, mirroring what list_del_event does when removing a cgroup event from CPU context, as introduced in: commit 68cacd29167b ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()") With this patch, cpuctx->cgrp is always set/clear when installing/removing the first/last cgroup event in/from the CPU context. With cpuctx->cgrp correctly set, event_filter_match works as intended when events are sched in/out. After the fix, the output is as expected: $ perf stat -e cycles -I 1000 -a -G / # time counts unit events 1.004699159 627342882 cycles / 2.007397156 615272690 cycles / 3.010019057 616726074 cycles / Signed-off-by: David Carrillo-Cisneros <davidcc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1470124092-113192-1-git-send-email-davidcc@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10perf/core: Fix sideband list-iteration vs. event ordering NULL pointer ↵Peter Zijlstra
deference crash Vegard Nossum reported that perf fuzzing generates a NULL pointer dereference crash: > Digging a bit deeper into this, it seems the event itself is getting > created by perf_event_open() and it gets added to the pmu_event_list > through: > > perf_event_open() > - perf_event_alloc() > - account_event() > - account_pmu_sb_event() > - attach_sb_event() > > so at this point the event is being attached but its ->ctx is still > NULL. It seems like ->ctx is set just a bit later in > perf_event_open(), though. > > But before that, __schedule() comes along and creates a stack trace > similar to the one above: > > __schedule() > - __perf_event_task_sched_out() > - perf_iterate_sb() > - perf_iterate_sb_cpu() > - event_filter_match() > - perf_cgroup_match() > - __get_cpu_context() > - (dereference ctx which is NULL) > > So I guess the question is... should the event be attached (= put on > the list) before ->ctx gets set? Or should the cgroup code check for a > NULL ->ctx? The latter seems like the simplest solution. Moving the list-add later creates a bit of a mess. Reported-by: Vegard Nossum <vegard.nossum@gmail.com> Tested-by: Vegard Nossum <vegard.nossum@gmail.com> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: f2fb6bef9251 ("perf/core: Optimize side-band event delivery") Link: http://lkml.kernel.org/r/20160804123724.GN6862@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/timers/apic: Inform TSC deadline clockevent device about recalibrationNicolai Stange
This patch eliminates a source of imprecise APIC timer interrupts, which imprecision may result in double interrupts or even late interrupts. The TSC deadline clockevent devices' configuration and registration happens before the TSC frequency calibration is refined in tsc_refine_calibration_work(). This results in the TSC clocksource and the TSC deadline clockevent devices being configured with slightly different frequencies: the former gets the refined one and the latter are configured with the inaccurate frequency detected earlier by means of the "Fast TSC calibration using PIT". Within the APIC code, introduce the notifier function lapic_update_tsc_freq() which reconfigures all per-CPU TSC deadline clockevent devices with the current tsc_khz. Call it from the TSC code after TSC calibration refinement has happened. Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Christopher S. Hall <christopher.s.hall@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20160714152255.18295-3-nicstange@gmail.com [ Pushed #ifdef CONFIG_X86_LOCAL_APIC into header, improved changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10x86/timers/apic: Fix imprecise timer interrupts by eliminating TSC ↵Nicolai Stange
clockevents frequency roundoff error I noticed the following bug/misbehavior on certain Intel systems: with a single task running on a NOHZ CPU on an Intel Haswell, I recognized that I did not only get the one expected local_timer APIC interrupt, but two per second at minimum. (!) Further tracing showed that the first one precedes the programmed deadline by up to ~50us and hence, it did nothing except for reprogramming the TSC deadline clockevent device to trigger shortly thereafter again. The reason for this is imprecise calibration, the timeout we program into the APIC results in 'too short' timer interrupts. The core (hr)timer code notices this (because it has a precise ktime source and sees the short interrupt) and fixes it up by programming an additional very short interrupt period. This is obviously suboptimal. The reason for the imprecise calibration is twofold, and this patch fixes the first reason: In setup_APIC_timer(), the registered clockevent device's frequency is calculated by first dividing tsc_khz by TSC_DIVISOR and multiplying it with 1000 afterwards: (tsc_khz / TSC_DIVISOR) * 1000 The multiplication with 1000 is done for converting from kHz to Hz and the division by TSC_DIVISOR is carried out in order to make sure that the final result fits into an u32. However, with the order given in this calculation, the roundoff error introduced by the division gets magnified by a factor of 1000 by the following multiplication. To fix it, reversing the order of the division and the multiplication a la: (tsc_khz * 1000) / TSC_DIVISOR ... reduces the roundoff error already. Furthermore, if TSC_DIVISOR divides 1000, associativity holds: (tsc_khz * 1000) / TSC_DIVISOR = tsc_khz * (1000 / TSC_DIVISOR) and thus, the roundoff error even vanishes and the whole operation can be carried out within 32 bits. The powers of two that divide 1000 are 2, 4 and 8. A value of 8 for TSC_DIVISOR still allows for TSC frequencies up to 2^32 / 10^9ns * 8 = 34.4GHz which is way larger than anything to expect in the next years. Thus we also replace the current TSC_DIVISOR value of 32 by 8. Reverse the order of the divison and the multiplication in the calculation of the registered clockevent device's frequency. Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Christopher S. Hall <christopher.s.hall@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Len Brown <len.brown@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20160714152255.18295-2-nicstange@gmail.com [ Improved changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10powerpc/32: Fix crash during static key initBenjamin Herrenschmidt
We cannot do those initializations from apply_feature_fixups() as this function runs in a very restricted environment on 32-bit where the kernel isn't running at its linked address and the PTRRELOC() macro must be used for any global accesss. Instead, split them into a separtate steup_feature_keys() function which is called in a more suitable spot on ppc32. Fixes: 309b315b6ec6 ("powerpc: Call jump_label_init() in apply_feature_fixups()") Reported-and-tested-by: Christian Kujau <lists@nerdbynature.de> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10powerpc: Update obsolete comment in setup_32.c about early_init()Benjamin Herrenschmidt
We don't identify the machine type anymore... Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10powerpc: Print the kernel load address at the end of prom_init()Benjamin Herrenschmidt
This makes it easier to debug crashes that happen very early before the kernel takes over Open Firmware by allowing us to relate the OF reported crashing addresses to offsets within the kernel. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10powerpc/ptrace: Fix coredump since ptrace TM changesCyril Bur
Commit 8d460f6156cd ("powerpc/process: Add the function flush_tmregs_to_thread") added flush_tmregs_to_thread() and included the assumption that it would only be called for a task which is not current. Although this is correct for ptrace, when generating a core dump, some of the routines which call flush_tmregs_to_thread() are called. This leads to a WARNing such as: Not expecting ptrace on self: TM regs may be incorrect ------------[ cut here ]------------ WARNING: CPU: 123 PID: 7727 at arch/powerpc/kernel/process.c:1088 flush_tmregs_to_thread+0x78/0x80 CPU: 123 PID: 7727 Comm: libvirtd Not tainted 4.8.0-rc1-gcc6x-g61e8a0d #1 task: c000000fe631b600 task.stack: c000000fe63b0000 NIP: c00000000001a1a8 LR: c00000000001a1a4 CTR: c000000000717780 REGS: c000000fe63b3420 TRAP: 0700 Not tainted (4.8.0-rc1-gcc6x-g61e8a0d) MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 28004222 XER: 20000000 ... NIP [c00000000001a1a8] flush_tmregs_to_thread+0x78/0x80 LR [c00000000001a1a4] flush_tmregs_to_thread+0x74/0x80 Call Trace: flush_tmregs_to_thread+0x74/0x80 (unreliable) vsr_get+0x64/0x1a0 elf_core_dump+0x604/0x1430 do_coredump+0x5fc/0x1200 get_signal+0x398/0x740 do_signal+0x54/0x2b0 do_notify_resume+0x98/0xb0 ret_from_except_lite+0x70/0x74 So fix flush_tmregs_to_thread() to detect the case where it is called on current, and a transaction is active, and in that case flush the TM regs to the thread_struct. This patch also moves flush_tmregs_to_thread() into ptrace.c as it is only called from that file. Fixes: 8d460f6156cd ("powerpc/process: Add the function flush_tmregs_to_thread") Signed-off-by: Cyril Bur <cyrilbur@gmail.com> [mpe: Flesh out change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10powerpc/32: Fix csum_partial_copy_generic()Christophe Leroy
Commit 7aef4136566b0 ("powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user()") introduced a bug when destination address is odd and initial csum is not null In that (rare) case the initial csum value has to be rotated one byte as well as the resulting value is This patch also fixes related comments Fixes: 7aef4136566b0 ("powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user()") Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10cxl: Set psl_fir_cntl to production environment valueFrederic Barrat
Switch the setting of psl_fir_cntl from debug to production environment recommended value. It mostly affects the PSL behavior when an error is raised in psl_fir1/2. Tested with cxlflash. Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Uma Krishnan <ukrishn@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-09mm, writeback: flush plugged IO in wakeup_flusher_threads()Konstantin Khlebnikov
I've found funny live-lock between raid10 barriers during resync and memory controller hard limits. Inside mpage_readpages() task holds on to its plug bio which blocks the barrier in raid10. Its memory cgroup have no free memory thus the task goes into reclaimer but all reclaimable pages are dirty and cannot be written because raid10 is rebuilding and stuck on the barrier. Common flush of such IO in schedule() never happens, because the caller doesn't go to sleep. Lock is 'live' because changing memory limit or killing tasks which holds that stuck bio unblock whole progress. That was what happened in 3.18.x but I see no difference in upstream logic. Theoretically this might happen even without memory cgroup. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-09Merge tag 'perf-urgent-for-mingo-20160809' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/urgent fixes from Arnaldo Carvalho de Melo: User visible fixes: - Fix the lookup for a kernel module in 'perf probe', fixing for instance, the erroneous return of "[raid10]" when looking for "[raid1]" (Konstantin Khlebnikov) - Disable counters in a group before reading them in 'perf stat', to avoid skew (Mark Rutland) - Fix adding probes to function aliases in systems using kaslr (Masami Hiramatsu) - Trip libtraceevent trace_seq buffers, removing unnecessary memory usage that could bring a system using tracepoint events with 'perf top' to a crawl, as the trace_seq buffers start at a whooping 4 KB, which is very rarely used in perf's usecases, so realloc it to the really used space as a last measure after using libtraceevent functions to format the fields of tracepoint events (Arnaldo Carvalho de Melo) - Fix 'perf probe' location when using DWARF on ppc64le (Ravi Bangoria) - Allow specifying signedness casts to a 'perf probe' variable, to shorten the number of steps to see signed values that otherwise would always appear as hex values (Naohiro Aota) Documentation fixes: - Add 'bpf-output' field to 'perf script' usage message (Brendan Gregg) Infrastructure fixes: - Sync kernel header files: cpufeatures.h, {disabled,required}-features.h, bpf.h and vmx.h, so that we get a clean build, without warnings about files being different from the kernel counterparts. A verification of the need or desirability of changes in tools/ based on what was done in the kernel changesets was made and documented in the respective file sync changesets (Arnaldo Carvalho de Melo) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-09Revert "printk: create pr_<level> functions"Linus Torvalds
This reverts commit 874f9c7da9a4acbc1b9e12ca722579fb50e4d142. Geert Uytterhoeven reports: "This change seems to have an (unintendent?) side-effect. Before, pr_*() calls without a trailing newline characters would be printed with a newline character appended, both on the console and in the output of the dmesg command. After this commit, no new line character is appended, and the output of the next pr_*() call of the same type may be appended, like in: - Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000 - Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM) + Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)" Joe Perches says: "No, that is not intentional. The newline handling code inside vprintk_emit is a bit involved and for now I suggest a revert until this has all the same behavior as earlier" Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Requested-by: Joe Perches <joe@perches.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-09Merge tag 'trace-v4.8-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Fix tick_stop tracepoint symbols for user export. Luiz Capitulino noticed that the tick_stop tracepoint wasn't being parsed properly by the tracing user space tools. This was due to the TRACE_DEFINE_ENUM() being set to a define, when it should have been set to the enum itself. The define was of the MASK that used the BIT to shift. The BIT was the enum and by adding that, everything gets converted nicely. The MASK is still kept just in case it gets converted to an enum in the future" * tag 'trace-v4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix tick_stop tracepoint symbols for user export
2016-08-09Merge tag 'gcc-plugin-infrastructure-v4.8-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull gcc plugin improvements from Kees Cook: "Several fixes/improvements for the gcc plugin infrastructure: - fix a problem with gcc plugins interfering with cc-option tests. - abort more gracefully when gcc plugin headers or compiler support is missing. - improve the gcc plugin rule generation to be more dynamic, pass arguments, and build from subdirectories" * tag 'gcc-plugin-infrastructure-v4.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: gcc-plugins: Add support for plugin subdirectories gcc-plugins: Automate make rule generation gcc-plugins: Add support for passing plugin arguments gcc-plugins: abort builds cleanly when not supported kbuild: no gcc-plugins during cc-option tests
2016-08-09Merge tag 'platform-drivers-x86-v4.8-3' of ↵Linus Torvalds
git://git.infradead.org/users/dvhart/linux-platform-drivers-x86 Pull x86 platform driver update from Darren Hart: "dell-wmi: ignore battery remove/insert event" * tag 'platform-drivers-x86-v4.8-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86: dell-wmi: Ignore WMI event 0xe00e
2016-08-09Merge tag 'drm-fixes-for-4.8-rc2' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm fixes from Dave Airlie: "This contains a bunch of amdgpu fixes, and some i915 regression fixes. It also contains some fixes for an older regression with some EDID changes and some 6bpc panels. Then there are the lockdep, cirrus and rcar-du regression fixes from this window" * tag 'drm-fixes-for-4.8-rc2' of git://people.freedesktop.org/~airlied/linux: drm/cirrus: Fix NULL pointer dereference when registering the fbdev drm/edid: Set 8 bpc color depth for displays with "DFP 1.x compliant TMDS". drm/i915/dp: Revert "drm/i915/dp: fall back to 18 bpp when sink capability is unknown" drm/edid: Add 6 bpc quirk for display AEO model 0. drm: Paper over locking inversion after registration rework drm: rcar-du: Link HDMI encoder with bridge drm/ttm: Wait for a BO to become idle before unbinding it from GTT drm/i915/fbdev: Check for the framebuffer before use drm/amdgpu: update golden setting of polaris10 drm/amdgpu: update golden setting of stoney drm/amdgpu: update golden setting of polaris11 drm/amdgpu: update golden setting of carrizo drm/amdgpu: update golden setting of iceland drm/amd/amdgpu: change pptable output format from ASCII to binary drm/amdgpu/ci: add mullins to default case for smc ucode drm/amdgpu/gmc7: add missing mullins case drm/i915: Never fully mask the the EI up rps interrupt on SNB/IVB drm/i915: Wait up to 3ms for the pcu to ack the cdclk change request on SKL
2016-08-09ipr: Fix sync scsi scanBrian King
Commit b195d5e2bffd ("ipr: Wait to do async scan until scsi host is initialized") fixed async scan for ipr, but broke sync scan for ipr. This fixes sync scan back up. Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Reported-and-tested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-09mm: memcontrol: only mark charged pages with PageKmemcgVladimir Davydov
To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg, which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated with __GFP_ACCOUNT, including those that aren't actually charged to any cgroup, i.e. allocated from the root cgroup context. To avoid overhead in case cgroups are not used, we only do that if memcg_kmem_enabled() is true. The latter is set iff there are kmem-enabled memory cgroups (online or offline). The root cgroup is not considered kmem-enabled. As a result, if a page is allocated with __GFP_ACCOUNT for the root cgroup when there are kmem-enabled memory cgroups and is freed after all kmem-enabled memory cgroups were removed, e.g. # no memory cgroups has been created yet, create one mkdir /sys/fs/cgroup/memory/test # run something allocating pages with __GFP_ACCOUNT, e.g. # a program using pipe dmesg | tail # remove the memory cgroup rmdir /sys/fs/cgroup/memory/test we'll get bad page state bug complaining about page->_mapcount != -1: BUG: Bad page state in process swapper/0 pfn:1fd945c page:ffffea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0 flags: 0x1000000000000000() To avoid that, let's mark with PageKmemcg only those pages that are actually charged to and hence pin a non-root memory cgroup. Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths") Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-09drivers/perf: arm-pmu: Fix handling of SPI lacking "interrupt-affinity" propertyMarc Zyngier
Patch 19a469a58720 ("drivers/perf: arm-pmu: Handle per-interrupt affinity mask") added support for partitionned PPI setups, but inadvertently broke setups using SPIs without the "interrupt-affinity" property (which is the case for UP platforms). This patch restore the broken functionnality by testing whether the interrupt is percpu or not instead of relying on the using_spi flag that really means "SPI *and* interrupt-affinity property". Acked-by: Mark Rutland <mark.rutland@arm.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 19a469a58720 ("drivers/perf: arm-pmu: Handle per-interrupt affinity mask") Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-08-09drivers/perf: arm-pmu: convert arm_pmu_mutex to spinlockSudeep Holla
arm_pmu_mutex is never held long and we don't want to sleep while the lock is being held as it's executed in the context of hotplug notifiers. So it can be converted to a simple spinlock instead. Without this patch we get the following warning: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620 in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/2 no locks held by swapper/2/0. irq event stamp: 381314 hardirqs last enabled at (381313): _raw_spin_unlock_irqrestore+0x7c/0x88 hardirqs last disabled at (381314): cpu_die+0x28/0x48 softirqs last enabled at (381294): _local_bh_enable+0x28/0x50 softirqs last disabled at (381293): irq_enter+0x58/0x78 CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.7.0 #12 Call trace: dump_backtrace+0x0/0x220 show_stack+0x24/0x30 dump_stack+0xb4/0xf0 ___might_sleep+0x1d8/0x1f0 __might_sleep+0x5c/0x98 mutex_lock_nested+0x54/0x400 arm_perf_starting_cpu+0x34/0xb0 cpuhp_invoke_callback+0x88/0x3d8 notify_cpu_starting+0x78/0x98 secondary_start_kernel+0x108/0x1a8 This patch converts the mutex to spinlock to eliminate the above warnings. This constraints pmu->reset to be non-blocking call which is the case with all the ARM PMU backends. Cc: Stephen Boyd <sboyd@codeaurora.org> Fixes: 37b502f121ad ("arm/perf: Fix hotplug state machine conversion") Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-08-09ceph: initialize pathbase in the !dentry case in encode_caps_cb()Ilya Dryomov
pathbase is the base inode; set it to 0 if we've got no path. Coverity-id: 146348 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-09rbd: nuke the 32-bit pool id checkIlya Dryomov
ceph_file_layout::pool_id is now s64. rbd_add_get_pool_id() and ceph_pg_poolid_by_name() both return an int, so it's bogus anyway. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
2016-08-09perf probe ppc64le: Fix probe location when using DWARFRavi Bangoria
Powerpc has Global Entry Point and Local Entry Point for functions. LEP catches call from both the GEP and the LEP. Symbol table of ELF contains GEP and Offset from which we can calculate LEP, but debuginfo does not have LEP info. Currently, perf prioritize symbol table over dwarf to probe on LEP for ppc64le. But when user tries to probe with function parameter, we fall back to using dwarf(i.e. GEP) and when function called via LEP, probe will never hit. For example: $ objdump -d vmlinux ... do_sys_open(): c0000000002eb4a0: e8 00 4c 3c addis r2,r12,232 c0000000002eb4a4: 60 00 42 38 addi r2,r2,96 c0000000002eb4a8: a6 02 08 7c mflr r0 c0000000002eb4ac: d0 ff 41 fb std r26,-48(r1) $ sudo ./perf probe do_sys_open $ sudo cat /sys/kernel/debug/tracing/kprobe_events p:probe/do_sys_open _text+3060904 $ sudo ./perf probe 'do_sys_open filename:string' $ sudo cat /sys/kernel/debug/tracing/kprobe_events p:probe/do_sys_open _text+3060896 filename_string=+0(%gpr4):string For second case, perf probed on GEP. So when function will be called via LEP, probe won't hit. $ sudo ./perf record -a -e probe:do_sys_open ls [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.195 MB perf.data ] To resolve this issue, let's not prioritize symbol table, let perf decide what it wants to use. Perf is already converting GEP to LEP when it uses symbol table. When perf uses debuginfo, let it find LEP offset form symbol table. This way we fall back to probe on LEP for all cases. After patch: $ sudo ./perf probe 'do_sys_open filename:string' $ sudo cat /sys/kernel/debug/tracing/kprobe_events p:probe/do_sys_open _text+3060904 filename_string=+0(%gpr4):string $ sudo ./perf record -a -e probe:do_sys_open ls [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.197 MB perf.data (11 samples) ] Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1470723805-5081-2-git-send-email-ravi.bangoria@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-08-09perf probe: Add function to post process kernel trace eventsRavi Bangoria
Instead of inline code, introduce function to post process kernel probe trace events. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1470723805-5081-1-git-send-email-ravi.bangoria@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>