summaryrefslogtreecommitdiff
path: root/arch/powerpc/kernel
AgeCommit message (Collapse)Author
2017-05-03powerpc/book3s/mce: Move add_taint() later in virtual modeMahesh Salgaonkar
machine_check_early() gets called in real mode. The very first time when add_taint() is called, it prints a warning which ends up calling opal call (that uses OPAL_CALL wrapper) for writing it to console. If we get a very first machine check while we are in opal we are doomed. OPAL_CALL overwrites the PACASAVEDMSR in r13 and in this case when we are done with MCE handling the original opal call will use this new MSR on it's way back to opal_return. This usually leads to unexpected behaviour or the kernel to panic. Instead move the add_taint() call later in the virtual mode where it is safe to call. This is broken with current FW level. We got lucky so far for not getting very first MCE hit while in OPAL. But easily reproducible on Mambo. Fixes: 27ea2c420cad ("powerpc: Set the correct kernel taint on machine check errors.") Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function bodyMichael Ellerman
The entire body of unregister_cpu_online() is inside an #ifdef CONFIG_HOTPLUG_CPU block. This is ugly and means we create an empty function when hotplug is disabled for no reason. Instead move the #ifdef out of the function body and define the function to be NULL in the else case. This means we'll pass NULL to cpuhp_setup_state(), but that's fine because it accepts NULL to mean there is no teardown callback, which is exactly what we want. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-03powerpc/smp: Document irq enable/disable after migrating IRQsMichael Ellerman
This code was until recently completely undocumented and even now the comment is not very verbose. We've already had one patch sent to remove the IRQ enable/disable because it's "paradoxical and unnecessary". So document it thoroughly to save anyone else from puzzling over it. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching Pull livepatch updates from Jiri Kosina: - a per-task consistency model is being added for architectures that support reliable stack dumping (extending this, currently rather trivial set, is currently in the works). This extends the nature of the types of patches that can be applied by live patching infrastructure. The code stems from the design proposal made [1] back in November 2014. It's a hybrid of SUSE's kGraft and RH's kpatch, combining advantages of both: it uses kGraft's per-task consistency and syscall barrier switching combined with kpatch's stack trace switching. There are also a number of fallback options which make it quite flexible. Most of the heavy lifting done by Josh Poimboeuf with help from Miroslav Benes and Petr Mladek [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz - module load time patch optimization from Zhou Chengming - a few assorted small fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: add missing printk newlines livepatch: Cancel transition a safe way for immediate patches livepatch: Reduce the time of finding module symbols livepatch: make klp_mutex proper part of API livepatch: allow removal of a disabled patch livepatch: add /proc/<pid>/patch_state livepatch: change to a per-task consistency model livepatch: store function sizes livepatch: use kstrtobool() in enabled_store() livepatch: move patching functions into patch.c livepatch: remove unnecessary object loaded check livepatch: separate enabled and patched states livepatch/s390: add TIF_PATCH_PENDING thread flag livepatch/s390: reorganize TIF thread flag bits livepatch/powerpc: add TIF_PATCH_PENDING thread flag livepatch/x86: add TIF_PATCH_PENDING thread flag livepatch: create temporary klp_update_patch_state() stub x86/entry: define _TIF_ALLWORK_MASK flags explicitly stacktrace/x86: add function for detecting reliable stack traces
2017-05-02Merge tag 'pstore-v4.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: "This has a large internal refactoring along with several smaller fixes. - constify compression structures; Bhumika Goyal - restore powerpc dumping; Ankit Kumar - fix more bugs in the rarely exercises module unloading logic - reorganize filesystem locking to fix problems noticed by lockdep - refactor internal pstore APIs to make development and review easier: - improve error reporting - add kernel-doc structure and function comments - avoid insane argument passing by using a common record structure" * tag 'pstore-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits) pstore: Solve lockdep warning by moving inode locks pstore: Fix flags to enable dumps on powerpc pstore: Remove unused vmalloc.h in pmsg pstore: simplify write_user_compat() pstore: Remove write_buf() callback pstore: Replace arguments for write_buf_user() API pstore: Replace arguments for write_buf() API pstore: Replace arguments for erase() API pstore: Do not duplicate record metadata pstore: Allocate records on heap instead of stack pstore: Pass record contents instead of copying pstore: Always allocate buffer for decompression pstore: Replace arguments for write() API pstore: Replace arguments for read() API pstore: Switch pstore_mkfile to pass record pstore: Move record decompression to function pstore: Extract common arguments into structure pstore: Add kernel-doc for struct pstore_info pstore: Improve register_pstore() error reporting pstore: Avoid race in module unloading ...
2017-05-02powerpc/eeh: Clean up and document event handling functionsRussell Currey
Remove unnecessary tags in eeh_handle_normal_event(), and add function comments for eeh_handle_normal_event() and eeh_handle_special_event(). The only functional difference is that in the case of a PE reaching the maximum number of failures, rather than one message telling you of this and suggesting you reseat the device, there are two separate messages. Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Russell Currey <ruscur@russell.cc> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-02powerpc/eeh: Avoid use after free in eeh_handle_special_event()Russell Currey
eeh_handle_special_event() is called when an EEH event is detected but can't be narrowed down to a specific PE. This function looks through every PE to find one in an erroneous state, then calls the regular event handler eeh_handle_normal_event() once it knows which PE has an error. However, if eeh_handle_normal_event() found that the PE cannot possibly be recovered, it will free it, rendering the passed PE stale. This leads to a use after free in eeh_handle_special_event() as it attempts to clear the "recovering" state on the PE after eeh_handle_normal_event() returns. Thus, make sure the PE is valid when attempting to clear state in eeh_handle_special_event(). Fixes: 8a6b1bc70dbb ("powerpc/eeh: EEH core to handle special event") Cc: stable@vger.kernel.org # v3.11+ Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Russell Currey <ruscur@russell.cc> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-05-01Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Ingo Molnar: "The main changes in this cycle were: - unwinder fixes and enhancements - improve ftrace interaction with the unwinder - optimize the code footprint of WARN() and related debugging constructs - ... plus misc updates, cleanups and fixes" * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) x86/unwind: Dump all stacks in unwind_dump() x86/unwind: Silence more entry-code related warnings x86/ftrace: Fix ebp in ftrace_regs_caller that screws up unwinder x86/unwind: Remove unused 'sp' parameter in unwind_dump() x86/unwind: Prepend hex mask value with '0x' in unwind_dump() x86/unwind: Properly zero-pad 32-bit values in unwind_dump() x86/unwind: Ensure stack pointer is aligned debug: Avoid setting BUGFLAG_WARNING twice x86/unwind: Silence entry-related warnings x86/unwind: Read stack return address in update_stack_state() x86/unwind: Move common code into update_stack_state() debug: Fix __bug_table[] in arch linker scripts debug: Add _ONCE() logic to report_bug() x86/debug: Define BUG() again for !CONFIG_BUG x86/debug: Implement __WARN() using UD0 x86/ftrace: Use Makefile logic instead of #ifdef for compiling ftrace_*.o x86/ftrace: Add -mfentry support to x86_32 with DYNAMIC_FTRACE set x86/ftrace: Clean up ftrace_regs_caller x86/ftrace: Add stack frame pointer to ftrace_caller x86/ftrace: Move the ftrace specific code out of entry_32.S ...
2017-05-01Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - another round of rq-clock handling debugging, robustization and fixes - PELT accounting improvements - CPU hotplug related ->cpus_allowed affinity handling fixes all around the tree - ... plus misc fixes, cleanups and updates" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits) sched/x86: Update reschedule warning text crypto: N2 - Replace racy task affinity logic cpufreq/sparc-us2e: Replace racy task affinity logic cpufreq/sparc-us3: Replace racy task affinity logic cpufreq/sh: Replace racy task affinity logic cpufreq/ia64: Replace racy task affinity logic ACPI/processor: Replace racy task affinity logic ACPI/processor: Fix error handling in __acpi_processor_start() sparc/sysfs: Replace racy task affinity logic powerpc/smp: Replace open coded task affinity logic ia64/sn/hwperf: Replace racy task affinity logic ia64/salinfo: Replace racy task affinity logic workqueue: Provide work_on_cpu_safe() ia64/topology: Remove cpus_allowed manipulation sched/fair: Move the PELT constants into a generated header sched/fair: Increase PELT accuracy for small tasks sched/fair: Fix comments sched/Documentation: Add 'sched-pelt' tool sched/fair: Fix corner case in __accumulate_sum() sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags() ...
2017-05-01Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "The timer departement delivers: - more year 2038 rework - a massive rework of the arm achitected timer - preparatory patches to allow NTP correction of clock event devices to avoid early expiry - the usual pile of fixes and enhancements all over the place" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (91 commits) timer/sysclt: Restrict timer migration sysctl values to 0 and 1 arm64/arch_timer: Mark errata handlers as __maybe_unused Clocksource/mips-gic: Remove redundant non devicetree init MIPS/Malta: Probe gic-timer via devicetree clocksource: Use GENMASK_ULL in definition of CLOCKSOURCE_MASK acpi/arm64: Add SBSA Generic Watchdog support in GTDT driver clocksource: arm_arch_timer: add GTDT support for memory-mapped timer acpi/arm64: Add memory-mapped timer support in GTDT driver clocksource: arm_arch_timer: simplify ACPI support code. acpi/arm64: Add GTDT table parse driver clocksource: arm_arch_timer: split MMIO timer probing. clocksource: arm_arch_timer: add structs to describe MMIO timer clocksource: arm_arch_timer: move arch_timer_needs_of_probing into DT init call clocksource: arm_arch_timer: refactor arch_timer_needs_probing clocksource: arm_arch_timer: split dt-only rate handling x86/uv/time: Set ->min_delta_ticks and ->max_delta_ticks unicore32/time: Set ->min_delta_ticks and ->max_delta_ticks um/time: Set ->min_delta_ticks and ->max_delta_ticks tile/time: Set ->min_delta_ticks and ->max_delta_ticks score/time: Set ->min_delta_ticks and ->max_delta_ticks ...
2017-04-30powerpc/64e: Fix hang when debugging programs with relocated kernelLiuHailong
Debug interrupts can be taken during interrupt entry, since interrupt entry does not automatically turn them off. The kernel will check whether the faulting instruction is between [interrupt_base_book3e, __end_interrupts], and if so clear MSR[DE] and return. However, when the kernel is built with CONFIG_RELOCATABLE, it can't use LOAD_REG_IMMEDIATE(r14,interrupt_base_book3e) and LOAD_REG_IMMEDIATE(r15,__end_interrupts), as they ignore relocation. Thus, if the kernel is actually running at a different address than it was built at, the address comparison will fail, and the exception entry code will hang at kernel_dbg_exc. r2(toc) is also not usable here, as r2 still holds data from the interrupted context, so LOAD_REG_ADDR() doesn't work either. So we use the *name@got* to get the EV of two labels directly. Test programs test.c shows as follows: int main(int argc, char *argv[]) { if (access("/proc/sys/kernel/perf_event_paranoid", F_OK) == -1) printf("Kernel doesn't have perf_event support\n"); } Steps to reproduce the bug, for example: 1) ./gdb ./test 2) (gdb) b access 3) (gdb) r 4) (gdb) s Signed-off-by: Liu Hailong <liu.hailong6@zte.com.cn> Signed-off-by: Jiang Xuexin <jiang.xuexin@zte.com.cn> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Reviewed-by: Liu Song <liu.song11@zte.com.cn> Reviewed-by: Huang Jian <huang.jian@zte.com.cn> [scottwood: cleaned up commit message, and specified bad behavior as a hang rather than an oops to correspond to mainline kernel behavior] Fixes: 1cb6e0649248 ("powerpc/book3e: support CONFIG_RELOCATABLE") Cc: <stable@vger.kernel.org> # 4.4.x- Signed-off-by: Scott Wood <oss@buserror.net>
2017-04-28Merge branch 'pci/resource-mmap' into nextBjorn Helgaas
* pci/resource-mmap: ia64: Use generic pci_mmap_resource_range() ia64: Remove redundant checks for WC in pci_mmap_page_range() ia64: Remove redundant valid_mmap_phys_addr_range() from pci_mmap_page_range() PCI: Add I/O BAR support to generic pci_mmap_resource_range() x86/PCI: Use generic pci_mmap_resource_range() unicore32/PCI: Use generic pci_mmap_resource_range() sh/PCI: Use generic pci_mmap_resource_range() parisc: Use generic pci_mmap_resource_range() mn10300/PCI: Use generic pci_mmap_resource_range() MIPS: PCI: Use generic pci_mmap_resource_range() cris/PCI: Use generic pci_mmap_resource_range() ARM/PCI: Use generic pci_mmap_resource_range() PCI: Add pci_mmap_resource_range() and use it for ARM64 PCI: Add BAR index argument to pci_mmap_page_range() PCI: Use BAR index in sysfs attr->private instead of resource pointer PCI: Add arch_can_pci_mmap_io() on architectures which can mmap() I/O space PCI: Move multiple declarations of pci_mmap_page_range() to <linux/pci.h> PCI: Add arch_can_pci_mmap_wc() macro xtensa/PCI: Do not mmap PCI BARs to userspace as write-through PCI: Only allow WC mmap on prefetchable resources PCI: Fix another sanity check bug in /proc/pci mmap PCI: Fix pci_mmap_fits() for HAVE_PCI_RESOURCE_TO_USER platforms
2017-04-28powerpc: Add struct smp_ops_t.cause_nmi_ipi operationNicholas Piggin
Have the NMI IPI code use this op when the platform defines it. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc: Add NMI IPI infrastructureNicholas Piggin
Add a simple NMI IPI system that handles concurrency and reentrancy. The platform does not have to implement a true non-maskable interrupt, the default is to simply use the debugger break IPI message. This has now been co-opted for a general IPI message, and users (debugger and crash) have been reimplemented on top of the NMI system. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Incorporate incremental fixes from Nick] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc: Mark system reset as an NMI with nmi_enter/exit()Nicholas Piggin
System reset is a non-maskable interrupt from Linux's point of view (occurs under local_irq_disable()), so it should use nmi_enter/exit. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/64s: Dedicated system reset interrupt stackNicholas Piggin
The system reset interrupt is used for crash/debug situations, so it is desirable to have as little impact on the normal state of the system as possible. Currently it uses the current kernel stack to process the exception. This stores into the stack which may be involved with the crash. The stack pointer may be corrupted, or it may have overflowed. Avoid or minimise these problems by creating a dedicated NMI stack for the system reset interrupt to use. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/64s: Disallow system reset vs system reset reentrancyNicholas Piggin
In preparation for using a dedicated stack for system reset interrupts, prevent a nested system reset from recovering, in order to simplify code that is called in crash/debug path. This allows a system reset interrupt to just use the base stack pointer. Keep an in_nmi nesting counter similarly to the in_mce counter. Consider the interrrupt non-recoverable if it is taken inside another system reset. Interrupt nesting could be allowed similarly to MCE, but system reset is a special case that's not for normal operation, so simplicity wins until there is requirement for nested system reset interrupts. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/64s: Fix system reset vs general interrupt reentrancyNicholas Piggin
The system reset interrupt can occur when MSR_EE=0, and it currently uses the PACA_EXGEN save area. Some PACA_EXGEN interrupts have a window where MSR_RI=1 and MSR_EE=0 when the save area is still in use. A system reset interrupt in this window can lead to undetected corruption when the save area gets overwritten. This patch introduces PACA_EXNMI save area for system reset exceptions, which closes this corruption window. It's also helpful to retain the EXGEN state for debugging situations, even if not considering the recoverability aspect. This patch also moves the PACA_EXMC area down to a less frequently used part of the paca with the new save area. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/64s: Exception macro for stack frame and initial register saveNicholas Piggin
This code is common to a few exceptions, and another user will be added. This causes a trivial change to generated code: - 604: std r9,416(r1) - 608: mfspr r11,314 - 60c: std r11,368(r1) - 610: mfspr r12,315 + 604: mfspr r11,314 + 608: mfspr r12,315 + 60c: std r9,416(r1) + 610: std r11,368(r1) machine_check_powernv_early could also use this, but that requires non trivial changes to generated code, so that's for another patch. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28powerpc/64s: Add exception macro that does not enable RINicholas Piggin
Subsequent patches will add more non-RI variant exceptions, so create a macro for it rather than open-code it. This does not change generated instructions. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-28Merge branch 'topic/ppc-kvm' into nextMichael Ellerman
Merge the topic branch we were sharing with kvm-ppc, Paul has also merged it.
2017-04-28Merge remote-tracking branch 'remotes/powerpc/topic/xive' into kvm-ppc-nextPaul Mackerras
This merges in the powerpc topic/xive branch to bring in the code for the in-kernel XICS interrupt controller emulation to use the new XIVE (eXternal Interrupt Virtualization Engine) hardware in the POWER9 chip directly, rather than via a XICS emulation in firmware. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-04-27pstore: Fix flags to enable dumps on powerpcAnkit Kumar
After commit c950fd6f201a kernel registers pstore write based on flag set. Pstore write for powerpc is broken as flags(PSTORE_FLAGS_DMESG) is not set for powerpc architecture. On panic, kernel doesn't write message to /fs/pstore/dmesg*(Entry doesn't gets created at all). This patch enables pstore write for powerpc architecture by setting PSTORE_FLAGS_DMESG flag. Fixes: c950fd6f201a ("pstore: Split pstore fragile flags") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Ankit Kumar <ankit@linux.vnet.ibm.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2017-04-27powerpc/ftrace/64: Split further based on -mprofile-kernelNaveen N. Rao
Split ftrace_64.S further retaining the core ftrace 64-bit aspects in ftrace_64.S and moving ftrace_caller() and ftrace_graph_caller() into separate files based on -mprofile-kernel. The livepatch routines are all now contained within the mprofile file. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27powerpc: Split ftrace bits into a separate fileNaveen N. Rao
entry_*.S now includes a lot more than just kernel entry/exit code. As a first step at cleaning this up, let's split out the ftrace bits into separate files. Also move all related tracing code into a new trace/ subdirectory. No functional changes. Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-27KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controllerBenjamin Herrenschmidt
This patch makes KVM capable of using the XIVE interrupt controller to provide the standard PAPR "XICS" style hypercalls. It is necessary for proper operations when the host uses XIVE natively. This has been lightly tested on an actual system, including PCI pass-through with a TG3 device. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Cleanup pr_xxx(), unsplit pr_xxx() strings, etc., fix build failures by adding KVM_XIVE which depends on KVM_XICS and XIVE, and adding empty stubs for the kvm_xive_xxx() routines, fixup subject, integrate fixes from Paul for building PR=y HV=n] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-26powerpc/64s: Revert setting of LPCR[LPES] on POWER9Nicholas Piggin
The XIVE enablement patches included a change to set the LPES (Logical Partitioning Environment Selector) bit (bit # 3) in LPCR (Logical Partitioning Control Register) on POWER9 hosts. This bit sets external interrupts to guest delivery mode, which uses SRR0/1. The host's EE interrupt handler is written to expect HSRR0/1 (for earlier CPUs). This should be fine because XIVE is configured not to deliver EEs to the host (Hypervisor Virtulization Interrupt is used instead) so the EE handler should never be executed. However a bug in interrupt controller code, hardware, or odd configuration of a simulator could result in the host getting an EE incorrectly. Keeping the EE delivery mode matching the host EE handler prevents strange crashes due to using the wrong exception registers. KVM will configure the LPCR to set LPES prior to running a guest so that EEs are delivered to the guest using SRR0/1. Fixes: 08a1e650cc ("powerpc: Fixup LPCR:PECE and HEIC setting on POWER9") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Massage change log to avoid referring to LPES0 which is now renamed LPES] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-25powerpc/sysfs: Fix reference leak of cpu device_nodes present at bootTyrel Datwyler
For CPUs present at boot each logical CPU acquires a reference to the associated device node of the core. This happens in register_cpu() which is called by topology_init(). The result of this is that we end up with a reference held by each thread of the core. However, these references are never freed if the CPU core is DLPAR removed. This patch fixes the reference leaks by acquiring and releasing the references in the CPU hotplug callbacks un/register_cpu_online(). With this patch symmetric reference counting is observed with both CPUs present at boot, and those DLPAR added after boot. Fixes: f86e4718f24b ("driver/core: cpu: initialize of_node in cpu's device struture") Cc: stable@vger.kernel.org # v3.12+ Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-25Merge branch 'topic/kprobes' into nextMichael Ellerman
Although most of these kprobes patches are powerpc specific, there's a couple that touch generic code (with Acks). At the moment there's one conflict with acme's tree, but it's not too bad. Still just in case some other conflicts show up, we've put these in a topic branch so another tree could merge some or all of it if necessary.
2017-04-24powerpc/kprobes: Prefer ftrace when probing function entryNaveen N. Rao
KPROBES_ON_FTRACE avoids much of the overhead of regular kprobes as it eliminates the need for a trap, as well as the need to emulate or single-step instructions. Though OPTPROBES provides us with similar performance, we have limited optprobes trampoline slots. As such, when asked to probe at a function entry, default to using the ftrace infrastructure. With: # cd /sys/kernel/debug/tracing # echo 'p _do_fork' > kprobe_events before patch: # cat ../kprobes/list c0000000000daf08 k _do_fork+0x8 [DISABLED] c000000000044fc0 k kretprobe_trampoline+0x0 [OPTIMIZED] and after patch: # cat ../kprobes/list c0000000000d074c k _do_fork+0xc [DISABLED][FTRACE] c0000000000412b0 k kretprobe_trampoline+0x0 [OPTIMIZED] Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-24powerpc: Introduce a new helper to obtain function entry pointsNaveen N. Rao
kprobe_lookup_name() is specific to the kprobe subsystem and may not always return the function entry point (in a subsequent patch for KPROBES_ON_FTRACE). For looking up function entry points, introduce a separate helper and use it in optprobes.c Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-24powerpc/kprobes: Add support for KPROBES_ON_FTRACENaveen N. Rao
Allow kprobes to be placed on ftrace _mcount() call sites. This optimization avoids the use of a trap, by riding on ftrace infrastructure. This depends on HAVE_DYNAMIC_FTRACE_WITH_REGS which depends on MPROFILE_KERNEL, which is only currently enabled on powerpc64le with newer toolchains. Based on the x86 code by Masami. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-24powerpc/ftrace: Restore LR from pt_regsNaveen N. Rao
Pass the real LR to the ftrace handler. This is needed for KPROBES_ON_FTRACE for the pre handlers. Also, with KPROBES_ON_FTRACE, the link register may be updated by the pre handlers or by a registed kretprobe. Honor updated LR by restoring it from pt_regs, rather than from the stack save area. Live patch and function graph continue to work fine with this change. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/kprobes: Blacklist exception handlersNaveen N. Rao
Introduce __head_end to mark end of the early fixed sections and use it to blacklist all exception handlers from kprobes. mpe: We do not need to do anything special for relocatable kernels, where the exception vectors are split from the main kernel, as the split vectors are already excluded by the check for kernel_text_address(). Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> [mpe: Move __head_end outside #ifdef 64-bit to unbreak the 32-bit build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/kprobes: Convert __kprobes to NOKPROBE_SYMBOL()Naveen N. Rao
Along similar lines as commit 9326638cbee2 ("kprobes, x86: Use NOKPROBE_SYMBOL() instead of __kprobes annotation"), convert __kprobes annotation to either NOKPROBE_SYMBOL() or nokprobe_inline. The latter forces inlining, in which case the caller needs to be added to NOKPROBE_SYMBOL(). Also: - blacklist arch_deref_entry_point(), and - convert a few regular inlines to nokprobe_inline in lib/sstep.c A key benefit is the ability to detect such symbols as being blacklisted. Before this patch: $ cat /sys/kernel/debug/kprobes/blacklist | grep read_mem $ perf probe read_mem Failed to write event: Invalid argument Error: Failed to add events. $ dmesg | tail -1 [ 3736.112815] Could not insert probe at _text+10014968: -22 After patch: $ cat /sys/kernel/debug/kprobes/blacklist | grep read_mem 0xc000000000072b50-0xc000000000072d20 read_mem $ perf probe read_mem read_mem is blacklisted function, skip it. Added new events: (null):(null) (on read_mem) probe:read_mem (on read_mem) You can now use it in all perf tools, such as: perf record -e probe:read_mem -aR sleep 1 $ grep " read_mem" /proc/kallsyms c000000000072b50 t read_mem c0000000005f3b40 t read_mem $ cat /sys/kernel/debug/kprobes/list c0000000005f3b48 k read_mem+0x8 [DISABLED] Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> [mpe: Minor change log formatting, fix up some conflicts] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/ftrace: Move stack setup and teardown code into ftrace_graph_caller()Naveen N. Rao
Move the stack setup and teardown code into ftrace_graph_caller(). This way, we don't incur the cost of setting it up unless function graph is enabled for this function. Also, remove the extraneous LR restore code after the function graph stub. LR has previously been restored and neither livepatch_handler() nor ftrace_graph_caller() return back here. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> [mpe: Drop bad change to non-mprofile-kernel version of ftrace_graph_caller] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/kprobes: Remove duplicate saving of MSRNaveen N. Rao
set_current_kprobe() already saves regs->msr into kprobe_saved_msr. Remove the redundant save. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/64s: Simplify POWER9 DD1 idle workaround codeNicholas Piggin
The idle workaround does not need to load PACATOC, and it does not need to be called within a nested function that requires LR to be saved. Load the PACATOC at entry to the idle wakeup. It does not matter which PACA this comes from, so it's okay to call before the workaround. Then apply the workaround to get the right PACA. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/64s: Idle POWER8 avoid full state loss recovery where possibleNicholas Piggin
If not all threads were in winkle, full state loss recovery is not necessary and can be avoided. A previous patch removed this optimisation due to some complexity with the implementation. Re-implement it by counting the number of threads in winkle with the per-core idle state. Only restore full state loss if all threads were in winkle. This has a small window of false positives right before threads execute winkle and just after they wake up, when the winkle count does not reflect the true number of threads in winkle. This is not a significant problem in comparison with even the minimum winkle duration. For correctness, a false positive is not a problem (only false negatives would be). Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/64s: Idle do not hold reservation longer than requiredNicholas Piggin
When taking the core idle state lock, grab it immediately like a regular lock, rather than adding more tests in there. Holding the lock keeps it stable, so there is no need to do it whole holding the reservation. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/64s: Expand core idle state bitsNicholas Piggin
In preparation for adding more bits to the core idle state word, move the lock bit up, and unlock by flipping the lock bit rather than masking off all but the thread bits. Add branch hints for atomic operations while we're here. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/64s: Fix POWER9 machine check handler from stop stateNicholas Piggin
The ISA specifies power save wakeup due to a machine check exception can cause a machine check interrupt (rather than the usual system reset interrupt). The machine check handler copes with this by doing low level machine check recovery without restoring full state from idle, then queues up a machine check event for logging, then directly executes the same idle instruction it woke from. This minimises the work done before recovery is performed. The problem is that it requires machine specific instructions and knowledge of the book3s idle code. Currently it only has code to handle POWER8 idle, so POWER9 crashes when trying to execute the P8 idle instructions which don't exist in ISAv3.0B. cpu 0x0: Vector: e40 (Emulation Assist) at [c0000000008f3810] pc: c000000000008380: machine_check_handle_early+0x130/0x2f0 lr: c00000000053a098: stop_loop+0x68/0xd0 sp: c0000000008f3a90 msr: 9000000000081001 current = 0xc0000000008a1080 paca = 0xc00000000ffd0000 softe: 0 irq_happened: 0x01 pid = 0, comm = swapper/0 Instead of going to sleep after recovery, do the usual idle wakeup and state restoration by calling into the normal idle wakeup path. This reuses the normal idle wakeup paths. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Reviewed-by: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/64s: Use alternative feature patchingNicholas Piggin
This reduces the number of nops for POWER8. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/64s: Stop using bit in HSPRG0 to test winkleNicholas Piggin
The POWER8 idle code has a neat trick of programming the power on engine to restore a low bit into HSPRG0, so idle wakeup code can test and see if it has been programmed this way and therefore lost all state. Restore time can be reduced if winkle has not been reached. However this messes with our r13 PACA pointer, and requires HSPRG0 to be written to. It also optimizes the slowest and most uncommon case at the expense of another SPR write in the common nap state wakeup. Remove this complexity and assume winkle sleeps always require a state restore. This speedup could be made entirely contained within the winkle idle code by counting per-core winkles and setting a thread bitmap when all have gone to winkle. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/64s: Move remaining system reset idle code into idle_book3s.SNicholas Piggin
No functional change. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-23powerpc/64s: Remove unnecessary relocation branch from idle handlerNicholas Piggin
The system reset idle handler system_reset_idle_common is relocated, so relocation is not required to branch to kvm_start_guest. The superfluous relocation does not result in incorrect code, but it does not compile outside of exception-64s.S (with fixed section definitions). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-20PCI: Add BAR index argument to pci_mmap_page_range()David Woodhouse
In all cases we know which BAR it is. Passing it in means that arch code (or generic code; watch this space) won't have to go looking for it again. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-04-20powerpc/kprobes: Emulate instructions on kprobe handler re-entryNaveen N. Rao
On kprobe handler re-entry, try to emulate the instruction rather than single stepping always. Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-20powerpc/kprobes: Factor out code to emulate instruction into a helperNaveen N. Rao
Factor out code to emulate instruction into a try_to_emulate() helper function. This makes no functional changes. Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-20powerpc/kretprobes: Override default function entry offsetNaveen N. Rao
With ABIv2, we offset 8 bytes into a function to get at the local entry point. mpe: NB this function is currently not called, the change to generic code to call it is being merged via the tip tree. Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>