summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-08-18kernel/watchdog: Prevent false positives with turbo modesThomas Gleixner
The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumess a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <Kan.liang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dzickus@redhat.com Cc: prarit@redhat.com Cc: ak@linux.intel.com Cc: babu.moger@oracle.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: acme@redhat.com Cc: stable@vger.kernel.org Cc: atomlin@redhat.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
2017-08-18genirq: Restore trigger settings in irq_modify_status()Marc Zyngier
irq_modify_status starts by clearing the trigger settings from irq_data before applying the new settings, but doesn't restore them, leaving them to IRQ_TYPE_NONE. That's pretty confusing to the potential request_irq() that could follow. Instead, snapshot the settings before clearing them, and restore them if the irq_modify_status() invocation was not changing the trigger. Fixes: 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ") Reported-and-tested-by: jeffy <jeffy.chen@rock-chips.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jon Hunter <jonathanh@nvidia.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170818095345.12378-1-marc.zyngier@arm.com
2017-08-18Merge branch 'irq/for-gpio' into irq/coreThomas Gleixner
Merge the flow handlers and irq domain extensions which are in a separate branch so they can be consumed by the gpio folks.
2017-08-18irqdomain: Add irq_domain_{push,pop}_irq() functionsDavid Daney
For an already existing irqdomain hierarchy, as might be obtained via a call to pci_enable_msix_range(), a PCI driver wishing to add an additional irqdomain to the hierarchy needs to be able to insert the irqdomain to that already initialized hierarchy. Calling irq_domain_create_hierarchy() allows the new irqdomain to be created, but no existing code allows for initializing the associated irq_data. Add a couple of helper functions (irq_domain_push_irq() and irq_domain_pop_irq()) to initialize the irq_data for the new irqdomain added to an existing hierarchy. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-gpio@vger.kernel.org Link: http://lkml.kernel.org/r/1503017616-3252-6-git-send-email-david.daney@cavium.com
2017-08-18irqdomain: Check for NULL function pointer in irq_domain_free_irqs_hierarchy()David Daney
A follow-on patch will call irq_domain_free_irqs_hierarchy() when the free() function pointer may be NULL. Add a NULL pointer check to handle this new use case. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-gpio@vger.kernel.org Link: http://lkml.kernel.org/r/1503017616-3252-5-git-send-email-david.daney@cavium.com
2017-08-18irqdomain: Factor out code to add and remove items to and from the revmapDavid Daney
The code to add and remove items to and from the revmap occurs several times. In preparation for the follow on patches that add more uses of this code, factor this out in to separate static functions. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-gpio@vger.kernel.org Link: http://lkml.kernel.org/r/1503017616-3252-4-git-send-email-david.daney@cavium.com
2017-08-18genirq: Add handle_fasteoi_{level,edge}_irq flow handlersDavid Daney
Follow-on patch for gpio-thunderx uses a irqdomain hierarchy which requires slightly different flow handlers, add them to chip.c which contains most of the other flow handlers. Make these conditionally compiled based on CONFIG_IRQ_FASTEOI_HIERARCHY_HANDLERS. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-gpio@vger.kernel.org Link: http://lkml.kernel.org/r/1503017616-3252-3-git-send-email-david.daney@cavium.com
2017-08-18genirq: Export more irq_chip_*_parent() functionsDavid Daney
Many of the family of functions including irq_chip_mask_parent(), irq_chip_unmask_parent() are exported, but not all. Add EXPORT_SYMBOL_GPL to irq_chip_enable_parent, irq_chip_disable_parent and irq_chip_set_affinity_parent, so they likewise are usable from modules. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-gpio@vger.kernel.org Link: http://lkml.kernel.org/r/1503017616-3252-2-git-send-email-david.daney@cavium.com
2017-08-18genirq/proc: Use the the accessor to report the effective affinityMarc Zyngier
If CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK is defined, but that the interrupt is not single target, the effective affinity reported in /proc/irq/x/effective_affinity will be empty, which is not the truth. Instead, use the accessor to report the affinity, which will pick the right mask. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: James Hogan <james.hogan@imgtec.com> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Chris Zankel <chris@zankel.net> Cc: Kevin Cernekee <cernekee@gmail.com> Cc: Wei Xu <xuwei5@hisilicon.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Gregory Clement <gregory.clement@free-electrons.com> Cc: Matt Redfearn <matt.redfearn@imgtec.com> Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com> Link: http://lkml.kernel.org/r/20170818083925.10108-3-marc.zyngier@arm.com
2017-08-18genirq/debugfs: Triggering of interrupts from userspaceMarc Zyngier
When developing new (and therefore buggy) interrupt related code, it can sometimes be useful to inject interrupts without having to rely on a device to actually generate them. This functionnality relies either on the irqchip driver to expose a irq_set_irqchip_state(IRQCHIP_STATE_PENDING) callback, or on the core code to be able to retrigger a (edge-only) interrupt. To use this feature: echo -n trigger > /sys/kernel/debug/irq/irqs/IRQNUM WARNING: This is DANGEROUS, and strictly a debug feature. Do not use it on a production system. Your HW is likely to catch fire, your data to be corrupted, and reporting this will make you look an even bigger fool than the idiot who wrote this patch. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170818081156.9264-1-marc.zyngier@arm.com
2017-08-18ACPI / PM: Check low power idle constraints for debug onlySrinivas Pandruvada
For SoC to achieve its lowest power platform idle state a set of hardware preconditions must be met. These preconditions or constraints can be obtained by issuing a device specific method (_DSM) with function "1". Refer to the document provided in the link below. Here during initialization (from attach() callback of LPS0 device), invoke function 1 to get the device constraints. Each enabled constraint is stored in a table. The devices in this table are used to check whether they were in required minimum state, while entering suspend. This check is done from platform freeze wake() callback, only when /sys/power/pm_debug_messages attribute is non zero. If any constraint is not met and device is ACPI power managed then it prints the device information to kernel logs. Also if debug is enabled in acpi/sleep.c, the constraint table and state of each device on wake is dumped in kernel logs. Since pm_debug_messages_on setting is used as condition to check constraints outside kernel/power/main.c, pm_debug_messages_on is changed to a global variable. Link: http://www.uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-18cpufreq: schedutil: Always process remote callback with slow switchingViresh Kumar
The frequency update from the utilization update handlers can be divided into two parts: (A) Finding the next frequency (B) Updating the frequency While any CPU can do (A), (B) can be restricted to a group of CPUs only, depending on the current platform. For platforms where fast cpufreq switching is possible, both (A) and (B) are always done from the same CPU and that CPU should be capable of changing the frequency of the target CPU. But for platforms where fast cpufreq switching isn't possible, after doing (A) we wake up a kthread which will eventually do (B). This kthread is already bound to the right set of CPUs, i.e. only those which can change the frequency of CPUs of a cpufreq policy. And so any CPU can actually do (A) in this case, as the frequency is updated from the right set of CPUs only. Check cpufreq_can_do_remote_dvfs() only for the fast switching case. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-18cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarilyViresh Kumar
Utilization update callbacks are now processed remotely, even on the CPUs that don't share cpufreq policy with the target CPU (if dvfs_possible_from_any_cpu flag is set). But in non-fast switch paths, the frequency is changed only from one of policy->related_cpus. This happens because the kthread which does the actual update is bound to a subset of CPUs (i.e. related_cpus). Allow frequency to be remotely updated as well (i.e. call __cpufreq_driver_target()) if dvfs_possible_from_any_cpu flag is set. Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-17Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps"Kees Cook
This reverts commit 68c4a4f8abc60c9440ede9cd123d48b78325f7a3, with various conflict clean-ups. The capability check required too much privilege compared to simple DAC controls. A system builder was forced to have crash handler processes run with CAP_SYSLOG which would give it the ability to read (and wipe) the _current_ dmesg, which is much more access than being given access only to the historical log stored in pstorefs. With the prior commit to make the root directory 0750, the files are protected by default but a system builder can now opt to give access to a specific group (via chgrp on the pstorefs root directory) without being forced to also give away CAP_SYSLOG. Suggested-by: Nick Kralevich <nnk@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
2017-08-17alarmtimer: Fix unavailable wake-up source in sysfsGeert Uytterhoeven
Currently the alarmtimer registers a wake-up source unconditionally, regardless of the system having a (wake-up capable) RTC or not. Hence the alarmtimer will always show up in /sys/kernel/debug/wakeup_sources, even if it is not available, and thus cannot be a wake-up source. To fix this, postpone registration until a wake-up capable RTC device is added. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-08-17timekeeping: Use proper timekeeper for debug codeStafford Horne
When CONFIG_DEBUG_TIMEKEEPING is enabled the timekeeping_check_update() function will update status like last_warning and underflow_seen on the timekeeper. If there are issues found this state is used to rate limit the warnings that get printed. This rate limiting doesn't really really work if stored in real_tk as the shadow timekeeper is overwritten onto real_tk at the end of every update_wall_time() call, resetting last_warning and other statuses. Fix rate limiting by using the shadow_timekeeper for timekeeping_check_update(). Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Fixes: commit 57d05a93ada7 ("time: Rework debugging variables so they aren't global") Signed-off-by: Stafford Horne <shorne@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2017-08-17bpf: don't enable preemption twice in smap_do_verdictDaniel Borkmann
In smap_do_verdict(), the fall-through branch leads to call preempt_enable() twice for the SK_REDIRECT, which creates an imbalance. Only enable it for all remaining cases again. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-17bpf: fix liveness propagation to parent in spilled stack slotsDaniel Borkmann
Using parent->regs[] when propagating REG_LIVE_READ for spilled regs doesn't work since parent->regs[] denote the set of normal registers but not spilled ones. Propagate to the correct regs. Fixes: dc503a8ad984 ("bpf/verifier: track liveness for pruning") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-17Merge branches 'doc.2017.08.17a', 'fixes.2017.08.17a', ↵Paul E. McKenney
'hotplug.2017.07.25b', 'misc.2017.08.17a', 'spin_unlock_wait_no.2017.08.17a', 'srcu.2017.07.27c' and 'torture.2017.07.24c' into HEAD doc.2017.08.17a: Documentation updates. fixes.2017.08.17a: RCU fixes. hotplug.2017.07.25b: CPU-hotplug updates. misc.2017.08.17a: Miscellaneous fixes outside of RCU (give or take conflicts). spin_unlock_wait_no.2017.08.17a: Remove spin_unlock_wait(). srcu.2017.07.27c: SRCU updates. torture.2017.07.24c: Torture-test updates.
2017-08-17locking: Remove spin_unlock_wait() generic definitionsPaul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes spin_unlock_wait() and related definitions from core code. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-17exit: Replace spin_unlock_wait() with lock/unlock pairPaul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in do_exit() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because the lock is a per-task lock, and this is happening only at task-exit time. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-17completion: Replace spin_unlock_wait() with lock/unlock pairPaul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in completion_done() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because the lock will be held only the wakeup happens really quickly. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-17membarrier: Provide expedited private commandMathieu Desnoyers
Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built from all runqueues for which current thread's mm is the same as the thread calling sys_membarrier. It executes faster than the non-expedited variant (no blocking). It also works on NOHZ_FULL configurations. Scheduler-wise, it requires a memory barrier before and after context switching between processes (which have different mm). The memory barrier before context switch is already present. For the barrier after context switch: * Our TSO archs can do RELEASE without being a full barrier. Look at x86 spin_unlock() being a regular STORE for example. But for those archs, all atomics imply smp_mb and all of them have atomic ops in switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full barrier. * From all weakly ordered machines, only ARM64 and PPC can do RELEASE, the rest does indeed do smp_mb(), so there the spin_unlock() is a full barrier and we're good. * ARM64 has a very heavy barrier in switch_to(), which suffices. * PPC just removed its barrier from switch_to(), but appears to be talking about adding something to switch_mm(). So add a smp_mb__after_unlock_lock() for now, until this is settled on the PPC side. Changes since v3: - Properly document the memory barriers provided by each architecture. Changes since v2: - Address comments from Peter Zijlstra, - Add smp_mb__after_unlock_lock() after finish_lock_switch() in finish_task_switch() to add the memory barrier we need after storing to rq->curr. This is much simpler than the previous approach relying on atomic_dec_and_test() in mmdrop(), which actually added a memory barrier in the common case of switching between userspace processes. - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full kernel, rather than having the whole membarrier system call returning -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full. Adapt the CMD_QUERY mask accordingly. Changes since v1: - move membarrier code under kernel/sched/ because it uses the scheduler runqueue, - only add the barrier when we switch from a kernel thread. The case where we switch from a user-space thread is already handled by the atomic_dec_and_test() in mmdrop(). - add a comment to mmdrop() documenting the requirement on the implicit memory barrier. CC: Peter Zijlstra <peterz@infradead.org> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com> CC: Boqun Feng <boqun.feng@gmail.com> CC: Andrew Hunter <ahh@google.com> CC: Maged Michael <maged.michael@gmail.com> CC: gromer@google.com CC: Avi Kivity <avi@scylladb.com> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: Paul Mackerras <paulus@samba.org> CC: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Dave Watson <davejwatson@fb.com>
2017-08-17rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()Paul E. McKenney
The rcu_idle_exit() and rcu_idle_enter() functions are exported because they were originally used by RCU_NONIDLE(), which was intended to be usable from modules. However, RCU_NONIDLE() now instead uses rcu_irq_enter_irqson() and rcu_irq_exit_irqson(), which are not exported, and there have been no complaints. This commit therefore removes the exports from rcu_idle_exit() and rcu_idle_enter(). Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17rcu: Add warning to rcu_idle_enter() for irqs enabledPaul E. McKenney
All current callers of rcu_idle_enter() have irqs disabled, and rcu_idle_enter() relies on this, but doesn't check. This commit therefore adds a RCU_LOCKDEP_WARN() to add some verification to the trust. While we are there, pass "true" rather than "1" to rcu_eqs_enter(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17rcu: Make rcu_idle_enter() rely on callers disabling irqsPeter Zijlstra (Intel)
All callers to rcu_idle_enter() have irqs disabled, so there is no point in rcu_idle_enter disabling them again. This commit therefore replaces the irq disabling with a RCU_LOCKDEP_WARN(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17rcu: Add assertions verifying blocked-tasks listPaul E. McKenney
This commit adds assertions verifying the consistency of the rcu_node structure's ->blkd_tasks list and its ->gp_tasks, ->exp_tasks, and ->boost_tasks pointers. In particular, the ->blkd_tasks lists must be empty except for leaf rcu_node structures. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit()Masami Hiramatsu
Set disable_rcu_irq_enter on not only rcu_eqs_enter_common() but also rcu_eqs_exit(), since rcu_eqs_exit() suffers from the same issue as was fixed for rcu_eqs_enter_common() by commit 03ecd3f48e57 ("rcu/tracing: Add rcu_disabled to denote when rcu_irq_enter() will not work"). Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17rcu: Add TPS() protection for _rcu_barrier_trace stringsPaul E. McKenney
The _rcu_barrier_trace() function is a wrapper for trace_rcu_barrier(), which needs TPS() protection for strings passed through the second argument. However, it has escaped prior TPS()-ification efforts because it _rcu_barrier_trace() does not start with "trace_". This commit therefore adds the needed TPS() protection Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-17rcu: Use idle versions of swait to make idle-hack clearLuis R. Rodriguez
These RCU waits were set to use interruptible waits to avoid the kthreads contributing to system load average, even though they are not interruptible as they are spawned from a kthread. Use the new TASK_IDLE swaits which makes our goal clear, and removes confusion about these paths possibly being interruptible -- they are not. When the system is idle the RCU grace-period kthread will spend all its time blocked inside the swait_event_interruptible(). If the interruptible() was not used, then this kthread would contribute to the load average. This means that an idle system would have a load average of 2 (or 3 if PREEMPT=y), rather than the load average of 0 that almost fifty years of UNIX has conditioned sysadmins to expect. The same argument applies to swait_event_interruptible_timeout() use. The RCU grace-period kthread spends its time blocked inside this call while waiting for grace periods to complete. In particular, if there was only one busy CPU, but that CPU was frequently invoking call_rcu(), then the RCU grace-period kthread would spend almost all its time blocked inside the swait_event_interruptible_timeout(). This would mean that the load average would be 2 rather than the expected 1 for the single busy CPU. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17rcu: Add event tracing to ->gp_tasks update at GP startPaul E. McKenney
There is currently event tracing to track when a task is preempted within a preemptible RCU read-side critical section, and also when that task subsequently reaches its outermost rcu_read_unlock(), but none indicating when a new grace period starts when that grace period must wait on pre-existing readers that have been been preempted at least once since the beginning of their current RCU read-side critical sections. This commit therefore adds an event trace at grace-period start in the case where there are such readers. Note that only the first reader in the list is traced. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-17rcu: Move rcu.h to new trivial-function stylePaul E. McKenney
This commit saves a few lines in kernel/rcu/rcu.h by moving to single-line definitions for trivial functions, instead of the old style where the two curly braces each get their own line. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17rcu: Add TPS() to event-traced stringsPaul E. McKenney
Strings used in event tracing need to be specially handled, for example, using the TPS() macro. Without the TPS() macro, although output looks fine from within a running kernel, extracting traces from a crash dump produces garbage instead of strings. This commit therefore adds the TPS() macro to some unadorned strings that were passed to event-tracing macros. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-17rcu: Create reasonable API for do_exit() TASKS_RCU processingPaul E. McKenney
Currently, the exit-time support for TASKS_RCU is open-coded in do_exit(). This commit creates exit_tasks_rcu_start() and exit_tasks_rcu_finish() APIs for do_exit() use. This has the benefit of confining the use of the tasks_rcu_exit_srcu variable to one file, allowing it to become static. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17rcu: Drive TASKS_RCU directly off of PREEMPTPaul E. McKenney
The actual use of TASKS_RCU is only when PREEMPT, otherwise RCU-sched is used instead. This commit therefore makes synchronize_rcu_tasks() and call_rcu_tasks() available always, but mapped to synchronize_sched() and call_rcu_sched(), respectively, when !PREEMPT. This approach also allows some #ifdefs to be removed from rcutorture. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org>
2017-08-17locking/lockdep: Explicitly initialize wq_barrier::done::mapBoqun Feng
With the new lockdep crossrelease feature, which checks completions usage, a false positive is reported in the workqueue code: > Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released > Task B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released > Task C : acquired of lock#3 -> wait for completion of barr->done > (Task C is in lru_add_drain_all_cpuslocked()) > Worker D : wait for wfc.work to be released -> will complete barr->done Such a dead lock can not happen because Task C's barr->done and Worker D's barr->done can not be the same instance. The reason of this false positive is we initialize all wq_barrier::done at insert_wq_barrier() via init_completion(), which makes them belong to the same lock class, therefore, impossible circles are reported. To fix this, explicitly initialize the lockdep map for wq_barrier::done in insert_wq_barrier(), so that the lock class key of wq_barrier::done is a subkey of the corresponding work_struct, as a result we won't build a dependency between a wq_barrier with a unrelated work, and we can differ wq barriers based on the related works, so the false positive above is avoided. Also define the empty lockdep_init_map_crosslock() for !CROSSRELEASE to make the code simple and away from unnecessary #ifdefs. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17locking/refcounts, x86/asm: Implement fast refcount overflow protectionKees Cook
This implements refcount_t overflow protection on x86 without a noticeable performance impact, though without the fuller checking of REFCOUNT_FULL. This is done by duplicating the existing atomic_t refcount implementation but with normally a single instruction added to detect if the refcount has gone negative (e.g. wrapped past INT_MAX or below zero). When detected, the handler saturates the refcount_t to INT_MIN / 2. With this overflow protection, the erroneous reference release that would follow a wrap back to zero is blocked from happening, avoiding the class of refcount-overflow use-after-free vulnerabilities entirely. Only the overflow case of refcounting can be perfectly protected, since it can be detected and stopped before the reference is freed and left to be abused by an attacker. There isn't a way to block early decrements, and while REFCOUNT_FULL stops increment-from-zero cases (which would be the state _after_ an early decrement and stops potential double-free conditions), this fast implementation does not, since it would require the more expensive cmpxchg loops. Since the overflow case is much more common (e.g. missing a "put" during an error path), this protection provides real-world protection. For example, the two public refcount overflow use-after-free exploits published in 2016 would have been rendered unexploitable: http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/ http://cyseclabs.com/page?n=02012016 This implementation does, however, notice an unchecked decrement to zero (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it resulted in a zero). Decrements under zero are noticed (since they will have resulted in a negative value), though this only indicates that a use-after-free may have already happened. Such notifications are likely avoidable by an attacker that has already exploited a use-after-free vulnerability, but it's better to have them reported than allow such conditions to remain universally silent. On first overflow detection, the refcount value is reset to INT_MIN / 2 (which serves as a saturation value) and a report and stack trace are produced. When operations detect only negative value results (such as changing an already saturated value), saturation still happens but no notification is performed (since the value was already saturated). On the matter of races, since the entire range beyond INT_MAX but before 0 is negative, every operation at INT_MIN / 2 will trap, leaving no overflow-only race condition. As for performance, this implementation adds a single "js" instruction to the regular execution flow of a copy of the standard atomic_t refcount operations. (The non-"and_test" refcount_dec() function, which is uncommon in regular refcount design patterns, has an additional "jz" instruction to detect reaching exactly zero.) Since this is a forward jump, it is by default the non-predicted path, which will be reinforced by dynamic branch prediction. The result is this protection having virtually no measurable change in performance over standard atomic_t operations. The error path, located in .text.unlikely, saves the refcount location and then uses UD0 to fire a refcount exception handler, which resets the refcount, handles reporting, and returns to regular execution. This keeps the changes to .text size minimal, avoiding return jumps and open-coded calls to the error reporting routine. Example assembly comparison: refcount_inc() before: .text: ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp) refcount_inc() after: .text: ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp) ffffffff8154614d: 0f 88 80 d5 17 00 js ffffffff816c36d3 ... .text.unlikely: ffffffff816c36d3: 48 8d 4d f4 lea -0xc(%rbp),%rcx ffffffff816c36d7: 0f ff (bad) These are the cycle counts comparing a loop of refcount_inc() from 1 to INT_MAX and back down to 0 (via refcount_dec_and_test()), between unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL (refcount_t-full), and this overflow-protected refcount (refcount_t-fast): 2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s: cycles protections atomic_t 82249267387 none refcount_t-fast 82211446892 overflow, untested dec-to-zero refcount_t-full 144814735193 overflow, untested dec-to-zero, inc-from-zero This code is a modified version of the x86 PAX_REFCOUNT atomic_t overflow defense from the last public patch of PaX/grsecurity, based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Thanks to PaX Team for various suggestions for improvement for repurposing this code to be a refcount-only protection. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Hans Liljestrand <ishkamiel@gmail.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Jann Horn <jannh@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arozansk@redhat.com Cc: axboe@kernel.dk Cc: kernel-hardening@lists.openwall.com Cc: linux-arch <linux-arch@vger.kernel.org> Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17Merge branch 'linus' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-16Merge tag 'audit-pr-20170816' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fixes from Paul Moore: "Two small fixes to the audit code, both explained well in the respective patch descriptions, but the quick summary is one use-after-free fix, and one silly fanotify notification flag fix" * tag 'audit-pr-20170816' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: Receive unmount event audit: Fix use after free in audit_remove_watch_rule()
2017-08-16bpf: sock_map fixes for !CONFIG_BPF_SYSCALL and !STREAM_PARSERJohn Fastabend
Resolve issues with !CONFIG_BPF_SYSCALL and !STREAM_PARSER net/core/filter.c: In function ‘do_sk_redirect_map’: net/core/filter.c:1881:3: error: implicit declaration of function ‘__sock_map_lookup_elem’ [-Werror=implicit-function-declaration] sk = __sock_map_lookup_elem(ri->map, ri->ifindex); ^ net/core/filter.c:1881:6: warning: assignment makes pointer from integer without a cast [enabled by default] sk = __sock_map_lookup_elem(ri->map, ri->ifindex); Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16bpf: sockmap state change warning fixJohn Fastabend
psock will uninitialized in default case we need to do the same psock lookup and check as in other branch. Fixes compile warning below. kernel/bpf/sockmap.c: In function ‘smap_state_change’: kernel/bpf/sockmap.c:156:21: warning: ‘psock’ may be used uninitialized in this function [-Wmaybe-uninitialized] struct smap_psock *psock; Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: David Miller <davem@davemloft.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16bpf: devmap: remove unnecessary value size checkJohn Fastabend
In the devmap alloc map logic we check to ensure that the sizeof the values are not greater than KMALLOC_MAX_SIZE. But, in the dev map case we ensure the value size is 4bytes earlier in the function because all values should be netdev ifindex values. The second check is harmless but is not needed so remove it. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16bpf: add access to sock fields and pkt data from sk_skb programsJohn Fastabend
Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16bpf: sockmap with sk redirect supportJohn Fastabend
Recently we added a new map type called dev map used to forward XDP packets between ports (6093ec2dc313). This patches introduces a similar notion for sockets. A sockmap allows users to add participating sockets to a map. When sockets are added to the map enough context is stored with the map entry to use the entry with a new helper bpf_sk_redirect_map(map, key, flags) This helper (analogous to bpf_redirect_map in XDP) is given the map and an entry in the map. When called from a sockmap program, discussed below, the skb will be sent on the socket using skb_send_sock(). With the above we need a bpf program to call the helper from that will then implement the send logic. The initial site implemented in this series is the recv_sock hook. For this to work we implemented a map attach command to add attributes to a map. In sockmap we add two programs a parse program and a verdict program. The parse program uses strparser to build messages and pass them to the verdict program. The parse programs use the normal strparser semantics. The verdict program is of type SK_SKB. The verdict program returns a verdict SK_DROP, or SK_REDIRECT for now. Additional actions may be added later. When SK_REDIRECT is returned, expected when bpf program uses bpf_sk_redirect_map(), the sockmap logic will consult per cpu variables set by the helper routine and pull the sock entry out of the sock map. This pattern follows the existing redirect logic in cls and xdp programs. This gives the flow, recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock \ -> kfree_skb As an example use case a message based load balancer may use specific logic in the verdict program to select the sock to send on. Sample programs are provided in future patches that hopefully illustrate the user interfaces. Also selftests are in follow-on patches. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16bpf: export bpf_prog_inc_not_zeroJohn Fastabend
bpf_prog_inc_not_zero will be used by upcoming sockmap patches this patch simply exports it so we can pull it in. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16sched/completion: Document that reinit_completion() must be called after ↵Steven Rostedt
complete_all() The complete_all() function modifies the completion's "done" variable to UINT_MAX, and no other caller (wait_for_completion(), etc) will modify it back to zero. That means that any call to complete_all() must have a reinit_completion() before that completion can be used again. Document this fact by the complete_all() function. Also document that completion_done() will always return true if complete_all() is called. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170816131202.195c2f4b@gandalf.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-16Merge branch 'irq/for-gpio' into irq/coreThomas Gleixner
Merge the irq simulator which is in a separate branch so it can be consumed by the gpio folks.
2017-08-16genirq/irq_sim: Add a devres variant of irq_sim_init()Bartosz Golaszewski
Add a resource managed version of irq_sim_init(). This can be conveniently used in device drivers. Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-doc@vger.kernel.org Cc: linux-gpio@vger.kernel.org Cc: Bamvor Jian Zhang <bamvor.zhangjian@linaro.org> Cc: Jonathan Cameron <jic23@kernel.org> Link: http://lkml.kernel.org/r/20170814145318.6495-3-brgl@bgdev.pl Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-08-16genirq/irq_sim: Add a simple interrupt simulator frameworkBartosz Golaszewski
Implement a simple, irq_work-based framework for simulating interrupts. Currently the API exposes routines for initializing and deinitializing the simulator object, enqueueing the interrupts and retrieving the allocated interrupt numbers based on the offset of the dummy interrupt in the simulator struct. Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-doc@vger.kernel.org Cc: linux-gpio@vger.kernel.org Cc: Bamvor Jian Zhang <bamvor.zhangjian@linaro.org> Cc: Jonathan Cameron <jic23@kernel.org> Link: http://lkml.kernel.org/r/20170814145318.6495-2-brgl@bgdev.pl Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-08-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller