summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2015-10-06rcu: Move preemption disabling out of __srcu_read_lock()Paul E. McKenney
Currently, __srcu_read_lock() cannot be invoked from restricted environments because it contains calls to preempt_disable() and preempt_enable(), both of which can invoke lockdep, which is a bad idea in some restricted execution modes. This commit therefore moves the preempt_disable() and preempt_enable() from __srcu_read_lock() to srcu_read_lock(). It also inserts the preempt_disable() and preempt_enable() around the call to __srcu_read_lock() in do_exit(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06rcu: Add online/offline info to stall warning messagePaul E. McKenney
This commit makes the RCU CPU stall warning message print online/offline indications immediately after a hyphen following the CPU number. A "O" indicates that the global CPU-hotplug system believes that the CPU is online, a "o" that RCU perceived the CPU to be online at the beginning of the current expedited grace period, and an "N" that RCU currently believes that it will perceive the CPU as being online at the beginning of the next expedited grace period, with "." otherwise for all three indications. So for CPU 10, you would normally see "10-OoN:" indicating that everything believes that the CPU is online. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-10-06rcu: Eliminate panic when silly boot-time fanout specifiedPaul E. McKenney
This commit loosens rcutree.rcu_fanout_leaf range checks and replaces a panic() with a fallback to compile-time values. This fallback is accompanied by a WARN_ON(), and both occur when the rcutree.rcu_fanout_leaf value is too small to accommodate the number of CPUs. For example, given the current four-level limit for the rcu_node tree, a system with more than 16 CPUs built with CONFIG_FANOUT=2 must have rcutree.rcu_fanout_leaf larger than 2. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06rcu: Don't disable preemption for Tiny and Tree RCU readersBoqun Feng
Because preempt_disable() maps to barrier() for non-debug builds, it forces the compiler to spill and reload registers. Because Tree RCU and Tiny RCU now only appear in CONFIG_PREEMPT=n builds, these barrier() instances generate needless extra code for each instance of rcu_read_lock() and rcu_read_unlock(). This extra code slows down Tree RCU and bloats Tiny RCU. This commit therefore removes the preempt_disable() and preempt_enable() from the non-preemptible implementations of __rcu_read_lock() and __rcu_read_unlock(), respectively. However, for debug purposes, preempt_disable() and preempt_enable() are still invoked if CONFIG_PREEMPT_COUNT=y, because this allows detection of sleeping inside atomic sections in non-preemptible kernels. However, Tiny and Tree RCU operates by coalescing all RCU read-side critical sections on a given CPU that lie between successive quiescent states. It is therefore necessary to compensate for removing barriers from __rcu_read_lock() and __rcu_read_unlock() by adding them to a couple of the RCU functions invoked during quiescent states, namely to rcu_all_qs() and rcu_note_context_switch(). However, note that the latter is more paranoia than necessity, at least until link-time optimizations become more aggressive. This is based on an earlier patch by Paul E. McKenney, fixing a bug encountered in kernels built with CONFIG_PREEMPT=n and CONFIG_PREEMPT_COUNT=y. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2015-10-06rcu: Use call_rcu_func_t to replace explicit type equivalentsBoqun Feng
We have had the call_rcu_func_t typedef for a quite awhile, but we still use explicit function pointer types in some places. These types can confuse cscope and can be hard to read. This patch therefore replaces these types with the call_rcu_func_t typedef. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06rcu: Use rcu_callback_t in call_rcu*() and friendsBoqun Feng
As we now have rcu_callback_t typedefs as the type of rcu callbacks, we should use it in call_rcu*() and friends as the type of parameters. This could save us a few lines of code and make it clear which function requires an rcu callbacks rather than other callbacks as its argument. Besides, this can also help cscope to generate a better database for code reading. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06sched: Export sched_setscheduler_nocheckDavidlohr Bueso
The new locktorture rtmutex_lock tests exercise priority boosting, which means that they need to set some tasks to real-time priority. To do this, they use sched_setscheduler_nocheck(). However, this is not exported to modules, which results in the following error when building locktorture as a module: ERROR: "sched_setscheduler_nocheck" [kernel/locking/locktorture.ko] undefined! This commit therefore adds an EXPORT_SYMBOL_GPL() to allow this function to be invoked from locktorture when built as a module. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2015-10-06locking/rwsem: Use acquire/release semanticsDavidlohr Bueso
As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations) and 6d79ef2d30e (locking, asm-generic: Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly ordered archs can benefit from more relaxed use of barriers when locking and unlocking, instead of regular full barrier semantics. While currently only arm64 supports such optimizations, updating corresponding locking primitives serves for other archs to immediately benefit as well, once the necessary machinery is implemented of course. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E.McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443643395-17016-6-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06locking/mcs: Use acquire/release semanticsDavidlohr Bueso
As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations) and 6d79ef2d30e (locking, asm-generic: Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly ordered archs can benefit from more relaxed use of barriers when locking and unlocking, instead of regular full barrier semantics. While currently only arm64 supports such optimizations, updating corresponding locking primitives serves for other archs to immediately benefit as well, once the necessary machinery is implemented of course. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E.McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443643395-17016-5-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06locking/rtmutex: Use acquire/release semanticsDavidlohr Bueso
As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations) and 6d79ef2d30e (locking, asm-generic: Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly ordered archs can benefit from more relaxed use of barriers when locking and unlocking, instead of regular full barrier semantics. While currently only arm64 supports such optimizations, updating corresponding locking primitives serves for other archs to immediately benefit as well, once the necessary machinery is implemented of course. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E.McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443643395-17016-4-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06locking/mutex: Use acquire/release semanticsDavidlohr Bueso
As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations) and 6d79ef2d30e (locking, asm-generic: Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly ordered archs can benefit from more relaxed use of barriers when locking and unlocking, instead of regular full barrier semantics. While currently only arm64 supports such optimizations, updating corresponding locking primitives serves for other archs to immediately benefit as well, once the necessary machinery is implemented of course. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul E.McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443643395-17016-3-git-send-email-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06Merge tag 'v4.3-rc4' into locking/core, to pick up fixes before applying new ↵Ingo Molnar
changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Remove a parameter in the migrate_task_rq() functionxiaofeng.yan
The parameter "int next_cpu" in the following function is unused: migrate_task_rq(struct task_struct *p, int next_cpu) Remove it. Signed-off-by: xiaofeng.yan <yanxiaofeng@inspur.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1442991360-31945-1-git-send-email-yanxiaofeng@inspur.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Drop unlikely behind BUG_ON()Geliang Tang
(1) For !CONFIG_BUG cases, the bug call is a no-op, so we couldn't care less and the change is ok. (2) PPC and MIPS, which HAVE_ARCH_BUG_ON, do not rely on branch predictions as it seems to be pointless [1] and thus callers should not be trying to push an optimization in the first place. (3) For CONFIG_BUG and !HAVE_ARCH_BUG_ON cases, BUG_ON() contains an unlikely compiler flag already. Hence, we can drop unlikely behind BUG_ON(). [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.3/02289.html Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/6fa7125979f98bbeac26e268271769b6ca935c8d.1444051018.git.geliangtang@163.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Fix task and run queue sched_info::run_delay inconsistenciesPeter Zijlstra
Mike Meyer reported the following bug: > During evaluation of some performance data, it was discovered thread > and run queue run_delay accounting data was inconsistent with the other > accounting data that was collected. Further investigation found under > certain circumstances execution time was leaking into the task and > run queue accounting of run_delay. > > Consider the following sequence: > > a. thread is running. > b. thread moves beween cgroups, changes scheduling class or priority. > c. thread sleeps OR > d. thread involuntarily gives up cpu. > > a. implies: > > thread->sched_info.last_queued = 0 > > a. and b. results in the following: > > 1. dequeue_task(rq, thread) > > sched_info_dequeued(rq, thread) > delta = 0 > > sched_info_reset_dequeued(thread) > thread->sched_info.last_queued = 0 > > thread->sched_info.run_delay += delta > > 2. enqueue_task(rq, thread) > > sched_info_queued(rq, thread) > > /* thread is still on cpu at this point. */ > thread->sched_info.last_queued = task_rq(thread)->clock; > > c. results in: > > dequeue_task(rq, thread) > > sched_info_dequeued(rq, thread) > > /* delta is execution time not run_delay. */ > delta = task_rq(thread)->clock - thread->sched_info.last_queued > > sched_info_reset_dequeued(thread) > thread->sched_info.last_queued = 0 > > thread->sched_info.run_delay += delta > > Since thread was running between enqueue_task(rq, thread) and > dequeue_task(rq, thread), the delta above is really execution > time and not run_delay. > > d. results in: > > __sched_info_switch(thread, next_thread) > > sched_info_depart(rq, thread) > > sched_info_queued(rq, thread) > > /* last_queued not updated due to being non-zero */ > return > > Since thread was running between enqueue_task(rq, thread) and > __sched_info_switch(thread, next_thread), the execution time > between enqueue_task(rq, thread) and > __sched_info_switch(thread, next_thread) now will become > associated with run_delay due to when last_queued was last updated. > This alternative patch solves the problem by not calling sched_info_{de,}queued() in {de,en}queue_task(). Therefore the sched_info state is preserved and things work as expected. By inlining the {de,en}queue_task() functions the new condition becomes (mostly) a compile-time constant and we'll not emit any new branch instructions. It even shrinks the code (due to inlining {en,de}queue_task()): $ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig text data bss dec hex filename 64019 23378 2344 89741 15e8d defconfig-build/kernel/sched/core.o 64149 23378 2344 89871 15f0f defconfig-build/kernel/sched/core.o.orig Reported-by: Mike Meyer <Mike.Meyer@Teradata.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/numa: Fix task_tick_fair() from disabling numa_balancingSrikar Dronamraju
If static branch 'sched_numa_balancing' is enabled, it should kickstart NUMA balancing through task_tick_numa(). However the following commit: 2a595721a1fa ("sched/numa: Convert sched_numa_balancing to a static_branch") erroneously disables this. Fix this anomaly by enabling task_tick_numa() when the static branch 'sched_numa_balancing' is enabled. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1443752305-27413-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Add preempt_count invariant checkPeter Zijlstra
Ingo requested I keep my debug check for the preempt_count invariant. Requested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: More notrace annotationsPeter Zijlstra
preempt_schedule_common() is marked notrace, but it does not use _notrace() preempt_count functions and __schedule() is also not marked notrace, which means that its perfectly possible to end up in the tracer from preempt_schedule_common(). Steve says: | Yep, there's some history to this. This was originally the issue that | caused function tracing to go into infinite recursion. But now we have | preempt_schedule_notrace(), which is used by the function tracer, and | that function must not be traced till preemption is disabled. | | Now if function tracing is running and we take an interrupt when | NEED_RESCHED is set, it calls | | preempt_schedule_common() (not traced) | | But then that calls preempt_disable() (traced) | | function tracer calls preempt_disable_notrace() followed by | preempt_enable_notrace() which will see NEED_RESCHED set, and it will | call preempt_schedule_notrace(), which stops the recursion, but | still calls __schedule() here, and that means when we return, we call | the __schedule() from preempt_schedule_common(). | | That said, I prefer this patch. Preemption is disabled before calling | __schedule(), and we get rid of a one round recursion with the | scheduler. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Simplify preempt_count testsPeter Zijlstra
Since we stopped setting PREEMPT_ACTIVE, there is no need to mask it out of preempt_count() tests. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Robustify preemption leak checksPeter Zijlstra
When we warn about a preempt_count leak; reset the preempt_count to the known good value such that the problem does not ripple forward. This is most important on x86 which has a per cpu preempt_count that is not saved/restored (after this series). So if you schedule with an invalid (!2*PREEMPT_DISABLE_OFFSET) preempt_count the next task is messed up too. Enforcing this invariant limits the borkage to just the one task. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Stop setting PREEMPT_ACTIVEPeter Zijlstra
Now that nothing tests for PREEMPT_ACTIVE anymore, stop setting it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Fix trace_sched_switch()Peter Zijlstra
__trace_sched_switch_state() is the last remaining PREEMPT_ACTIVE user, move trace_sched_switch() from prepare_task_switch() to __schedule() and propagate the @preempt argument. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Add preempt argument to __schedule()Peter Zijlstra
There is only a single PREEMPT_ACTIVE use in the regular __schedule() path and that is to circumvent the task->state check. Since the code setting PREEMPT_ACTIVE is the immediate caller of __schedule() we can replace this with a function argument. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Create preempt_count invariantPeter Zijlstra
Assuming units of PREEMPT_DISABLE_OFFSET for preempt_count() numbers. Now that TASK_DEAD no longer results in preempt_count() == 3 during scheduling, we will always call context_switch() with preempt_count() == 2. However, we don't always end up with preempt_count() == 2 in finish_task_switch() because new tasks get created with preempt_count() == 1. Create FORK_PREEMPT_COUNT and set it to 2 and use that in the right places. Note that we cannot use INIT_PREEMPT_COUNT as that serves another purpose (boot). After this, preempt_count() is invariant across the context switch, with exception of PREEMPT_ACTIVE. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Rework TASK_DEAD preemption exceptionPeter Zijlstra
TASK_DEAD is special in that the final schedule call from do_exit() must be done with preemption disabled. This means we end up scheduling with a preempt_count() higher than usual (3 instead of the 'expected' 2). Since future patches will want to rely on an invariant preempt_count() value during schedule, fix this up. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06Merge branch 'sched/urgent' into sched/core, to pick up fixes before ↵Ingo Molnar
applying new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06sched/core: Fix TASK_DEAD race in finish_task_switch()Peter Zijlstra
So the problem this patch is trying to address is as follows: CPU0 CPU1 context_switch(A, B) ttwu(A) LOCK A->pi_lock A->on_cpu == 0 finish_task_switch(A) prev_state = A->state <-. WMB | A->on_cpu = 0; | UNLOCK rq0->lock | | context_switch(C, A) `-- A->state = TASK_DEAD prev_state == TASK_DEAD put_task_struct(A) context_switch(A, C) finish_task_switch(A) A->state == TASK_DEAD put_task_struct(A) The argument being that the WMB will allow the load of A->state on CPU0 to cross over and observe CPU1's store of A->state, which will then result in a double-drop and use-after-free. Now the comment states (and this was true once upon a long time ago) that we need to observe A->state while holding rq->lock because that will order us against the wakeup; however the wakeup will not in fact acquire (that) rq->lock; it takes A->pi_lock these days. We can obviously fix this by upgrading the WMB to an MB, but that is expensive, so we'd rather avoid that. The alternative this patch takes is: smp_store_release(&A->on_cpu, 0), which avoids the MB on some archs, but not important ones like ARM. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> # v3.1+ Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: manfred@colorfullife.com Cc: will.deacon@arm.com Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()") Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-05ebpf: include perf_event only where really neededDaniel Borkmann
Commit ea317b267e9d ("bpf: Add new bpf map type to store the pointer to struct perf_event") added perf_event.h to the main eBPF header, so it gets included for all users. perf_event.h is actually only needed from array map side, so lets sanitize this a bit. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Kaixu Xia <xiakaixu@huawei.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05bpf, seccomp: prepare for upcoming criu supportDaniel Borkmann
The current ongoing effort to dump existing cBPF seccomp filters back to user space requires to hold the pre-transformed instructions like we do in case of socket filters from sk_attach_filter() side, so they can be reloaded in original form at a later point in time by utilities such as criu. To prepare for this, simply extend the bpf_prog_create_from_user() API to hold a flag that tells whether we should store the original or not. Also, fanout filters could make use of that in future for things like diag. While fanout filters already use bpf_prog_destroy(), move seccomp over to them as well to handle original programs when present. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Tycho Andersen <tycho.andersen@canonical.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Tested-by: Tycho Andersen <tycho.andersen@canonical.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-04Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "This update contains: - Fix for a long standing race affecting /proc/irq/NNN - One line fix for ARM GICV3-ITS counting the wrong data - Warning silencing in ARM GICV3-ITS. Another GCC trying to be overly clever issue" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Count additional LPIs for the aliased devices irqchip/gic-v3-its: Silence warning when its_lpi_alloc_chunks gets inlined genirq: Fix race in register_irq_proc()
2015-10-04debugfs: Pass bool pointer to debugfs_create_bool()Viresh Kumar
Its a bit odd that debugfs_create_bool() takes 'u32 *' as an argument, when all it needs is a boolean pointer. It would be better to update this API to make it accept 'bool *' instead, as that will make it more consistent and often more convenient. Over that bool takes just a byte. That required updates to all user sites as well, in the same commit updating the API. regmap core was also using debugfs_{read|write}_file_bool(), directly and variable types were updated for that to be bool as well. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Mark Brown <broonie@kernel.org> Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-03Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "An abs64() fix in the watchdog driver, and two clocksource driver NO_IRQ assumption fixes" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Fix abs() usage w/ 64bit values clocksource/drivers/keystone: Fix bad NO_IRQ usage clocksource/drivers/rockchip: Fix bad NO_IRQ usage
2015-10-03sched, bpf: add helper for retrieving routing realmsDaniel Borkmann
Using routing realms as part of the classifier is quite useful, it can be viewed as a tag for one or multiple routing entries (think of an analogy to net_cls cgroup for processes), set by user space routing daemons or via iproute2 as an indicator for traffic classifiers and later on processed in the eBPF program. Unlike actions, the classifier can inspect device flags and enable netif_keep_dst() if necessary. tc actions don't have that possibility, but in case people know what they are doing, it can be used from there as well (e.g. via devs that must keep dsts by design anyway). If a realm is set, the handler returns the non-zero realm. User space can set the full 32bit realm for the dst. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-03ebpf: migrate bpf_prog's flags to bitfieldDaniel Borkmann
As we need to add further flags to the bpf_prog structure, lets migrate both bools to a bitfield representation. The size of the base structure (excluding insns) remains unchanged at 40 bytes. Add also tags for the kmemchecker, so that it doesn't throw false positives. Even in case gcc would generate suboptimal code, it's not being accessed in performance critical paths. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-02clocksource: Fix abs() usage w/ 64bit valuesJohn Stultz
This patch fixes one cases where abs() was being used with 64-bit nanosecond values, where the result may be capped at 32-bits. This potentially could cause watchdog false negatives on 32-bit systems, so this patch addresses the issue by using abs64(). Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1442279124-7309-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/dsa/slave.c net/dsa/slave.c simply had overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-01ntp: use timespec64 in sync_cmos_clockArnd Bergmann
The sync_cmos_clock has one use of struct timespec, which we want to eventually replace with timespec64 or similar in the kernel. There is no way this one can overflow, but the conversion to timespec64 is trivial and has no other dependencies. Acked-by: Richard Cochran <richardcochran@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-10-01ntp/pps: replace getnstime_raw_and_real with 64-bit versionArnd Bergmann
There is exactly one caller of getnstime_raw_and_real in the kernel, which is the pps_get_ts function. This changes the caller and the implementation to work on timespec64 types rather than timespec, to avoid the time_t overflow on 32-bit architectures. For consistency with the other new functions (ktime_get_seconds, ktime_get_real_*, ...), I'm renaming the function to ktime_get_raw_and_real_ts64. We still need to convert from the internal 64-bit type to 32 bit types in the caller, but this conversion is now pushed out from getnstime_raw_and_real to pps_get_ts. A follow-up patch changes the remaining pps code to completely avoid the conversion. Acked-by: Richard Cochran <richardcochran@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-10-01ntp/pps: use timespec64 for hardpps()Arnd Bergmann
There is only one user of the hardpps function in the kernel, so it makes sense to atomically change it over to using 64-bit timestamps for y2038 safety. In the hardpps implementation, we also need to change the pps_normtime structure, which is similar to struct timespec and also requires a 64-bit seconds portion. This introduces two temporary variables in pps_kc_event() to do the conversion, they will be removed again in the next step, which seemed preferable to having a larger patch changing it all at the same time. Acked-by: Richard Cochran <richardcochran@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2015-10-01Merge branch 'irq/for-arm' into irq/coreThomas Gleixner
Bring in the change which we offered arm[64] folks to pull into their trees.
2015-10-01ftrace: Remove redundant swap functionRasmus Villemoes
To cover the common case of sorting an array of pointers, Daniel Wagner recently modified the library sort() to use a specific swap function for size==8, in addition to the size==4 case which was already handled. Since sizeof(long) is either 4 or 8, ftrace_swap_ips() is redundant and we can just let sort() pick an appropriate and fast swap callback. Link: http://lkml.kernel.org/r/1441834023-13130-1-git-send-email-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-10-01kvm/x86: Hyper-V HV_X64_MSR_VP_RUNTIME supportAndrey Smetanin
HV_X64_MSR_VP_RUNTIME msr used by guest to get "the time the virtual processor consumes running guest code, and the time the associated logical processor spends running hypervisor code on behalf of that guest." Calculation of this time is performed by task_cputime_adjusted() for vcpu task. Necessary to support loading of winhv.sys in guest, which in turn is required to support Windows VMBus. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Gleb Natapov <gleb@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01tracing: Use kstrdup_const instead of private implementationRasmus Villemoes
The kernel now has kstrdup_const/kfree_const for reusing .rodata (typically string literals) when possible; there's no reason to duplicate that logic in the tracing system. Moreover, as the comment above core_kernel_data states, it may not always return true for .rodata - that is for example the case on x86_64, where we thus end up kstrdup'ing all the passed-in strings. Arguably, testing for .rodata explicitly (as kstrdup_const does) is also more correct: I don't think one is supposed to be able to change the name after creating the event_subsystem by passing the address of a static char (but non-const) array. Link: http://lkml.kernel.org/r/1441833841-12955-1-git-send-email-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-10-01genirq: Introduce generic irq migration for cpu hotunplugYang Yingliang
ARM and ARM64 have almost identical code for migrating interrupts on cpu hotunplug. Provide a generic version which can be used by both. The new code addresses a shortcoming in the ARM[64] variants which fails to update the affinity change in some cases. The solution for this is to use the core function irq_do_set_affinity() instead of open coding it. [ tglx: Added copyright notice and license boilerplate. Rewrote subject and changelog. ] Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Acked-by: Russell King - ARM Linux <linux@arm.linux.org.uk> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Hanjun Guo <hanjun.guo@linaro.org> Cc: <linux-arm-kernel@lists.infradead.org> Link: http://lkml.kernel.org/r/1443087135-17044-2-git-send-email-yangyingliang@huawei.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-01genirq: Fix race in register_irq_proc()Ben Hutchings
Per-IRQ directories in procfs are created only when a handler is first added to the irqdesc, not when the irqdesc is created. In the case of a shared IRQ, multiple tasks can race to create a directory. This race condition seems to have been present forever, but is easier to hit with async probing. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Link: http://lkml.kernel.org/r/1443266636.2004.2.camel@decadent.org.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
2015-09-30tracing: Add trace options for tracer options to instancesSteven Rostedt (Red Hat)
Add the tracer options to instances options directory as well. Only add the options for tracers that are allowed to be enabled by an instance. But note, that tracer options are global. That is, tracer options enabled in an instance, also take affect at the top level and in other instances. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-30tracing: Add trace options for core options to instancesSteven Rostedt (Red Hat)
Allow instances to have their own options, at least for the core options (non tracer specific ones). There are a few global options that should not be added to instances, like enabling of trace_printk, and the sched comm recording, which do not have a specific trace instance associated to them. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-30tracing: Make ftrace_trace_stack() depend on general trace_array flagSteven Rostedt (Red Hat)
In preparation for the multi buffer instances to have their own trace_flags, the check in ftrace_trace_stack() needs to test the trace_array descriptor flag that is for the current event, not the global_trace descriptor. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-30tracing: Add a method to pass in trace_array descriptor to option filesSteven Rostedt (Red Hat)
In preparation of having the multi buffer instances having their own trace option flags, the trace option files needs a way to not only pass in the flag they represent, but also the trace_array descriptor. A new field is added to the trace_array descriptor called trace_flags_index, which is a 32 byte character array representing a bit. This array is simply filled with the index of the array, where index_array[n] = n; Then the address of this array is passed to the file callbacks instead of the index of the flag index. Then to retrieve both the flag index and the trace_array descriptor: data is the passed in argument. index = *(unsigned char *)data; data -= index; /* Now data points to the address of the array in the trace_array */ tr = container_of(data, struct trace_array, trace_flags_index); Suggested-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-30tracing: Move trace_flags from global to a trace_array fieldSteven Rostedt (Red Hat)
In preparation to make trace options per instance, the global trace_flags needs to be moved from being a global variable to a field within the trace instance trace_array structure. There's still more work to do, as there's some functions that use trace_flags without passing in a way to get to the current_trace array. For those, the global_trace is used directly (from trace.c). This includes setting and clearing the trace_flags. This means that when a new instance is created, it just gets the trace_flags of the global_trace and will not be able to modify them. Depending on the functions that have access to the trace_array, the flags of an instance may not affect parts of its trace, where the global_trace is used. These will be fixed in future changes. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>