summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2018-09-10locking/ww_mutex: Fix spelling mistake "cylic" -> "cyclic"Colin Ian King
Trivial fix to spelling mistake in pr_err() error message Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-janitors@vger.kernel.org Link: http://lkml.kernel.org/r/20180824112235.8842-1-colin.king@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/lockdep: Delete unnecessary #includeBen Hutchings
Commit: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage") added the inclusion of <trace/events/preemptirq.h>. liblockdep doesn't have a stub version of that header so now fails to build. However, commit: bff1b208a5d1 ("tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"") removed the use of functions declared in that header. So delete the #include. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin <alexander.levin@verizon.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Fixes: bff1b208a5d1 ("tracing: Partial revert of "tracing: Centralize ...") Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints ...") Link: http://lkml.kernel.org/r/20180828203315.GD18030@decadent.org.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10PM / sleep: Show freezing tasks that caused a suspend abortTodd Brandt
For debug purposes it would be nice to see which tasks caused a suspend abort, i.e. which tasks were still in the process of freezing when a wakeup event occurred. This patch adds the info to pm_debug_messages. Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-09-10locking/rwsem: Make owner store task pointer of last owning readerWaiman Long
Currently, when a reader acquires a lock, it only sets the RWSEM_READER_OWNED bit in the owner field. The other bits are simply not used. When debugging hanging cases involving rwsems and readers, the owner value does not provide much useful information at all. This patch modifies the current behavior to always store the task_struct pointer of the last rwsem-acquiring reader in a reader-owned rwsem. This may be useful in debugging rwsem hanging cases especially if only one reader is involved. However, the task in the owner field may not the real owner or one of the real owners at all when the owner value is examined, for example, in a crash dump. So it is just an additional hint about the past history. If CONFIG_DEBUG_RWSEMS=y is enabled, the owner field will be checked at unlock time too to make sure the task pointer value is valid. That does have a slight performance cost and so is only enabled as part of that debug option. From the performance point of view, it is expected that the changes shouldn't have any noticeable performance impact. A rwsem microbenchmark (with 48 worker threads and 1:1 reader/writer ratio) was ran on a 2-socket 24-core 48-thread Haswell system. The locking rates on a 4.19-rc1 based kernel were as follows: 1) Unpatched kernel: 543.3 kops/s 2) Patched kernel: 549.2 kops/s 3) Patched kernel (CONFIG_DEBUG_RWSEMS on): 546.6 kops/s There was actually a slight increase in performance (1.1%) in this particular case. Maybe it was caused by the elimination of a branch or just a testing noise. Turning on the CONFIG_DEBUG_RWSEMS option also had less than the expected impact on performance. The least significant 2 bits of the owner value are now used to designate the rwsem is readers owned and the owners are anonymous. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1536265114-10842-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/numa: Remove unused numa_stats::nr_running fieldVincent Guittot
nr_running in struct numa_stats is not used anywhere in the code. Remove it. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1535548752-4434-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/numa: Remove unused code from update_numa_stats()Vincent Guittot
With: commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()") the local variables 'smt', 'cpus' and 'capacity' and their results are not used anymore in numa_has_capacity() Remove this unused code. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1535548752-4434-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/debug: Explicitly cast sched_feat() to boolPeter Zijlstra
LLVM has a warning that tags expressions like: if (foo && non-bool-const) This pattern triggers for CONFIG_SCHED_DEBUG=n where sched_feat() ends up being whatever bit we select. Avoid the warning with an explicit cast to bool. Reported-by: Philipp Klocke <philipp97kl@gmail.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domainsMorten Rasmussen
The 'prefer sibling' sched_domain flag is intended to encourage spreading tasks to sibling sched_domain to take advantage of more caches and core for SMT systems. It has recently been changed to be on all non-NUMA topology level. However, spreading across domains with CPU capacity asymmetry isn't desirable, e.g. spreading from high capacity to low capacity CPUs even if high capacity CPUs aren't overutilized might give access to more cache but the CPU will be slower and possibly lead to worse overall throughput. To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain level immediately below SD_ASYM_CPUCAPACITY. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-13-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Don't move tasks to lower capacity CPUs unless necessaryChris Redpath
When lower capacity CPUs are load balancing and considering to pull something from a higher capacity group, we should not pull tasks from a CPU with only one task running as this is guaranteed to impede progress for that task. If there is more than one task running, load balance in the higher capacity group would have already made any possible moves to resolve imbalance and we should make better use of system compute capacity by moving a task if we still have more than one running. Signed-off-by: Chris Redpath <chris.redpath@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-11-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Set rq->rd->overload when misfitValentin Schneider
Idle balance is a great opportunity to pull a misfit task. However, there are scenarios where misfit tasks are present but idle balance is prevented by the overload flag. A good example of this is a workload of n identical tasks. Let's suppose we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly CPU-intensive tasks - for the sake of simplicity let's say they are just CPU hogs, even when running on big CPUs. They are identical tasks, so on an SMP system they should all end at (roughly) the same time. However, in our case the LITTLE CPUs are less performing than the big CPUs, so tasks running on the LITTLEs will have a longer completion time. This means that the big CPUs will complete their work earlier, at which point they should pull the tasks from the LITTLEs. What we want to happen is summarized as follows: a,b,c,d are our CPU-hogging tasks _ signifies idling LITTLE_0 | a a a a _ _ LITTLE_1 | b b b b _ _ ---------|------------- big_0 | c c c c a a big_1 | d d d d b b ^ ^ Tasks end on the big CPUs, idle balance happens and the misfit tasks are pulled straight away This however won't happen, because currently the overload flag is only set when there is any CPU that has more than one runnable task - which may very well not be the case here if our CPU-hogging workload is all there is to run. As such, this commit sets the overload flag in update_sg_lb_stats when a group is flagged as having a misfit task. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-10-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()Valentin Schneider
This variable can be read and set locklessly within update_sd_lb_stats(). As such, READ/WRITE_ONCE() are added to make sure nothing terribly wrong can happen because of the compiler. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-9-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/core: Change root_domain->overload type to intValentin Schneider
sizeof(_Bool) is implementation defined, so let's just go with 'int' as is done for other structures e.g. sched_domain_shared->has_idle_cores. The local 'overload' variable used in update_sd_lb_stats can remain bool, as it won't impact any struct layout and can be assigned to the root_domain field. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-8-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Change 'prefer_sibling' type to boolValentin Schneider
This variable is entirely local to update_sd_lb_stats, so we can safely change its type and slightly clean up its initialisation. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-7-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Kick nohz balance if rq->misfit_task_loadValentin Schneider
There already are a few conditions in nohz_kick_needed() to ensure a nohz kick is triggered, but they are not enough for some misfit task scenarios. Excluding asym packing, those are: - rq->nr_running >=2: Not relevant here because we are running a misfit task, it needs to be migrated regardless and potentially through active balance. - sds->nr_busy_cpus > 1: If there is only the misfit task being run on a group of low capacity CPUs, this will be evaluated to False. - rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here, misfit task needs to be migrated regardless of rt/IRQ pressure As such, this commit adds an rq->misfit_task_load condition to trigger a nohz kick. The idea to kick a nohz balance for misfit tasks originally came from Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for the Android Common Kernel - see: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-6-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Consider misfit tasks when load-balancingMorten Rasmussen
On asymmetric CPU capacity systems load intensive tasks can end up on CPUs that don't suit their compute demand. In this scenarios 'misfit' tasks should be migrated to CPUs with higher compute capacity to ensure better throughput. group_misfit_task indicates this scenario, but tweaks to the load-balance code are needed to make the migrations happen. Misfit balancing only makes sense between a source group of lower per-CPU capacity and destination group of higher compute capacity. Otherwise, misfit balancing is ignored. group_misfit_task has lowest priority so any imbalance due to overload is dealt with first. The modifications are: 1. Only pick a group containing misfit tasks as the busiest group if the destination group has higher capacity and has spare capacity. 2. When the busiest group is a 'misfit' group, skip the usual average load and group capacity checks. 3. Set the imbalance for 'misfit' balancing sufficiently high for a task to be pulled ignoring average load. 4. Pick the CPU with the highest misfit load as the source CPU. 5. If the misfit task is alone on the source CPU, go for active balancing. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-5-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Add sched_group per-CPU max capacityMorten Rasmussen
The current sg->min_capacity tracks the lowest per-CPU compute capacity available in the sched_group when rt/irq pressure is taken into account. Minimum capacity isn't the ideal metric for tracking if a sched_group needs offloading to another sched_group for some scenarios, e.g. a sched_group with multiple CPUs if only one is under heavy pressure. Tracking maximum capacity isn't perfect either but a better choice for some situations as it indicates that the sched_group definitely compute capacity constrained either due to rt/irq pressure on all CPUs or asymmetric CPU capacities (e.g. big.LITTLE). Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Add 'group_misfit_task' load-balance typeMorten Rasmussen
To maximize throughput in systems with asymmetric CPU capacities (e.g. ARM big.LITTLE) load-balancing has to consider task and CPU utilization as well as per-CPU compute capacity when load-balancing in addition to the current average load based load-balancing policy. Tasks with high utilization that are scheduled on a lower capacity CPU need to be identified and migrated to a higher capacity CPU if possible to maximize throughput. To implement this additional policy an additional group_type (load-balance scenario) is added: 'group_misfit_task'. This represents scenarios where a sched_group has one or more tasks that are not suitable for its per-CPU capacity. 'group_misfit_task' is only considered if the system is not overloaded or imbalanced ('group_imbalanced' or 'group_overloaded'). Identifying misfit tasks requires the rq lock to be held. To avoid taking remote rq locks to examine source sched_groups for misfit tasks, each CPU is responsible for tracking misfit tasks themselves and update the rq->misfit_task flag. This means checking task utilization when tasks are scheduled and on sched_tick. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/topology: Add static_key for asymmetric CPU capacity optimizationsMorten Rasmussen
The existing asymmetric CPU capacity code should cause minimal overhead for others. Putting it behind a static_key, it has been done for SMT optimizations, would make it easier to extend and improve without causing harm to others moving forward. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: gaku.inami.xh@renesas.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmussen@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/topology: Add SD_ASYM_CPUCAPACITY flag detectionMorten Rasmussen
The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the sched_domain in the hierarchy where all CPU capacities are visible for any CPU's point of view on asymmetric CPU capacity systems. The scheduler can then take to take capacity asymmetry into account when balancing at this level. It also serves as an indicator for how wide task placement heuristics have to search to consider all available CPU capacities as asymmetric systems might often appear symmetric at smallest level(s) of the sched_domain hierarchy. The flag has been around for while but so far only been set by out-of-tree code in Android kernels. One solution is to let each architecture provide the flag through a custom sched_domain topology array and associated mask and flag functions. However, SD_ASYM_CPUCAPACITY is special in the sense that it depends on the capacity and presence of all CPUs in the system, i.e. when hotplugging all CPUs out except those with one particular CPU capacity the flag should disappear even if the sched_domains don't collapse. Similarly, the flag is affected by cpusets where load-balancing is turned off. Detecting when the flags should be set therefore depends not only on topology information but also the cpuset configuration and hotplug state. The arch code doesn't have easy access to the cpuset configuration. Instead, this patch implements the flag detection in generic code where cpusets and hotplug state is already taken care of. All the arch is responsible for is to implement arch_scale_cpu_capacity() and force a full rebuild of the sched_domain hierarchy if capacities are updated, e.g. later in the boot process when cpufreq has initialized. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: valentin.schneider@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1532093554-30504-2-git-send-email-morten.rasmussen@arm.com [ Fixed 'CPU' capitalization. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Fix kernel-doc notation warningRandy Dunlap
Fix kernel-doc warning for missing 'flags' parameter description: ../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: ea14b57e8a18 ("sched/cpufreq: Provide migration hint") Link: http://lkml.kernel.org/r/cdda0d42-880d-4229-a9f7-5899c977a063@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/rwsem: Exit read lock slowpath if queue empty & no writerWaiman Long
It was discovered that a constant stream of readers with occassional writers pounding on a rwsem may cause many of the readers to enter the slowpath unnecessarily thus increasing latency and lowering performance. In the current code, a reader entering the slowpath critical section will unconditionally set the WAITING_BIAS, if not set yet, and clear its active count even if no one is in the wait queue and no writer is present. This causes some incoming readers to observe the presence of waiters in the wait queue and hence have to go into the slowpath themselves. With sufficient numbers of readers and a relatively short lock hold time, the WAITING_BIAS may be repeatedly turned on and off and a substantial portion of the readers will go into the slowpath sustaining a rather long queue in the wait queue spinlock and repeated WAITING_BIAS on/off cycle until the logjam is broken opportunistically. To avoid this situation from happening, an additional check is added to detect the special case that the reader in the critical section is the only one in the wait queue and no writer is present. When that happens, it can just exit the slowpath and return immediately as its active count has already been set in the lock. Other incoming readers won't observe the presence of waiters and so will not be forced into the slowpath. The issue was found in a customer site where they had an application that pounded on the pread64 syscalls heavily on an XFS filesystem. The application was run in a recent 4-socket boxes with a lot of CPUs. They saw significant spinlock contention in the rwsem_down_read_failed() call. With this patch applied, the system CPU usage went down from 85% to 57%, and the spinlock contention in the pread64 syscalls was gone. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Joe Mario <jmario@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1532459425-19204-1-git-send-email-longman@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operationsPeter Zijlstra
Weirdly we seem to have forgotten this... Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10Merge branch 'locking/urgent' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10jump_label: Fix typo in warning messageBorislav Petkov
There's no 'allocatote' - use the next best thing: 'allocate' :-) Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180907103521.31344-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Fix load_balance redo for !imbalanceVincent Guittot
It can happen that load_balance() finds a busiest group and then a busiest rq but the calculated imbalance is in fact 0. In such situation, detach_tasks() returns immediately and lets the flag LBF_ALL_PINNED set. The busiest CPU is then wrongly assumed to have pinned tasks and removed from the load balance mask. then, we redo a load balance without the busiest CPU. This creates wrong load balance situation and generates wrong task migration. If the calculated imbalance is 0, it's useless to try to find a busiest rq as no task will be migrated and we can return immediately. This situation can happen with heterogeneous system or smp system when RT tasks are decreasing the capacity of some CPUs. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dietmar.eggemann@arm.com Cc: jhugo@codeaurora.org Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Fix scale_rt_capacity() for SMTVincent Guittot
Since commit: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") scale_rt_capacity() returns the remaining capacity and not a scale factor to apply on cpu_capacity_orig. arch_scale_cpu() is directly called by scale_rt_capacity() so we must take the sched_domain argument. Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()") Link: http://lkml.kernel.org/r/20180904093626.GA23936@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/fair: Fix vruntime_normalized() for remote non-migration wakeupSteve Muckle
When a task which previously ran on a given CPU is remotely queued to wake up on that same CPU, there is a period where the task's state is TASK_WAKING and its vruntime is not normalized. This is not accounted for in vruntime_normalized() which will cause an error in the task's vruntime if it is switched from the fair class during this time. For example if it is boosted to RT priority via rt_mutex_setprio(), rq->min_vruntime will not be subtracted from the task's vruntime but it will be added again when the task returns to the fair class. The task's vruntime will have been erroneously doubled and the effective priority of the task will be reduced. Note this will also lead to inflation of all vruntimes since the doubled vruntime value will become the rq's min_vruntime when other tasks leave the rq. This leads to repeated doubling of the vruntime and priority penalty. Fix this by recognizing a WAKING task's vruntime as normalized only if sched_remote_wakeup is true. This indicates a migration, in which case the vruntime would have been normalized in migrate_task_rq_fair(). Based on a similar patch from John Dias <joaodias@google.com>. Suggested-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Steve Muckle <smuckle@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: John Dias <joaodias@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Miguel de Dios <migueldedios@google.com> Cc: Morten Rasmussen <Morten.Rasmussen@arm.com> Cc: Patrick Bellasi <Patrick.Bellasi@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: kernel-team@android.com Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration") Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/pelt: Fix update_blocked_averages() for RT and DL classesVincent Guittot
update_blocked_averages() is called to periodiccally decay the stalled load of idle CPUs and to sync all loads before running load balance. When cfs rq is idle, it trigs a load balance during pick_next_task_fair() in order to potentially pull tasks and to use this newly idle CPU. This load balance happens whereas prev task from another class has not been put and its utilization updated yet. This may lead to wrongly account running time as idle time for RT or DL classes. Test that no RT or DL task is running when updating their utilization in update_blocked_averages(). We still update RT and DL utilization instead of simply skipping them to make sure that all metrics are synced when used during load balance. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 371bf4273269 ("sched/rt: Add rt_rq utilization tracking") Fixes: 3727e0e16340 ("sched/dl: Add dl_rq utilization tracking") Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/topology: Set correct NUMA topology typeSrikar Dronamraju
With the following commit: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain") the scheduler introduced a new NUMA level. However this leads to the NUMA topology on 2 node systems to not be marked as NUMA_DIRECT anymore. After this commit, it gets reported as NUMA_BACKPLANE, because sched_domains_numa_level is now 2 on 2 node systems. Fix this by allowing setting systems that have up to 2 NUMA levels as NUMA_DIRECT. While here remove code that assumes that level can be 0. Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andre Wild <wild@linux.vnet.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org> Fixes: 051f3ca02e46 "Introduce NUMA identity node sched domain" Link: http://lkml.kernel.org/r/1533920419-17410-1-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10sched/debug: Fix potential deadlock when writing to sched_featuresJiada Wang
The following lockdep report can be triggered by writing to /sys/kernel/debug/sched_features: ====================================================== WARNING: possible circular locking dependency detected 4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18 Not tainted ------------------------------------------------------ sh/3358 is trying to acquire lock: 000000004ad3989d (cpu_hotplug_lock.rw_sem){++++}, at: static_key_enable+0x14/0x30 but task is already holding lock: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&sb->s_type->i_mutex_key#3){+.+.}: lock_acquire+0xb8/0x148 down_write+0xac/0x140 start_creating+0x5c/0x168 debugfs_create_dir+0x18/0x220 opp_debug_register+0x8c/0x120 _add_opp_dev+0x104/0x1f8 dev_pm_opp_get_opp_table+0x174/0x340 _of_add_opp_table_v2+0x110/0x760 dev_pm_opp_of_add_table+0x5c/0x240 dev_pm_opp_of_cpumask_add_table+0x5c/0x100 cpufreq_init+0x160/0x430 cpufreq_online+0x1cc/0xe30 cpufreq_add_dev+0x78/0x198 subsys_interface_register+0x168/0x270 cpufreq_register_driver+0x1c8/0x278 dt_cpufreq_probe+0xdc/0x1b8 platform_drv_probe+0xb4/0x168 driver_probe_device+0x318/0x4b0 __device_attach_driver+0xfc/0x1f0 bus_for_each_drv+0xf8/0x180 __device_attach+0x164/0x200 device_initial_probe+0x10/0x18 bus_probe_device+0x110/0x178 device_add+0x6d8/0x908 platform_device_add+0x138/0x3d8 platform_device_register_full+0x1cc/0x1f8 cpufreq_dt_platdev_init+0x174/0x1bc do_one_initcall+0xb8/0x310 kernel_init_freeable+0x4b8/0x56c kernel_init+0x10/0x138 ret_from_fork+0x10/0x18 -> #2 (opp_table_lock){+.+.}: lock_acquire+0xb8/0x148 __mutex_lock+0x104/0xf50 mutex_lock_nested+0x1c/0x28 _of_add_opp_table_v2+0xb4/0x760 dev_pm_opp_of_add_table+0x5c/0x240 dev_pm_opp_of_cpumask_add_table+0x5c/0x100 cpufreq_init+0x160/0x430 cpufreq_online+0x1cc/0xe30 cpufreq_add_dev+0x78/0x198 subsys_interface_register+0x168/0x270 cpufreq_register_driver+0x1c8/0x278 dt_cpufreq_probe+0xdc/0x1b8 platform_drv_probe+0xb4/0x168 driver_probe_device+0x318/0x4b0 __device_attach_driver+0xfc/0x1f0 bus_for_each_drv+0xf8/0x180 __device_attach+0x164/0x200 device_initial_probe+0x10/0x18 bus_probe_device+0x110/0x178 device_add+0x6d8/0x908 platform_device_add+0x138/0x3d8 platform_device_register_full+0x1cc/0x1f8 cpufreq_dt_platdev_init+0x174/0x1bc do_one_initcall+0xb8/0x310 kernel_init_freeable+0x4b8/0x56c kernel_init+0x10/0x138 ret_from_fork+0x10/0x18 -> #1 (subsys mutex#6){+.+.}: lock_acquire+0xb8/0x148 __mutex_lock+0x104/0xf50 mutex_lock_nested+0x1c/0x28 subsys_interface_register+0xd8/0x270 cpufreq_register_driver+0x1c8/0x278 dt_cpufreq_probe+0xdc/0x1b8 platform_drv_probe+0xb4/0x168 driver_probe_device+0x318/0x4b0 __device_attach_driver+0xfc/0x1f0 bus_for_each_drv+0xf8/0x180 __device_attach+0x164/0x200 device_initial_probe+0x10/0x18 bus_probe_device+0x110/0x178 device_add+0x6d8/0x908 platform_device_add+0x138/0x3d8 platform_device_register_full+0x1cc/0x1f8 cpufreq_dt_platdev_init+0x174/0x1bc do_one_initcall+0xb8/0x310 kernel_init_freeable+0x4b8/0x56c kernel_init+0x10/0x138 ret_from_fork+0x10/0x18 -> #0 (cpu_hotplug_lock.rw_sem){++++}: __lock_acquire+0x203c/0x21d0 lock_acquire+0xb8/0x148 cpus_read_lock+0x58/0x1c8 static_key_enable+0x14/0x30 sched_feat_write+0x314/0x428 full_proxy_write+0xa0/0x138 __vfs_write+0xd8/0x388 vfs_write+0xdc/0x318 ksys_write+0xb4/0x138 sys_write+0xc/0x18 __sys_trace_return+0x0/0x4 other info that might help us debug this: Chain exists of: cpu_hotplug_lock.rw_sem --> opp_table_lock --> &sb->s_type->i_mutex_key#3 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sb->s_type->i_mutex_key#3); lock(opp_table_lock); lock(&sb->s_type->i_mutex_key#3); lock(cpu_hotplug_lock.rw_sem); *** DEADLOCK *** 2 locks held by sh/3358: #0: 00000000a8c4b363 (sb_writers#10){.+.+}, at: vfs_write+0x238/0x318 #1: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428 stack backtrace: CPU: 5 PID: 3358 Comm: sh Not tainted 4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18 Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT) Call trace: dump_backtrace+0x0/0x288 show_stack+0x14/0x20 dump_stack+0x13c/0x1ac print_circular_bug.isra.10+0x270/0x438 check_prev_add.constprop.16+0x4dc/0xb98 __lock_acquire+0x203c/0x21d0 lock_acquire+0xb8/0x148 cpus_read_lock+0x58/0x1c8 static_key_enable+0x14/0x30 sched_feat_write+0x314/0x428 full_proxy_write+0xa0/0x138 __vfs_write+0xd8/0x388 vfs_write+0xdc/0x318 ksys_write+0xb4/0x138 sys_write+0xc/0x18 __sys_trace_return+0x0/0x4 This is because when loading the cpufreq_dt module we first acquire cpu_hotplug_lock.rw_sem lock, then in cpufreq_init(), we are taking the &sb->s_type->i_mutex_key lock. But when writing to /sys/kernel/debug/sched_features, the cpu_hotplug_lock.rw_sem lock depends on the &sb->s_type->i_mutex_key lock. To fix this bug, reverse the lock acquisition order when writing to sched_features, this way cpu_hotplug_lock.rw_sem no longer depends on &sb->s_type->i_mutex_key. Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Jiada Wang <jiada_wang@mentor.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Eugeniu Rosca <erosca@de.adit-jv.com> Cc: George G. Davis <george_davis@mentor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180731121222.26195-1-jiada_wang@mentor.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10locking/mutex: Fix mutex debug call and ww_mutex documentationThomas Hellstrom
The following commit: 08295b3b5bee ("Implement an algorithm choice for Wound-Wait mutexes") introduced a reference in the documentation to a function that was removed in an earlier commit. It also forgot to remove a call to debug_mutex_add_waiter() which is now unconditionally called by __mutex_add_waiter(). Fix those bugs. Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dri-devel@lists.freedesktop.org Fixes: 08295b3b5bee ("Implement an algorithm choice for Wound-Wait mutexes") Link: http://lkml.kernel.org/r/20180903140708.2401-1-thellstrom@vmware.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-10jump_label: Use static_key_linked() accessorBorislav Petkov
... instead of open-coding it, in static_key_mod(). No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180909114252.17575-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-09Merge tag 'perf-urgent-for-mingo-4.19-20180903' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/urgent fixes from Arnaldo Carvalho de Melo: Kernel: - Modify breakpoint fixes (Jiri Olsa) perf annotate: - Fix parsing aarch64 branch instructions after objdump update (Kim Phillips) - Fix parsing indirect calls in 'perf annotate' (Martin Liška) perf probe: - Ignore SyS symbols irrespective of endianness on PowerPC (Sandipan Das) perf trace: - Fix include path for asm-generic/unistd.h on arm64 (Kim Phillips) Core libraries: - Fix potential null pointer dereference in perf_evsel__new_idx() (Hisao Tanabe) - Use fixed size string for comms instead of scanf("%m"), that is not present in the bionic libc and leads to a crash (Chris Phlipot) - Fix bad memory access in trace info on 32-bit systems, we were reading 8 bytes from a 4-byte long variable when saving the command line in the perf.data file. (Chris Phlipot) Build system: - Streamline bpf examples and headers installation, clarifying some install messages. (Arnaldo Carvalho de Melo) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-09-09Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timekeeping fixes from Thomas Gleixner: "Two fixes for timekeeping: - Revert to the previous kthread based update, which is unfortunately required due to lock ordering issues. The removal caused boot failures on old Core2 machines. Add a proper comment why the thread needs to stay to prevent accidental removal in the future. - Fix a silly typo in a function declaration" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Revert "Remove kthread" timekeeping: Fix declaration of read_persistent_wall_and_boot_offset()
2018-09-09Merge branch 'smp-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull cpu hotplug fixes from Thomas Gleixner: "Two fixes for the hotplug state machine code: - Move the misplaces smb() in the hotplug thread function to the proper place, otherwise a half update control struct could be observed - Prevent state corruption on error rollback, which causes the state to advance by one and as a consequence skip it in the bringup sequence" * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu/hotplug: Prevent state corruption on error rollback cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun()
2018-09-08dma-mapping: remove dma_deconfigureChristoph Hellwig
This goes through a lot of hooks just to call arch_teardown_dma_ops. Replace it with a direct call instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2018-09-08dma-mapping: remove dma_configureChristoph Hellwig
There is no good reason for this indirection given that the method always exists. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2018-09-06clocksource: Revert "Remove kthread"Peter Zijlstra
I turns out that the silly spawn kthread from worker was actually needed. clocksource_watchdog_kthread() cannot be called directly from clocksource_watchdog_work(), because clocksource_select() calls timekeeping_notify() which uses stop_machine(). One cannot use stop_machine() from a workqueue() due lock inversions wrt CPU hotplug. Revert the patch but add a comment that explain why we jump through such apparently silly hoops. Fixes: 7197e77abcb6 ("clocksource: Remove kthread") Reported-by: Siegfried Metz <frame@mailbox.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Niklas Cassel <niklas.cassel@linaro.org> Tested-by: Kevin Shanahan <kevin@shanahan.id.au> Tested-by: viktor_jaegerskuepper@freenet.de Tested-by: Siegfried Metz <frame@mailbox.org> Cc: rafael.j.wysocki@intel.com Cc: len.brown@intel.com Cc: diego.viola@gmail.com Cc: rui.zhang@intel.com Cc: bjorn.andersson@linaro.org Link: https://lkml.kernel.org/r/20180905084158.GR24124@hirez.programming.kicks-ass.net
2018-09-06seccomp: remove unnecessary unlikely()Igor Stoppa
WARN_ON() already contains an unlikely(), so it's not necessary to wrap it into another. Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: linux-security-module@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: James Morris <james.morris@microsoft.com>
2018-09-06Merge tag 'trace-v4.19-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "This fixes two annoying bugs: - The first one is a side effect caused by using SRCU for rcuidle tracepoints. It seems that the perf was depending on the rcuidle tracepoints to make RCU watch when it wasn't. The real fix will be to have perf use SRCU instead of depending on RCU watching, but that can't be done until SRCU is safe to use in NMI context (Paul's working on that). - The second bug fix is for a bug that's been periodically making my tests fail randomly for some time. I haven't had time to track it down, but finally have. It has to do with stressing NMIs (via perf) while enabling or disabling ftrace function handling with lockdep enabled. If an interrupt happens and just as it returns, it sets lockdep back to "interrupts enabled" but before it returns an NMI is triggered, and if this happens while printk_nmi_enter has a breakpoint attached to it (because ftrace is converting it to or from nop to call fentry), the breakpoint trap also calls into lockdep, and since returning from the NMI to a interrupt handler, interrupts were disabled when the NMI went off, lockdep keeps its state as interrupts disabled when it returns back from the interrupt handler where interrupts are enabled. This causes lockdep_assert_irqs_enabled() to trigger a false positive" * tag 'trace-v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: printk/tracing: Do not trace printk_nmi_enter() tracing: Add back in rcu_irq_enter/exit_irqson() for rcuidle tracepoints
2018-09-06printk/tracing: Do not trace printk_nmi_enter()Steven Rostedt (VMware)
I hit the following splat in my tests: ------------[ cut here ]------------ IRQs not enabled as expected WARNING: CPU: 3 PID: 0 at kernel/time/tick-sched.c:982 tick_nohz_idle_enter+0x44/0x8c Modules linked in: ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipv6 CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.19.0-rc2-test+ #2 Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014 EIP: tick_nohz_idle_enter+0x44/0x8c Code: ec 05 00 00 00 75 26 83 b8 c0 05 00 00 00 75 1d 80 3d d0 36 3e c1 00 75 14 68 94 63 12 c1 c6 05 d0 36 3e c1 01 e8 04 ee f8 ff <0f> 0b 58 fa bb a0 e5 66 c1 e8 25 0f 04 00 64 03 1d 28 31 52 c1 8b EAX: 0000001c EBX: f26e7f8c ECX: 00000006 EDX: 00000007 ESI: f26dd1c0 EDI: 00000000 EBP: f26e7f40 ESP: f26e7f38 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010296 CR0: 80050033 CR2: 0813c6b0 CR3: 2f342000 CR4: 001406f0 Call Trace: do_idle+0x33/0x202 cpu_startup_entry+0x61/0x63 start_secondary+0x18e/0x1ed startup_32_smp+0x164/0x168 irq event stamp: 18773830 hardirqs last enabled at (18773829): [<c040150c>] trace_hardirqs_on_thunk+0xc/0x10 hardirqs last disabled at (18773830): [<c040151c>] trace_hardirqs_off_thunk+0xc/0x10 softirqs last enabled at (18773824): [<c0ddaa6f>] __do_softirq+0x25f/0x2bf softirqs last disabled at (18773767): [<c0416bbe>] call_on_stack+0x45/0x4b ---[ end trace b7c64aa79e17954a ]--- After a bit of debugging, I found what was happening. This would trigger when performing "perf" with a high NMI interrupt rate, while enabling and disabling function tracer. Ftrace uses breakpoints to convert the nops at the start of functions to calls to the function trampolines. The breakpoint traps disable interrupts and this makes calls into lockdep via the trace_hardirqs_off_thunk in the entry.S code. What happens is the following: do_idle { [interrupts enabled] <interrupt> [interrupts disabled] TRACE_IRQS_OFF [lockdep says irqs off] [...] TRACE_IRQS_IRET test if pt_regs say return to interrupts enabled [yes] TRACE_IRQS_ON [lockdep says irqs are on] <nmi> nmi_enter() { printk_nmi_enter() [traced by ftrace] [ hit ftrace breakpoint ] <breakpoint exception> TRACE_IRQS_OFF [lockdep says irqs off] [...] TRACE_IRQS_IRET [return from breakpoint] test if pt_regs say interrupts enabled [no] [iret back to interrupt] [iret back to code] tick_nohz_idle_enter() { lockdep_assert_irqs_enabled() [lockdep say no!] Although interrupts are indeed enabled, lockdep thinks it is not, and since we now do asserts via lockdep, it gives a false warning. The issue here is that printk_nmi_enter() is called before lockdep_off(), which disables lockdep (for this reason) in NMIs. By simply not allowing ftrace to see printk_nmi_enter() (via notrace annotation) we keep lockdep from getting confused. Cc: stable@vger.kernel.org Fixes: 42a0bb3f71383 ("printk/nmi: generic solution for safe printk in NMI") Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-09-06cpu/hotplug: Prevent state corruption on error rollbackThomas Gleixner
When a teardown callback fails, the CPU hotplug code brings the CPU back to the previous state. The previous state becomes the new target state. The rollback happens in undo_cpu_down() which increments the state unconditionally even if the state is already the same as the target. As a consequence the next CPU hotplug operation will start at the wrong state. This is easily to observe when __cpu_disable() fails. Prevent the unconditional undo by checking the state vs. target before incrementing state and fix up the consequently wrong conditional in the unplug code which handles the failure of the final CPU take down on the control CPU side. Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core") Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Neeraj Upadhyay <neeraju@codeaurora.org> Cc: josh@joshtriplett.org Cc: peterz@infradead.org Cc: jiangshanlai@gmail.com Cc: dzickus@redhat.com Cc: brendan.jackman@arm.com Cc: malat@debian.org Cc: sramana@codeaurora.org Cc: linux-arm-msm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1809051419580.1416@nanos.tec.linutronix.de ----
2018-09-06cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun()Neeraj Upadhyay
The smp_mb() in cpuhp_thread_fun() is misplaced. It needs to be after the load of st->should_run to prevent reordering of the later load/stores w.r.t. the load of st->should_run. Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infraded.org> Cc: josh@joshtriplett.org Cc: peterz@infradead.org Cc: jiangshanlai@gmail.com Cc: dzickus@redhat.com Cc: brendan.jackman@arm.com Cc: malat@debian.org Cc: mojha@codeaurora.org Cc: sramana@codeaurora.org Cc: linux-arm-msm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1536126727-11629-1-git-send-email-neeraju@codeaurora.org
2018-09-05bpf/verifier: fix verifier instabilityAlexei Starovoitov
Edward Cree says: In check_mem_access(), for the PTR_TO_CTX case, after check_ctx_access() has supplied a reg_type, the other members of the register state are set appropriately. Previously reg.range was set to 0, but as it is in a union with reg.map_ptr, which is larger, upper bytes of the latter were left in place. This then caused the memcmp() in regsafe() to fail, preventing some branches from being pruned (and occasionally causing the same program to take a varying number of processed insns on repeated verifier runs). Fix the instability by clearing bpf_reg_state in __mark_reg_[un]known() Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Debugged-by: Edward Cree <ecree@solarflare.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2018-09-04Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "17 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: nilfs2: convert to SPDX license tags drivers/dax/device.c: convert variable to vm_fault_t type lib/Kconfig.debug: fix three typos in help text checkpatch: add __ro_after_init to known $Attribute mm: fix BUG_ON() in vmf_insert_pfn_pud() from VM_MIXEDMAP removal uapi/linux/keyctl.h: don't use C++ reserved keyword as a struct member name memory_hotplug: fix kernel_panic on offline page processing checkpatch: add optional static const to blank line declarations test ipc/shm: properly return EIDRM in shm_lock() mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported. mm/util.c: improve kvfree() kerneldoc tools/vm/page-types.c: fix "defined but not used" warning tools/vm/slabinfo.c: fix sign-compare warning kmemleak: always register debugfs file mm: respect arch_dup_mmap() return value mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm(). mm: memcontrol: print proper OOM header when no eligible victim left
2018-09-04mm: respect arch_dup_mmap() return valueNadav Amit
Commit d70f2a14b72a ("include/linux/sched/mm.h: uninline mmdrop_async(), etc") ignored the return value of arch_dup_mmap(). As a result, on x86, a failure to duplicate the LDT (e.g. due to memory allocation error) would leave the duplicated memory mapping in an inconsistent state. Fix by using the return value, as it was before the change. Link: http://lkml.kernel.org/r/20180823051229.211856-1-namit@vmware.com Fixes: d70f2a14b72a4 ("include/linux/sched/mm.h: uninline mmdrop_async(), etc") Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Must perform TXQ teardown before unregistering interfaces in mac80211, from Toke Høiland-Jørgensen. 2) Don't allow creating mac80211_hwsim with less than one channel, from Johannes Berg. 3) Division by zero in cfg80211, fix from Johannes Berg. 4) Fix endian issue in tipc, from Haiqing Bai. 5) BPF sockmap use-after-free fixes from Daniel Borkmann. 6) Spectre-v1 in mac80211_hwsim, from Jinbum Park. 7) Missing rhashtable_walk_exit() in tipc, from Cong Wang. 8) Revert kvzalloc() conversion of AF_PACKET, it breaks mmap() when kvzalloc() tries to use kmalloc() pages. From Eric Dumazet. 9) Fix deadlock in hv_netvsc, from Dexuan Cui. 10) Do not restart timewait timer on RST, from Florian Westphal. 11) Fix double lwstate refcount grab in ipv6, from Alexey Kodanev. 12) Unsolicit report count handling is off-by-one, fix from Hangbin Liu. 13) Sleep-in-atomic in cadence driver, from Jia-Ju Bai. 14) Respect ttl-inherit in ip6 tunnel driver, from Hangbin Liu. 15) Use-after-free in act_ife, fix from Cong Wang. 16) Missing hold to meta module in act_ife, from Vlad Buslov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (91 commits) net: phy: sfp: Handle unimplemented hwmon limits and alarms net: sched: action_ife: take reference to meta module act_ife: fix a potential use-after-free net/mlx5: Fix SQ offset in QPs with small RQ tipc: correct spelling errors for tipc_topsrv_queue_evt() comments tipc: correct spelling errors for struct tipc_bc_base's comment bnxt_en: Do not adjust max_cp_rings by the ones used by RDMA. bnxt_en: Clean up unused functions. bnxt_en: Fix firmware signaled resource change logic in open. sctp: not traverse asoc trans list if non-ipv6 trans exists for ipv6_flowlabel sctp: fix invalid reference to the index variable of the iterator net/ibm/emac: wrong emac_calc_base call was used by typo net: sched: null actions array pointer before releasing action vhost: fix VHOST_GET_BACKEND_FEATURES ioctl request definition r8169: add support for NCube 8168 network card ip6_tunnel: respect ttl inherit for ip6tnl mac80211: shorten the IBSS debug messages mac80211: don't Tx a deauth frame if the AP forbade Tx mac80211: Fix station bandwidth setting after channel switch mac80211: fix a race between restart and CSA flows ...
2018-09-04stackleak: Allow runtime disabling of kernel stack erasingAlexander Popov
Introduce CONFIG_STACKLEAK_RUNTIME_DISABLE option, which provides 'stack_erasing' sysctl. It can be used in runtime to control kernel stack erasing for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Alexander Popov <alex.popov@linux.com> Tested-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-04fs/proc: Show STACKLEAK metrics in the /proc file systemAlexander Popov
Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about tasks via the /proc file system. In particular, /proc/<pid>/stack_depth shows the maximum kernel stack consumption for the current and previous syscalls. Although this information is not precise, it can be useful for estimating the STACKLEAK performance impact for your workloads. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Alexander Popov <alex.popov@linux.com> Tested-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org>