summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2015-04-02bpf: Fix the build on BPF_SYSCALL=y && !CONFIG_TRACING kernels, make it more ↵Ingo Molnar
configurable So bpf_tracing.o depends on CONFIG_BPF_SYSCALL - but that's not its only dependency, it also depends on the tracing infrastructure and on kprobes, without which it will fail to build with: In file included from kernel/trace/bpf_trace.c:14:0: kernel/trace/trace.h: In function ‘trace_test_and_set_recursion’: kernel/trace/trace.h:491:28: error: ‘struct task_struct’ has no member named ‘trace_recursion’ unsigned int val = current->trace_recursion; [...] It took quite some time to trigger this build failure, because right now BPF_SYSCALL is very obscure, depends on CONFIG_EXPERT. So also make BPF_SYSCALL more configurable, not just under CONFIG_EXPERT. If BPF_SYSCALL, tracing and kprobes are enabled then enable the bpf_tracing gateway as well. We might want to make this an interactive option later on, although I'd not complicate it unnecessarily: enabling BPF_SYSCALL is enough of an indicator that the user wants BPF support. Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02clockevents: Fix cpu_down() race for hrtimer based broadcastingPreeti U Murthy
It was found when doing a hotplug stress test on POWER, that the machine either hit softlockups or rcu_sched stall warnings. The issue was traced to commit: 7cba160ad789 ("powernv/cpuidle: Redesign idle states management") which exposed the cpu_down() race with hrtimer based broadcast mode: 5d1638acb9f6 ("tick: Introduce hrtimer based broadcast") The race is the following: Assume CPU1 is the CPU which holds the hrtimer broadcasting duty before it is taken down. CPU0 CPU1 cpu_down() take_cpu_down() disable_interrupts() cpu_die() while (CPU1 != CPU_DEAD) { msleep(100); switch_to_idle(); stop_cpu_timer(); schedule_broadcast(); } tick_cleanup_cpu_dead() take_over_broadcast() So after CPU1 disabled interrupts it cannot handle the broadcast hrtimer anymore, so CPU0 will be stuck forever. Fix this by explicitly taking over broadcast duty before cpu_die(). This is a temporary workaround. What we really want is a callback in the clockevent device which allows us to do that from the dying CPU by pushing the hrtimer onto a different cpu. That might involve an IPI and is definitely more complex than this immediate fix. Changelog was picked up from: https://lkml.org/lkml/2015/2/16/213 Suggested-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: mpe@ellerman.id.au Cc: nicolas.pitre@linaro.org Cc: peterz@infradead.org Cc: rjw@rjwysocki.net Fixes: http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html Link: http://lkml.kernel.org/r/20150330092410.24979.59887.stgit@preeti.in.ibm.com [ Merged it to the latest timer tree, renamed the callback, tidied up the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02tracing: Allow BPF programs to call bpf_trace_printk()Alexei Starovoitov
Debugging of BPF programs needs some form of printk from the program, so let programs call limited trace_printk() with %d %u %x %p modifiers only. Similar to kernel modules, during program load verifier checks whether program is calling bpf_trace_printk() and if so, kernel allocates trace_printk buffers and emits big 'this is debug only' banner. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427312966-8434-6-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02tracing: Allow BPF programs to call bpf_ktime_get_ns()Alexei Starovoitov
bpf_ktime_get_ns() is used by programs to compute time delta between events or as a timestamp Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427312966-8434-5-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02tracing, perf: Implement BPF programs attached to kprobesAlexei Starovoitov
BPF programs, attached to kprobes, provide a safe way to execute user-defined BPF byte-code programs without being able to crash or hang the kernel in any way. The BPF engine makes sure that such programs have a finite execution time and that they cannot break out of their sandbox. The user interface is to attach to a kprobe via the perf syscall: struct perf_event_attr attr = { .type = PERF_TYPE_TRACEPOINT, .config = event_id, ... }; event_fd = perf_event_open(&attr,...); ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); 'prog_fd' is a file descriptor associated with BPF program previously loaded. 'event_id' is an ID of the kprobe created. Closing 'event_fd': close(event_fd); ... automatically detaches BPF program from it. BPF programs can call in-kernel helper functions to: - lookup/update/delete elements in maps - probe_read - wraper of probe_kernel_read() used to access any kernel data structures BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is architecture dependent) and return 0 to ignore the event and 1 to store kprobe event into the ring buffer. Note, kprobes are a fundamentally _not_ a stable kernel ABI, so BPF programs attached to kprobes must be recompiled for every kernel version and user must supply correct LINUX_VERSION_CODE in attr.kern_version during bpf_prog_load() call. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02tracing: Add kprobe flagAlexei Starovoitov
add TRACE_EVENT_FL_KPROBE flag to differentiate kprobe type of tracepoints, since bpf programs can only be attached to kprobe type of PERF_TYPE_TRACEPOINT perf events. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427312966-8434-3-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-02tick: Further simplify tick-internal.hIngo Molnar
Move the broadcasting related section to the GENERIC_CLOCKEVENTS=y section - this also solves build failures on architectures that don't use generic clockevents yet. Also standardize include file style to make it easier to read, and use nesting depth aware preprocessor directives to make future merges easier. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-01Merge tag 'lazytime_fix' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull lazytime fixes from Ted Ts'o: "This fixes a problem in the lazy time patches, which can cause frequently updated inods to never have their timestamps updated. These changes guarantee that no timestamp on disk will be stale by more than 24 hours" * tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: fs: add dirtytime_expire_seconds sysctl fs: make sure the timestamps for lazytime inodes eventually get written
2015-04-01arm/bL_switcher: Kill tick suspend hackeryThomas Gleixner
Use the new tick_suspend/resume_local() and get rid of the homebrewn implementation of these in the ARM bL switcher. The check for the cpumask is completely pointless. There is no harm to suspend a per cpu tick device unconditionally. If that's a real issue then we fix it proper at the core level and not with some completely undocumented hacks in some random core code. Move the tick internals to the core code, now that this nuisance is gone. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ rjw: Rebase, changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Link: http://lkml.kernel.org/r/1655112.Ws17YsMfN7@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-01tick/xen: Provide and use tick_suspend_local() and tick_resume_local()Thomas Gleixner
Xen calls on every cpu into tick_resume() which is just wrong. tick_resume() is for the syscore global suspend/resume invocation. What XEN really wants is a per cpu local resume function. Provide a tick_resume_local() function and use it in XEN. Also provide a complementary tick_suspend_local() and modify tick_unfreeze() and tick_freeze(), respectively, to use the new local tick resume/suspend functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ Combined two patches, rebased, modified subject/changelog. ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1698741.eezk9tnXtG@vostro.rjw.lan [ Merged to latest timers/core. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-01tick: Make tick_resume_broadcast_oneshot() staticThomas Gleixner
Solely used in tick-broadcast.c and the return value is hardcoded 0. Make it static and void. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1689058.QkHYDJSRKu@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-01clockevents: Make suspend/resume calls explicitThomas Gleixner
clockevents_notify() is a leftover from the early design of the clockevents facility. It's really not a notification mechanism, it's a multiplex call. We are way better off to have explicit calls instead of this monstrosity. Split out the suspend/resume() calls and invoke them directly from the call sites. No locking required at this point because these calls happen with interrupts disabled and a single cpu online. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ Rebased on top of 4.0-rc5. ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/713674030.jVm1qaHuPf@vostro.rjw.lan [ Rebased on top of latest timers/core. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-01clockevents: Remove extra local_irq_save() in clockevents_exchange_device()Thomas Gleixner
Called with 'clockevents_lock' held and interrupts disabled already. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/51005827.yXt5tjZMBs@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-01tick: Move core only declarations and functions to coreThomas Gleixner
No point to expose everything to the world. People just believe such functions can be abused for whatever purposes. Sigh. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ Rebased on top of 4.0-rc5 ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Nicolas Pitre <nico@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/28017337.VbCUc39Gme@vostro.rjw.lan [ Merged to latest timers/core ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-01tick: Simplify tick-internal.hThomas Gleixner
tick-internal.h is pretty confusing as a lot of the stub inlines are there several times. Distangle the maze and make clear functional sections. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ rjw: Subject ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/16068264.vcNp79HLaT@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-01tick: Move clocksource related stuff to timekeeping.hThomas Gleixner
Move clocksource related stuff to timekeeping.h and remove the pointless include from ntp.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [ rjw: Subject ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/2714218.nM5AEfAHj0@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-01clockevents: Remove CONFIG_GENERIC_CLOCKEVENTS_BUILDThomas Gleixner
This option was for simpler migration to the clock events code. Most architectures have been converted and the option has been disfunctional as a standalone option for quite some time. Remove it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5021859.jl9OC1medj@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-31Merge tag 'v4.0-rc6' into timers/core, before applying new patchesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-30ring-buffer: Remove duplicate use of '&' in recursive codeSteven Rostedt (Red Hat)
A clean up of the recursive protection code changed val = this_cpu_read(current_context); val--; val &= this_cpu_read(current_context); to val = this_cpu_read(current_context); val &= val & (val - 1); Which has a duplicate use of '&' as the above is the same as val = val & (val - 1); Actually, it would be best to remove that line altogether and just add it to where it is used. And Christoph even mentioned that it can be further compacted to just a single line: __this_cpu_and(current_context, __this_cpu_read(current_context) - 1); Link: http://lkml.kernel.org/alpine.DEB.2.11.1503271423580.23114@gentwo.org Suggested-by: Christoph Lameter <cl@linux.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-03-29tc: bpf: generalize pedit actionAlexei Starovoitov
existing TC action 'pedit' can munge any bits of the packet. Generalize it for use in bpf programs attached as cls_bpf and act_bpf via bpf_skb_store_bytes() helper function. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-28Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Two clocksource driver fixes, and an idle loop RCU warning fix" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/drivers/sun5i: Fix cpufreq interaction with sched_clock() clocksource/drivers: Fix various !CONFIG_HAS_IOMEM build errors timers/tick/broadcast-hrtimer: Fix suspicious RCU usage in idle loop
2015-03-28Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "A single sched/rt corner case fix for RLIMIT_RTIME correctness" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix RLIMIT_RTTIME when PI-boosting to RT
2015-03-28Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Ingo Molnar: "A perf kernel side fix for a fuzzer triggered lockup" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix irq_work 'tail' recursion
2015-03-28Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "A module unload lockdep race fix" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lockdep: Fix the module unload key range freeing logic
2015-03-27clockevents: Don't validate dev->mode against CLOCK_EVT_MODE_UNUSED for new ↵Viresh Kumar
interface It was a requirement in the legacy interface that drivers must initialize ->mode field to 'CLOCK_EVT_MODE_UNUSED'. This field isn't used anymore by the new interface and so should be only checked for the legacy interface. Probably it can be dropped as well as core doesn't rely on it anymore, but lets keep it to support legacy interface. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Kevin Hilman <khilman@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: linaro-kernel@lists.linaro.org Cc: linaro-networking@linaro.org Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/c6604fa1a77fe1fc8dcab87769857228fb1dadd5.1425037853.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27clockevents: Manage device's state separately for the coreViresh Kumar
'enum clock_event_mode' is used for two purposes today: - to pass mode to the driver of clockevent device::set_mode(). - for managing state of the device for clockevents core. For supporting new modes/states we have moved away from the legacy set_mode() callback to new per-mode/state callbacks. New modes/states shouldn't be exposed to the legacy (now OBSOLOTE) callbacks and so we shouldn't add new states to 'enum clock_event_mode'. Lets have separate enums for the two use cases mentioned above. Keep using the earlier enum for legacy set_mode() callback and mark it OBSOLETE. And add another enum to clearly specify the possible states of a clockevent device. This also renames the newly added per-mode callbacks to reflect state changes. We haven't got rid of 'mode' member of 'struct clock_event_device' as it is used by some of the clockevent drivers and it would automatically die down once we migrate those drivers to the new interface. It ('mode') is only updated now for the drivers using the legacy interface. Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Kevin Hilman <khilman@linaro.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: linaro-kernel@lists.linaro.org Cc: linaro-networking@linaro.org Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/b6b0143a8a57bd58352ad35e08c25424c879c0cb.1425037853.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27clockevents: Handle tick device's resume separatelyViresh Kumar
Upcoming patch will redefine possible states of a clockevent device. The RESUME mode is a special case only for tick's clockevent devices. In future it can be replaced by ->resume() callback already available for clockevent devices. Lets handle it separately so that clockevents_set_mode() only handles states valid across all devices. This also renames set_mode_resume() to tick_resume() to make it more explicit. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Kevin Hilman <khilman@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: linaro-kernel@lists.linaro.org Cc: linaro-networking@linaro.org Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/c1b0112410870f49e7bf06958e1483eac6c15e20.1425037853.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27perf: Add per event clockid supportPeter Zijlstra
While thinking on the whole clock discussion it occurred to me we have two distinct uses of time: 1) the tracking of event/ctx/cgroup enabled/running/stopped times which includes the self-monitoring support in struct perf_event_mmap_page. 2) the actual timestamps visible in the data records. And we've been conflating them. The first is all about tracking time deltas, nobody should really care in what time base that happens, its all relative information, as long as its internally consistent it works. The second however is what people are worried about when having to merge their data with external sources. And here we have the discussion on MONOTONIC vs MONOTONIC_RAW etc.. Where MONOTONIC is good for correlating between machines (static offset), MONOTNIC_RAW is required for correlating against a fixed rate hardware clock. This means configurability; now 1) makes that hard because it needs to be internally consistent across groups of unrelated events; which is why we had to have a global perf_clock(). However, for 2) it doesn't really matter, perf itself doesn't care what it writes into the buffer. The below patch makes the distinction between these two cases by adding perf_event_clock() which is used for the second case. It further makes this configurable on a per-event basis, but adds a few sanity checks such that we cannot combine events with different clocks in confusing ways. And since we then have per-event configurability we might as well retain the 'legacy' behaviour as a default. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27Merge branch 'perf/core' into perf/timer, before applying new changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27Merge branch 'timers/core' into perf/timer, to apply dependent patchIngo Molnar
An upcoming patch will depend on tai_ns() and NMI-safe ktime_get_raw_fast(), so merge timers/core here in a separate topic branch until it's all cooked and timers/core is merged upstream. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Documentation updates. - Changes permitting use of call_rcu() and friends very early in boot, for example, before rcu_init() is invoked. - Miscellaneous fixes. - Add in-kernel API to enable and disable expediting of normal RCU grace periods. - Improve RCU's handling of (hotplug-) outgoing CPUs. Note: ARM support is lagging a bit here, and these improved diagnostics might generate (harmless) splats. - NO_HZ_FULL_SYSIDLE fixes. - Tiny RCU updates to make it more tiny. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27perf: Fix racy group accessPeter Zijlstra
While looking at some fuzzer output I noticed that we do not hold any locks on leader->ctx and therefore the sibling_list iteration is unsafe. Acquire the relevant ctx->mutex before calling into the pmu specific code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Link: http://lkml.kernel.org/r/20150225151639.GL5029@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27Merge branch 'perf/x86' into perf/core, because it's readyIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27Merge branch 'perf/urgent' into perf/core, to pick up fixes and to refresh ↵Ingo Molnar
the tree Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27time: Introduce tk_fast_rawPeter Zijlstra
Add the NMI safe CLOCK_MONOTONIC_RAW accessor.. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150319093400.562746929@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27time: Parametrize all tk_fast_mono usersPeter Zijlstra
In preparation for more tk_fast instances, remove all hard-coded tk_fast_mono references. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150319093400.484279927@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27time: Add timerkeeper::tkr_rawPeter Zijlstra
Introduce tkr_raw and make use of it. base_raw -> tkr_raw.base clock->{mult,shift} -> tkr_raw.{mult.shift} Kill timekeeping_get_ns_raw() in favour of timekeeping_get_ns(&tkr_raw), this removes all mono_raw special casing. Duplicate the updates to tkr_mono.cycle_last into tkr_raw.cycle_last, both need the same value. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150319093400.422589590@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27time: Rename timekeeper::tkr to timekeeper::tkr_monoPeter Zijlstra
In preparation of adding another tkr field, rename this one to tkr_mono. Also rename tk_read_base::base_mono to tk_read_base::base, since the structure is not specific to CLOCK_MONOTONIC and the mono name got added to the tk_read_base instance. Lots of trivial churn. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150319093400.344679419@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched/deadline: Fix rt runtime corruption when dl fails its global constraintsWanpeng Li
One version of sched_rt_global_constaints() (the !rt-cgroup one) changes state, therefore if we fail the later sched_dl_global_constraints() call the state is left in an inconsistent state. Fix this by changing the order of the calls. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> [ Improved the changelog. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/1426590931-4639-2-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched/deadline: Avoid a superfluous checkWanpeng Li
Since commit 40767b0dc768 ("sched/deadline: Fix deadline parameter modification handling") we clear the thottled state when switching from a dl task, therefore we should never find it set in switching to a dl task. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> [ Improved the changelog. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/1426590931-4639-1-git-send-email-wanpeng.li@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched: Improve load balancing in the presence of idle CPUsPreeti U Murthy
When a CPU is kicked to do nohz idle balancing, it wakes up to do load balancing on itself, followed by load balancing on behalf of idle CPUs. But it may end up with load after the load balancing attempt on itself. This aborts nohz idle balancing. As a result several idle CPUs are left without tasks till such a time that an ILB CPU finds it unfavorable to pull tasks upon itself. This delays spreading of load across idle CPUs and worse, clutters only a few CPUs with tasks. The effect of the above problem was observed on an SMT8 POWER server with 2 levels of numa domains. Busy loops equal to number of cores were spawned. Since load balancing on fork/exec is discouraged across numa domains, all busy loops would start on one of the numa domains. However it was expected that eventually one busy loop would run per core across all domains due to nohz idle load balancing. But it was observed that it took as long as 10 seconds to spread the load across numa domains. Further investigation showed that this was a consequence of the following: 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle load balancing [Given the experiment, upto 6 CPUs per core could be potentially idle in this domain.] 2. However the ILB CPU would call load_balance() on itself before initiating nohz idle load balancing. 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull tasks from its sibling cores to even out load. 4. Now that the ILB CPU was no longer idle, it would abort nohz idle load balancing As a result the opportunities to spread load across numa domains were lost until such a time that the cores within the first numa domain had equal number of tasks among themselves. This is a pretty bad scenario, since the cores within the first numa domain would have as many as 4 tasks each, while cores in the neighbouring numa domains would all remain idle. Fix this, by checking if a CPU was woken up to do nohz idle load balancing, before it does load balancing upon itself. This way we allow idle CPUs across the system to do load balancing which results in quicker spread of load, instead of performing load balancing within the local sched domain hierarchy of the ILB CPU alone under circumstances such as above. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jason Low <jason.low2@hp.com> Cc: benh@kernel.crashing.org Cc: daniel.lezcano@linaro.org Cc: efault@gmx.de Cc: iamjoonsoo.kim@lge.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: riel@redhat.com Cc: srikar@linux.vnet.ibm.com Cc: svaidy@linux.vnet.ibm.com Cc: tim.c.chen@linux.intel.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched: Optimize freq invariant accountingPeter Zijlstra
Currently the freq invariant accounting (in __update_entity_runnable_avg() and sched_rt_avg_update()) get the scale factor from a weak function call, this means that even for archs that default on their implementation the compiler cannot see into this function and optimize the extra scaling math away. This is sad, esp. since its a 64-bit multiplication which can be quite costly on some platforms. So replace the weak function with #ifdef and __always_inline goo. This is not quite as nice from an arch support PoV but should at least result in compile time errors if done wrong. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ben Segall <bsegall@google.com> Cc: Morten.Rasmussen@arm.com Cc: Paul Turner <pjt@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched: Move CFS tasks to CPUs with higher capacityVincent Guittot
When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining capacity for CFS tasks can be significantly reduced. Once we detect such situation by comparing cpu_capacity_orig and cpu_capacity, we trig an idle load balance to check if it's worth moving its tasks on an idle CPU. It's worth trying to move the task before the CPU is fully utilized to minimize the preemption by irq or RT tasks. Once the idle load_balance has selected the busiest CPU, it will look for an active load balance for only two cases: - There is only 1 task on the busiest CPU. - We haven't been able to move a task of the busiest rq. A CPU with a reduced capacity is included in the 1st case, and it's worth to actively migrate its task if the idle CPU has got more available capacity for CFS tasks. This test has been added in need_active_balance. As a sidenote, this will not generate more spurious ilb because we already trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trig the ilb once for migrating the task. The nohz_kick_needed function has been cleaned up a bit while adding the new test env.src_cpu and env.src_rq must be set unconditionnally because they are used in need_active_balance which is called even if busiest->nr_running equals 1 Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425052454-25797-12-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched: Add SD_PREFER_SIBLING for SMT levelVincent Guittot
Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that the scheduler will place at least one task per core. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425052454-25797-11-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched: Remove unused struct sched_group_capacity::capacity_origVincent Guittot
The 'struct sched_group_capacity::capacity_orig' field is no longer used in the scheduler so we can remove it. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425378903-5349-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched: Replace capacity_factor by usageVincent Guittot
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is SCHED_CAPACITY_SCALE. 'struct sg_lb_stats:group_capacity_factor' divides the capacity of the group by SCHED_LOAD_SCALE to estimate how many task can run in the group. Then, it compares this value with the sum of nr_running to decide if the group is overloaded or not. But the 'group_capacity_factor' concept is hardly working for SMT systems, it sometimes works for big cores but fails to do the right thing for little cores. Below are two examples to illustrate the problem that this patch solves: 1- If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2 (div_round_closest(3x640/1024) = 2) which means that it will be seen as overloaded even if we have only one task per CPU. 2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4 (at max and thanks to the fix [0] for SMT system that prevent the apparition of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is reduced to nearly nothing), the capacity factor of the group will still be 4 (div_round_closest(3*1512/1024) = 5 which is cap to 4 with [0]). So, this patch tries to solve this issue by removing capacity_factor and replacing it with the 2 following metrics: - The available CPU's capacity for CFS tasks which is already used by load_balance(). - The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib has been re-introduced to compute the usage of a CPU by CFS tasks. 'group_capacity_factor' and 'group_has_free_capacity' has been removed and replaced by 'group_no_capacity'. We compare the number of task with the number of CPUs and we evaluate the level of utilization of the CPUs to define if a group is overloaded or if a group has capacity to handle more tasks. For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task so it will be selected in priority (among the overloaded groups). Since [1], SD_PREFER_SIBLING is no more concerned by the computation of 'load_above_capacity' because local is not overloaded. [1] 9a5d9ba6a363 ("sched/fair: Allow calculate_imbalance() to move idle cpus") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1425052454-25797-9-git-send-email-vincent.guittot@linaro.org [ Tidied up the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched: Calculate CPU's usage statistic and put it into struct ↵Vincent Guittot
sg_lb_stats::group_usage Monitor the usage level of each group of each sched_domain level. The usage is the portion of cpu_capacity_orig that is currently used on a CPU or group of CPUs. We use the utilization_load_avg to evaluate the usage level of each group. The utilization_load_avg only takes into account the running time of the CFS tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully utilized. Nevertheless, we must cap utilization_load_avg which can be temporally greater than SCHED_LOAD_SCALE after the migration of a task on this CPU and until the metrics are stabilized. The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the running load on the CPU whereas the available capacity for the CFS task is in the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range of the CPU to get the usage of the latter. The usage can then be compared with the available capacity (ie cpu_capacity) to deduct the usage level of a CPU. The frequency scaling invariance of the usage is not taken into account in this patch, it will be solved in another patch which will deal with frequency scaling invariance on the utilization_load_avg. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425455327-13508-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched: Add struct rq::cpu_capacity_origVincent Guittot
This new field 'cpu_capacity_orig' reflects the original capacity of a CPU before being altered by rt tasks and/or IRQ The cpu_capacity_orig will be used: - to detect when the capacity of a CPU has been noticeably reduced so we can trig load balance to look for a CPU with better capacity. As an example, we can detect when a CPU handles a significant amount of irq (with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by scheduler whereas CPUs, which are really idle, are available. - evaluate the available capacity for CFS tasks Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched: Make scale_rt invariant with frequencyVincent Guittot
The average running time of RT tasks is used to estimate the remaining compute capacity for CFS tasks. This remaining capacity is the original capacity scaled down by a factor (aka scale_rt_capacity). This estimation of available capacity must also be invariant with frequency scaling. A frequency scaling factor is applied on the running time of the RT tasks for computing scale_rt_capacity. In sched_rt_avg_update(), we now scale the RT execution time like below: rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT Then, scale_rt_capacity can be summarized by: scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total with available = total - rq->rt_avg This has been been optimized in current code by: scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT) But we can also developed the equation like below: scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total) and we can optimize the equation by removing SCHED_CAPACITY_SHIFT shift in the computation of rq->rt_avg and scale_rt_capacity(). so rq->rt_avg += rt_delta * arch_scale_freq_capacity() and scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total) arch_scale_frequency_capacity() will be called in the hot path of the scheduler which implies to have a short and efficient function. As an example, arch_scale_frequency_capacity() should return a cached value that is updated periodically outside of the hot path. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27sched: Make sched entity usage tracking scale-invariantMorten Rasmussen
Apply frequency scale-invariance correction factor to usage tracking. Each segment of the running_avg_sum geometric series is now scaled by the current frequency so the utilization_avg_contrib of each entity will be invariant with frequency scaling. As a result, utilization_load_avg which is the sum of utilization_avg_contrib, becomes invariant too. So the usage level that is returned by get_cpu_usage(), stays relative to the max frequency as the cpu_capacity which is is compared against. Then, we want the keep the load tracking values in a 32-bit type, which implies that the max value of {runnable|running}_avg_sum must be lower than 2^32/88761=48388 (88761 is the max weigth of a task). As LOAD_AVG_MAX = 47742, arch_scale_freq_capacity() must return a value less than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY = 1024). So we define the range to [0..SCHED_SCALE_CAPACITY] in order to avoid overflow. Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Morten.Rasmussen@arm.com Cc: Paul Turner <pjt@google.com> Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425455186-13451-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>