summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2017-02-03ftrace: Have set_graph_functions handle write with RDWRSteven Rostedt (VMware)
Since reading the set_graph_functions uses seq functions, which sets the file->private_data pointer to a seq_file descriptor. On writes the ftrace_graph_data descriptor is set to file->private_data. But if the file is opened for RDWR, the ftrace_graph_write() will incorrectly use the file->private_data descriptor instead of ((struct seq_file *)file->private_data)->private pointer, and this can crash the kernel. Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03ftrace: Reset fgd->hash in ftrace_graph_write()Steven Rostedt (VMware)
fgd->hash is saved and then freed, but is never reset to either ftrace_graph_hash nor ftrace_graph_notrace_hash. But if multiple writes are performed, then the freed hash could be accessed again. # cd /sys/kernel/debug/tracing # head -1000 available_filter_functions > /tmp/funcs # cat /tmp/funcs > set_graph_function Causes: general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC Modules linked in: [...] CPU: 2 PID: 1337 Comm: cat Not tainted 4.10.0-rc2-test-00010-g6b052e9 #32 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 task: ffff880113a12200 task.stack: ffffc90001940000 RIP: 0010:free_ftrace_hash+0x7c/0x160 RSP: 0018:ffffc90001943db0 EFLAGS: 00010246 RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: 6b6b6b6b6b6b6b6b RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff8800ce1e1d40 RBP: ffff8800ce1e1d50 R08: 0000000000000000 R09: 0000000000006400 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8800ce1e1d40 R14: 0000000000004000 R15: 0000000000000001 FS: 00007f9408a07740(0000) GS:ffff88011e500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000aee1f0 CR3: 0000000116bb4000 CR4: 00000000001406e0 Call Trace: ? ftrace_graph_write+0x150/0x190 ? __vfs_write+0x1f6/0x210 ? __audit_syscall_entry+0x17f/0x200 ? rw_verify_area+0xdb/0x210 ? _cond_resched+0x2b/0x50 ? __sb_start_write+0xb4/0x130 ? vfs_write+0x1c8/0x330 ? SyS_write+0x62/0xf0 ? do_syscall_64+0xa3/0x1b0 ? entry_SYSCALL64_slow_path+0x25/0x25 Code: 01 48 85 db 0f 84 92 00 00 00 b8 01 00 00 00 d3 e0 85 c0 7e 3f 83 e8 01 48 8d 6f 10 45 31 e4 4c 8d 34 c5 08 00 00 00 49 8b 45 08 <4a> 8b 34 20 48 85 f6 74 13 48 8b 1e 48 89 ef e8 20 fa ff ff 48 RIP: free_ftrace_hash+0x7c/0x160 RSP: ffffc90001943db0 ---[ end trace 999b48216bf4b393 ]--- Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03ftrace: Replace (void *)1 with a meaningful macro name FTRACE_GRAPH_EMPTYSteven Rostedt (VMware)
When the set_graph_function or set_graph_notrace contains no records, a banner is displayed of either "#### all functions enabled ####" or "#### all functions disabled ####" respectively. To tell the seq operations to do this, (void *)1 is passed as a return value. Instead of using a hardcoded meaningless variable, define it as a macro. Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03ftrace: Create a slight optimization on searching the ftrace_hashSteven Rostedt (VMware)
This is a micro-optimization, but as it has to deal with a fast path of the function tracer, these optimizations can be noticed. The ftrace_lookup_ip() returns true if the given ip is found in the hash. If it's not found or the hash is NULL, it returns false. But there's some cases that a NULL hash is a true, and the ftrace_hash_empty() is tested before calling ftrace_lookup_ip() in those cases. But as ftrace_lookup_ip() tests that first, that adds a few extra unneeded instructions in those cases. A new static "always_inlined" function is created that does not perform the hash empty test. This most only be used by callers that do the check first anyway, as an empty or NULL hash could cause a crash if a lookup is performed on it. Also add kernel doc for the ftrace_lookup_ip() main function. Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03tracing: Add ftrace_hash_key() helper functionSteven Rostedt (VMware)
Replace the couple of use cases that has small logic to produce the ftrace function key id with a helper function. No need for duplicate code. Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-03prctl: propagate has_child_subreaper flag to every descendantPavel Tikhomirov
If process forks some children when it has is_child_subreaper flag enabled they will inherit has_child_subreaper flag - first group, when is_child_subreaper is disabled forked children will not inherit it - second group. So child-subreaper does not reparent all his descendants when their parents die. Having these two differently behaving groups can lead to confusion. Also it is a problem for CRIU, as when we restore process tree we need to somehow determine which descendants belong to which group and much harder - to put them exactly to these group. To simplify these we can add a propagation of has_child_subreaper flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child- subreaper to setup has_child_subreaper flag. In common cases when process like systemd first sets itself to be a child-subreaper and only after that forks its services, we will have zero-length list of descendants to walk. Testing with binary subtree of 2^15 processes prctl took < 0.007 sec and has shown close to linear dependency(~0.2 * n * usec) on lower numbers of processes. Moreover, I doubt someone intentionaly pre-forks the children whitch should reparent to init before becoming subreaper, because some our ancestor migh have had is_child_subreaper flag while forking our sub-tree and our childs will all inherit has_child_subreaper flag, and we have no way to influence it. And only way to check if we have no has_child_subreaper flag is to create some childs, kill them and see where they will reparent to. Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing seems to be the same. Optimize: a) When descendant already has has_child_subreaper flag all his subtree has it too already. * for a) to be true need to move has_child_subreaper inheritance under the same tasklist_lock with adding task to its ->real_parent->children as without it process can inherit zero has_child_subreaper, then we set 1 to it's parent flag, check that parent has no more children, and only after child with wrong flag is added to the tree. * Also make these inheritance more clear by using real_parent instead of current, as on clone(CLONE_PARENT) if current has is_child_subreaper and real_parent has no is_child_subreaper or has_child_subreaper, child will have has_child_subreaper flag set without actually having a subreaper in it's ancestors. b) When some descendant is child_reaper, it's subtree is in different pidns from us(original child-subreaper) and processes from other pidns will never reparent to us. So we can skip their(a,b) subtree from walk. v2: switch to walk_process_tree() general helper, move has_child_subreaper inheritance v3: remove csr_descendant leftover, change current to real_parent in has_child_subreaper inheritance v4: small commit message fix Fixes: ebec18a6d3aa ("prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision") Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-02-03introduce the walk_process_tree() helperOleg Nesterov
Add the new helper to walk the process tree, the next patch adds a user. Note that it visits the group leaders only, proc_visitor can do for_each_thread itself or we can trivially extend walk_process_tree() to do this. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-02-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
All merge conflicts were simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Five kernel fixes: - an mmap tracing ABI fix for certain mappings - a use-after-free fix, found via KASAN - three CPU hotplug related x86 PMU driver fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/uncore: Make package handling more robust perf/x86/intel/uncore: Clean up hotplug conversion fallout perf/x86/intel/rapl: Make package handling more robust perf/core: Fix PERF_RECORD_MMAP2 prot/flags for anonymous memory perf/core: Fix use-after-free bug
2017-02-02Merge branch 'cgroup/for-4.11-rdmacg' into cgroup/for-4.11Tejun Heo
Merge in to resolve conflicts in Documentation/cgroup-v2.txt. The conflicts are from multiple section additions and trivial to resolve. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-02-02cgroup: drop the matching uid requirement on migration for cgroup v2Tejun Heo
Along with the write access to the cgroup.procs or tasks file, cgroup has required the writer's euid, unless root, to match [s]uid of the target process or task. On cgroup v1, this is necessary because there's nothing preventing a delegatee from pulling in tasks or processes from all over the system. If a user has a cgroup subdirectory delegated to it, the user would have write access to the cgroup.procs or tasks file. If there are no further checks than file write access check, the user would be able to pull processes from all over the system into its subhierarchy which is clearly not the intended behavior. The matching [s]uid requirement partially prevents this problem by allowing a delegatee to pull in the processes that belongs to it. This isn't a sufficient protection however, because a user would still be able to jump processes across two disjoint sub-hierarchies that has been delegated to them. cgroup v2 resolves the issue by requiring the writer to have access to the common ancestor of the cgroup.procs file of the source and target cgroups. This confines each delegatee to their own sub-hierarchy proper and bases all permission decisions on the cgroup filesystem rather than having to pull in explicit uid matching. cgroup v2 has still been applying the matching [s]uid requirement just for historical reasons. On cgroup2, the requirement doesn't serve any purpose while unnecessarily complicating the permission model. Let's drop it. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-02-02cgroup, perf_event: make perf_event controller work on cgroup2 hierarchyTejun Heo
perf_event is a utility controller whose primary role is identifying cgroup membership to filter perf events; however, because it also tracks some per-css state, it can't be replaced by pure cgroup membership test. Mark the controller as implicitly enabled on the default hierarchy so that perf events can always be filtered based on cgroup v2 path as long as the controller is not mounted on a legacy hierarchy. "perf record" is updated accordingly so that it searches for both v1 and v2 hierarchies. A v1 hierarchy is used if perf_event is mounted on it; otherwise, it uses the v2 hierarchy. v2: Doc updated to reflect more flexible rebinding behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
2017-02-02blktrace: use existing disk debugfs directoryOmar Sandoval
We may already have a directory to put the blktrace stuff in if 1. The disk uses blk-mq 2. CONFIG_BLK_DEBUG_FS is enabled 3. We are tracing the whole disk and not a partition Instead of hardcoding this very specific case, let's use the new debugfs_lookup(). If the directory exists, we use it, otherwise we create one and clean it up later. Fixes: 07e4fead45e6 ("blk-mq: create debugfs directory tree") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02block: use same block debugfs directory for blk-mq and blktraceOmar Sandoval
When I added the blk-mq debugging information to debugfs, I didn't notice that blktrace also creates a "block" directory in debugfs. Make them use the same dentry, now created in the core block code. Based on a patch from Jens. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02blktrace: make do_blk_trace_setup() staticOmar Sandoval
This isn't used outside of blktrace.c anymore. Fixes: 62c2a7d969f3 ("block: push BKL into blktrace ioctls") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-02tracing/kprobes: Fix __init annotationArnd Bergmann
clang complains about "__init" being attached to a struct name: kernel/trace/trace_kprobe.c:1375:15: error: '__section__' attribute only applies to functions and global variables The intention must have been to mark the function as __init instead of the type, so move the attribute there. Link: http://lkml.kernel.org/r/20170201165826.2625888-1-arnd@arndb.de Fixes: f18f97ac43d7 ("tracing/kprobes: Add a helper method to return number of probe hits") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-02-02fs: Better permission checking for submountsEric W. Biederman
To support unprivileged users mounting filesystems two permission checks have to be performed: a test to see if the user allowed to create a mount in the mount namespace, and a test to see if the user is allowed to access the specified filesystem. The automount case is special in that mounting the original filesystem grants permission to mount the sub-filesystems, to any user who happens to stumble across the their mountpoint and satisfies the ordinary filesystem permission checks. Attempting to handle the automount case by using override_creds almost works. It preserves the idea that permission to mount the original filesystem is permission to mount the sub-filesystem. Unfortunately using override_creds messes up the filesystems ordinary permission checks. Solve this by being explicit that a mount is a submount by introducing vfs_submount, and using it where appropriate. vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let sget and friends know that a mount is a submount so they can take appropriate action. sget and sget_userns are modified to not perform any permission checks on submounts. follow_automount is modified to stop using override_creds as that has proven problemantic. do_mount is modified to always remove the new MS_SUBMOUNT flag so that we know userspace will never by able to specify it. autofs4 is modified to stop using current_real_cred that was put in there to handle the previous version of submount permission checking. cifs is modified to pass the mountpoint all of the way down to vfs_submount. debugfs is modified to pass the mountpoint all of the way down to trace_automount by adding a new parameter. To make this change easier a new typedef debugfs_automount_t is introduced to capture the type of the debugfs automount function. Cc: stable@vger.kernel.org Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid") Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-02-01sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in ↵Shile Zhang
milliseconds We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit: ce0dbbbb30ae ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice") ... which name suggests to users that it's in milliseconds, while in reality it's being set in milliseconds but the result is shown in jiffies. This is obviously confusing when HZ is not 1000, it makes it appear like the value set failed, such as HZ=100: root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms root# cat /proc/sys/kernel/sched_rr_timeslice_ms 10 Fix this to be milliseconds all around. Signed-off-by: Shile Zhang <shile.zhang@nokia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime, vtime: Return nsecs instead of cputime_t to accountFrederic Weisbecker
Turn the full dynticks cputime clock source to return nsec while keeping its very internals jiffies based for performance reasons. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-27-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Complete nsec conversion of tick based accountingFrederic Weisbecker
This is the final step toward tick based cputime conversion. Now that the whole cputime accounting engine accounts in nsecs, we can convert the very source of the cputime to account in nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-26-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Push time to account_system_time() in nsecsFrederic Weisbecker
This is one more step toward converting cputime accounting to pure nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-25-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Push time to account_idle_time() in nsecsFrederic Weisbecker
This is one more step toward converting cputime accounting to pure nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-24-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Push time to account_steal_time() in nsecsFrederic Weisbecker
This is one more step toward converting cputime accounting to pure nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-23-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Push time to account_user_time() in nsecsFrederic Weisbecker
This is one more step toward converting cputime accounting to pure nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-22-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01timers/itimer: Convert internal cputime_t units to nsecFrederic Weisbecker
Use the new nsec based cputime accessors as part of the whole cputime conversion from cputime_t to nsecs. Also convert itimers to use nsec based internal counters. This simplifies it and removes the whole game with error/inc_error which served to deal with cputime_t random granularity. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-20-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01timers/posix-timers: Convert internals to use nsecsFrederic Weisbecker
Use the new nsec based cputime accessors as part of the whole cputime conversion from cputime_t to nsecs. Also convert posix-cpu-timers to use nsec based internal counters to simplify it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-19-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01timers/posix-timers: Use TICK_NSEC instead of a dynamically ad-hoc ↵Frederic Weisbecker
calculated version Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-18-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Increment kcpustat directly on irqtime accountFrederic Weisbecker
The irqtime is accounted is nsecs and stored in cpu_irq_time.hardirq_time and cpu_irq_time.softirq_time. Once the accumulated amount reaches a new jiffy, this one gets accounted to the kcpustat. This was necessary when kcpustat was stored in cputime_t, which could at worst have jiffies granularity. But now kcpustat is stored in nsecs so this whole discretization game with temporary irqtime storage has become unnecessary. We can now directly account the irqtime to the kcpustat. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-17-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01signal: Convert obsolete cputime type to nsecsFrederic Weisbecker
Use the new nsec based cputime accessors as part of the whole cputime conversion from cputime_t to nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-16-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01tsacct: Convert obsolete cputime type to nsecsFrederic Weisbecker
Use the new nsec based cputime accessors as part of the whole cputime conversion from cputime_t to nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-15-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01delaycct: Convert obsolete cputime type to nsecsFrederic Weisbecker
Use the new nsec based cputime accessors as part of the whole cputime conversion from cputime_t to nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-14-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01acct: Convert obsolete cputime type to nsecsFrederic Weisbecker
Use the new nsec based cputime accessors as part of the whole cputime conversion from cputime_t to nsecs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-13-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Convert task/group cputime to nsecsFrederic Weisbecker
Now that most cputime readers use the transition API which return the task cputime in old style cputime_t, we can safely store the cputime in nsecs. This will eventually make cputime statistics less opaque and more granular. Back and forth convertions between cputime_t and nsecs in order to deal with cputime_t random granularity won't be needed anymore. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Introduce special task_cputime_t() API to return old-typed ↵Frederic Weisbecker
cputime This API returns a task's cputime in cputime_t in order to ease the conversion of cputime internals to use nsecs units instead. Blindly converting all cputime readers to use this API now will later let us convert more smoothly and step by step all these places to use the new nsec based cputime. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-7-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Convert guest time accounting to nsecs (u64)Frederic Weisbecker
cputime_t is being obsolete and replaced by nsecs units in order to make internal timestamps less opaque and more granular. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-6-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01sched/cputime: Convert kcpustat to nsecsFrederic Weisbecker
Kernel CPU stats are stored in cputime_t which is an architecture defined type, and hence a bit opaque and requiring accessors and mutators for any operation. Converting them to nsecs simplifies the code and is one step toward the removal of cputime_t in the core code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01time: Introduce jiffies64_to_nsecs()Frederic Weisbecker
This will be needed for the cputime_t to nsec conversion. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-2-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFYFrederic Weisbecker
NSEC_PER_JIFFY is an ad-hoc redefinition of TICK_NSEC. Let's rather use a unique and well maintained version. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01Merge branch 'linus' into sched/core, to pick up fixes and refresh the branchIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01exit: fix the setns() && PR_SET_CHILD_SUBREAPER interactionOleg Nesterov
find_new_reaper() checks same_thread_group(reaper, child_reaper) to prevent the cross-namespace reparenting but this is not enough if the exiting parent was injected by setns() + fork(). Suppose we have a process P in the root namespace and some namespace X. P does setns() to enter the X namespace, and forks the child C. C forks a grandchild G and exits. The grandchild G should be re-parented to X->child_reaper, but in this case the ->real_parent chain does not lead to ->child_reaper, so it will be wrongly reparanted to P's sub-reaper or a global init. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-01-31Merge tag 'trace-4.10-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "It was reported to me that the thread created by the hwlat tracer does not migrate after the first instance. I found that there was as small bug in the logic, and fixed it. It's minor, but should be fixed regardless. There's not much impact outside the hwlat tracer" * tag 'trace-4.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix hwlat kthread migration
2017-01-31Merge branch 'for-4.10-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "The cgroup creation path was getting the order of operations wrong and exposing cgroups which don't have their names set yet to controllers which can lead to NULL derefs. This contains the fix for the bug" * 'for-4.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: don't online subsystems before cgroup_name/path() are operational
2017-01-31tracing: Clean up the hwlat binding codeSteven Rostedt (VMware)
Instead of initializing the affinity of the hwlat kthread in the thread itself, simply set up the initial affinity at thread creation. This simplifies the code. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-31block: introduce blk_rq_is_passthroughChristoph Hellwig
This can be used to check for fs vs non-fs requests and basically removes all knowledge of BLOCK_PC specific from the block layer, as well as preparing for removing the cmd_type field in struct request. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31tracing: Fix hwlat kthread migrationSteven Rostedt (VMware)
The hwlat tracer creates a kernel thread at start of the tracer. It is pinned to a single CPU and will move to the next CPU after each period of running. If the user modifies the migration thread's affinity, it will not change after that happens. The original code created the thread at the first instance it was called, but later was changed to destroy the thread after the tracer was finished, and would not be created until the next instance of the tracer was established. The code that initialized the affinity was only called on the initial instantiation of the tracer. After that, it was not initialized, and the previous affinity did not match the current newly created one, making it appear that the user modified the thread's affinity when it did not, and the thread failed to migrate again. Cc: stable@vger.kernel.org Fixes: 0330f7aa8ee6 ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-01-31Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU changes from Paul E. McKenney: - Dynticks updates, consolidating open-coded counter accesses into a well-defined API - SRCU updates: Simplify algorithm, add formal verification - Documentation updates - Miscellaneous fixes - Torture-test updates Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-30livepatch/module: print notice of TAINT_LIVEPATCHJoe Lawrence
Add back the "tainting kernel with TAINT_LIVEPATCH" kernel log message that commit 2992ef29ae01 ("livepatch/module: make TAINT_LIVEPATCH module-specific") dropped. Now that it's a module-specific taint flag, include the module name. Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-01-30cgroup: misc cleanupsTejun Heo
* cgrp_dfl_implicit_ss_mask is ulong instead of u16 unlike other ss_masks. Make it a u16. * Move have_canfork_callback together with other callback ss_masks. Signed-off-by: Tejun Heo <tj@kernel.org>
2017-01-30Merge branch 'iommu/guest-msi' of ↵Joerg Roedel
git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/core
2017-01-30irqdomain: Avoid activating interrupts more than onceMarc Zyngier
Since commit f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early"), we can end-up activating a PCI/MSI twice (once at allocation time, and once at startup time). This is normally of no consequences, except that there is some HW out there that may misbehave if activate is used more than once (the GICv3 ITS, for example, uses the activate callback to issue the MAPVI command, and the architecture spec says that "If there is an existing mapping for the EventID-DeviceID combination, behavior is UNPREDICTABLE"). While this could be worked around in each individual driver, it may make more sense to tackle the issue at the core level. In order to avoid getting in that situation, let's have a per-interrupt flag to remember if we have already activated that interrupt or not. Fixes: f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early") Reported-and-tested-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1484668848-24361-1-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>