summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2013-08-29cgroup: fix rmdir EBUSY regression in 3.11Hugh Dickins
On 3.11-rc we are seeing cgroup directories left behind when they should have been removed. Here's a trivial reproducer: cd /sys/fs/cgroup/memory mkdir parent parent/child; rmdir parent/child parent rmdir: failed to remove `parent': Device or resource busy It's because cgroup_destroy_locked() (step 1 of destruction) leaves cgroup on parent's children list, letting cgroup_offline_fn() (step 2 of destruction) remove it; but step 2 is run by work queue, which may not yet have removed the children when parent destruction checks the list. Fix that by checking through a non-empty list of children: if every one of them has already been marked CGRP_DEAD, then it's safe to proceed: those children are invisible to userspace, and should not obstruct rmdir. (I didn't see any reason to keep the cgrp->children checks under the unrelated css_set_lock, so moved them out.) tj: Flattened nested ifs a bit and updated comment so that it's correct on both for-3.11-fixes and for-3.12. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-08-29workqueue: cond_resched() after processing each work itemTejun Heo
If !PREEMPT, a kworker running work items back to back can hog CPU. This becomes dangerous when a self-requeueing work item which is waiting for something to happen races against stop_machine. Such self-requeueing work item would requeue itself indefinitely hogging the kworker and CPU it's running on while stop_machine would wait for that CPU to enter stop_machine while preventing anything else from happening on all other CPUs. The two would deadlock. Jamie Liu reports that this deadlock scenario exists around scsi_requeue_run_queue() and libata port multiplier support, where one port may exclude command processing from other ports. With the right timing, scsi_requeue_run_queue() can end up requeueing itself trying to execute an IO which is asked to be retried while another device has an exclusive access, which in turn can't make forward progress due to stop_machine. Fix it by invoking cond_resched() after executing each work item. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jamie Liu <jamieliu@google.com> References: http://thread.gmane.org/gmane.linux.kernel/1552567 Cc: stable@vger.kernel.org -- kernel/workqueue.c | 9 +++++++++ 1 file changed, 9 insertions(+)
2013-08-29Merge branch 'linus' into perf/coreIngo Molnar
Pick up the latest upstream fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-29padata - Register hotcpu notifier after initializationRichard Weinberger
padata_cpu_callback() takes pinst->lock, to avoid taking an uninitialized lock, register the notifier after it's initialization. Signed-off-by: Richard Weinberger <richard@nod.at> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-08-29padata - share code between CPU_ONLINE and CPU_DOWN_FAILED, same to ↵Chen Gang
CPU_DOWN_PREPARE and CPU_UP_CANCELED Share code between CPU_ONLINE and CPU_DOWN_FAILED, same to CPU_DOWN_PREPARE and CPU_UP_CANCELED. It will fix 2 bugs: "not check the return value of __padata_remove_cpu() and __padata_add_cpu()". "need add 'break' between CPU_UP_CANCELED and CPU_DOWN_FAILED". Signed-off-by: Chen Gang <gang.chen@asianux.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-08-28timer_list: correct the iterator for timer_listNathan Zimmer
Correct an issue with /proc/timer_list reported by Holger. When reading from the proc file with a sufficiently small buffer, 2k so not really that small, there was one could get hung trying to read the file a chunk at a time. The timer_list_start function failed to account for the possibility that the offset was adjusted outside the timer_list_next. Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Reported-by: Holger Hans Peter Freyther <holger@freyther.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Berke Durak <berke.durak@xiphos.com> Cc: Jeff Layton <jlayton@redhat.com> Tested-by: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> # 3.10.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-27cgroup: fix cgroup_css() invocation in css_from_id()Tejun Heo
ca8bdcaff0 ("cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys") missed one conversion in css_from_id(), which was newly added. As css_from_id() doesn't have any user yet, this doesn't break anything other than generating a build warning. Convert it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: kbuild test robot <fengguang.wu@intel.com>
2013-08-27Rename nsproxy.pid_ns to nsproxy.pid_ns_for_childrenAndy Lutomirski
nsproxy.pid_ns is *not* the task's pid namespace. The name should clarify that. This makes it more obvious that setns on a pid namespace is weird -- it won't change the pid namespace shown in procfs. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-26userns: Better restrictions on when proc and sysfs can be mountedEric W. Biederman
Rely on the fact that another flavor of the filesystem is already mounted and do not rely on state in the user namespace. Verify that the mounted filesystem is not covered in any significant way. I would love to verify that the previously mounted filesystem has no mounts on top but there are at least the directories /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly for other filesystems to mount on top of. Refactor the test into a function named fs_fully_visible and call that function from the mount routines of proc and sysfs. This makes this test local to the filesystems involved and the results current of when the mounts take place, removing a weird threading of the user namespace, the mount namespace and the filesystems themselves. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-26kernel/nsproxy.c: Improving a snippet of code.Raphael S.Carvalho
It seems GCC generates a better code in that way, so I changed that statement. Btw, they have the same semantic, so I'm sending this patch due to performance issues. Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: Raphael S.Carvalho <raphael.scarv@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2013-08-27Merge branch 'pm-sleep'Rafael J. Wysocki
* pm-sleep: PM / Sleep: new trace event to print device suspend and resume times PM / Sleep: increase ftrace coverage in suspend/resume
2013-08-27Merge branch 'acpi-processor'Rafael J. Wysocki
* acpi-processor: ACPI / processor: Acquire writer lock to update CPU maps ACPI / processor: Remove acpi_processor_get_limit_info()
2013-08-26cgroup: make cgroup_write_event_control() use css_from_dir() instead of ↵Tejun Heo
__d_cgrp() cgroup_event will be moved to its only user - memcg. Replace __d_cgrp() usage with css_from_dir(), which is already exported. This also simplifies the code a bit. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
2013-08-26cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroupTejun Heo
Currently, each registered cgroup_event holds an extra reference to the cgroup. This is a bit weird as events are subsystem specific and will also be incorrect in the planned unified hierarchy as css (cgroup_subsys_state) may come and go dynamically across the lifetime of a cgroup. Holding onto cgroup won't prevent the target css from going away. Update cgroup_event to hold onto the css the traget file belongs to instead of cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
2013-08-26cgroup: implement CFTYPE_NO_PREFIXTejun Heo
When cgroup files are created, cgroup core automatically prepends the name of the subsystem as prefix. This patch adds CFTYPE_NO_ which disables the automatic prefix. This is to work around historical baggages and shouldn't be used for new files. This will be used to move "cgroup.event_control" from cgroup core to memcg. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Glauber Costa <glommer@gmail.com>
2013-08-26cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsysTejun Heo
cgroup_css() is no longer used in hot paths. Make it take struct cgroup_subsys * and allow the users to specify NULL subsys to obtain the dummy_css. This removes open-coded NULL subsystem testing in a couple users and generally simplifies the code. After this patch, css_from_dir() also allows NULL @ss and returns the matching dummy_css. This behavior change doesn't affect its only user - perf. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
2013-08-26cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntaxTejun Heo
cgroup_css_from_dir() will grow another user. In preparation, make the following changes. * All css functions are prefixed with just "css_", rename it to css_from_dir(). * Take dentry * instead of file * as dentry is what ultimately identifies a cgroup and file may not always be available. Note that the function now checkes whether @dentry->d_inode is NULL as the caller now may specify a negative dentry. * Make it take cgroup_subsys * instead of integer subsys_id. This simplifies the function and allows specifying no subsystem for cgroup->dummy_css. * Make return section a bit less verbose. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com>
2013-08-23workqueue: convert bus code to use dev_groupsGreg Kroah-Hartman
The dev_attrs field of struct bus_type is going away soon, dev_groups should be used instead. This converts the workqueue bus code to use the correct field. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-23Merge branch 'for-3.11-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "A late fix for cgroup. This fixes a behavior regression visible to userland which was created by a commit merged during -rc1. While the behavior change isn't too likely to be noticeable, the fix is relatively low risk and we'll need to backport it through -stable anyway if the bug gets released" * 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: fix a regression in validating config change
2013-08-22tracing: Make tracing_cpumask available for all instancesAlexander Z Lam
Allow tracer instances to disable tracing by cpu by moving the static global tracing_cpumask into trace_array. Link: http://lkml.kernel.org/r/921622317f239bfc2283cac2242647801ef584f2.1375980149.git.azl@google.com Cc: Vaibhav Nagarnaik <vnagarnaik@google.com> Cc: David Sharp <dhsharp@google.com> Cc: Alexander Z Lam <lambchop468@gmail.com> Signed-off-by: Alexander Z Lam <azl@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-08-21tracing: Kill the !CONFIG_MODULES code in trace_events.cOleg Nesterov
Move trace_module_nb under CONFIG_MODULES and kill the dummy trace_module_notify(). Imho it doesn't make sense to define "struct notifier_block" and its .notifier_call just to avoid "ifdef" in event_trace_init(), and all other !CONFIG_MODULES code has already gone away. Link: http://lkml.kernel.org/r/20130731173137.GA31043@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-08-21tracing: Don't pass file_operations array to event_create_dir()Oleg Nesterov
Now that event_create_dir() and __trace_add_new_event() always use the same file_operations we can kill these arguments and simplify the code. Link: http://lkml.kernel.org/r/20130731173135.GA31040@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-08-21tracing: Kill trace_create_file_ops() and friendsOleg Nesterov
trace_create_file_ops() allocates the copy of id/filter/format/enable file_operations to set "f_op->owner = mod" for fops_get(). However after the recent changes there is no reason to prevent rmmod even if one of these files is opened. A file operation can do nothing but fail after remove_event_file_dir() clears ->i_private for every file removed by trace_module_remove_events(). Kill "struct ftrace_module_file_ops" and fix the compilation errors. Link: http://lkml.kernel.org/r/20130731173132.GA31033@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-08-21tracing/syscalls: Annotate raw_init function with __initLi Zefan
init_syscall_trace() can only be called during kernel bootup only, so we can mark it and the functions it calls as __init. Link: http://lkml.kernel.org/r/51528E89.6080508@huawei.com Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-08-21workqueue: Fix manage_workers() RETURNS descriptionLibin
No functional change. The comment of function manage_workers() RETURNS description is obvious wrong, same as the CONTEXT. Fix it. Signed-off-by: Libin <huawei.libin@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-08-21workqueue: Comment correction in file headerLibin
No functional change. There are two worker pools for each cpu in current implementation (one for normal work items and the other for high priority ones). tj: Whitespace adjustments. Signed-off-by: Libin <huawei.libin@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-08-21cpuset: fix a regression in validating config changeLi Zefan
It's not allowed to clear masks of a cpuset if there're tasks in it, but it's broken: # mkdir /cgroup/sub # echo 0 > /cgroup/sub/cpuset.cpus # echo 0 > /cgroup/sub/cpuset.mems # echo $$ > /cgroup/sub/tasks # echo > /cgroup/sub/cpuset.cpus (should fail) This bug was introduced by commit 88fa523bff295f1d60244a54833480b02f775152 ("cpuset: allow to move tasks to empty cpusets"). tj: Dropped temp bool variables and nestes the conditionals directly. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-08-20rcu: Simplify _rcu_barrier() processingPaul E. McKenney
This commit drops an unneeded ACCESS_ONCE() and simplifies an "our work is done" check in _rcu_barrier(). This applies feedback from Linus (https://lkml.org/lkml/2013/7/26/777) that he gave to similar code in an unrelated patch. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> [ paulmck: Fix comment to match code, reported by Lai Jiangshan. ]
2013-08-20rcu: Make rcutorture emit online failures if verbosePaul E. McKenney
Although rcutorture counts CPU-hotplug online failures, it does not explicitly record which CPUs were having trouble coming online. This commit therefore emits a console message when online failure occurs. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-08-20rcu: Remove unused variable from rcu_torture_writer()Paul E. McKenney
The oldbatch variable in rcu_torture_writer() is stored to, but never loaded from. This commit therefore removes it. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-08-20rcu: Sort rcutorture module parametersPaul E. McKenney
There are getting to be too many module parameters to permit the current semi-random order, so this patch orders them. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-08-20rcu: Increase rcutorture test coveragePaul E. McKenney
Currently, rcutorture has separate torture_types to test synchronous, asynchronous, and expedited grace-period primitives. This has two disadvantages: (1) Three times the number of runs to cover the combinations and (2) Little testing of concurrent combinations of the three options. This commit therefore adds a pair of module parameters that control normal and expedited state, with the default being both types, randomly selected, by the fakewriter processes, thus reducing source-code size and increasing test coverage. In addtion, the writer task switches between asynchronous-normal and expedited grace-period primitives driven by the same pair of module parameters. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-08-20rcu: Add duplicate-callback tests to rcutorturePaul E. McKenney
This commit adds a object_debug option to rcutorture to allow the debug-object-based checks for duplicate call_rcu() invocations to be deterministically tested. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> [ paulmck: Banish mid-function ifdef, more or less per Josh Triplett. ] Reviewed-by: Josh Triplett <josh@joshtriplett.org> [ paulmck: Improve duplicate-callback test, per Lai Jiangshan. ]
2013-08-20workqueue: fix some scripts/kernel-doc warningsYacine Belkadi
When building the htmldocs (in verbose mode), scripts/kernel-doc reports the following type of warnings: Warning(kernel/workqueue.c:653): No description found for return value of 'get_work_pool' Fix them by: - Using "Return:" sections to introduce descriptions of return values - Adding some missing descriptions Signed-off-by: Yacine Belkadi <yacine.belkadi.1@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-20kernel/params.c: use scnprintf() instead of sprintf()Chen Gang
For some strings (e.g. version string), they are permitted to be larger than PAGE_SIZE (although meaningless), so recommend to use scnprintf() instead of sprintf(). Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-08-20kernel/module.c: use scnprintf() instead of sprintf()Chen Gang
For some strings, they are permitted to be larger than PAGE_SIZE, so need use scnprintf() instead of sprintf(), or it will cause issue. One case is: if a module version is crazy defined (length more than PAGE_SIZE), 'modinfo' command is still OK (print full contents), but for "cat /sys/modules/'modname'/version", will cause issue in kernel. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-08-20module: Add NOARG flag for ops with param_set_bool_enable_only() set functionSteven Rostedt
The ops that uses param_set_bool_enable_only() as its set function can easily handle being used without an argument. There's no reason to fail the loading of the module if it does not have one. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-08-20module: Add flag to allow mod params to have no argumentsSteven Rostedt
Currently the params.c code allows only two "set" functions to have no arguments. If a parameter does not have an argument, then it looks at the set function and tests if it is either param_set_bool() or param_set_bint(). If it is not one of these functions, then it fails the loading of the module. But there may be module parameters that have different set functions and still allow no arguments. But unless each of these cases adds their function to the if statement, it wont be allowed to have no arguments. This method gets rather messing and does not scale. Instead, introduce a flags field to the kernel_param_ops, where if the flag KERNEL_PARAM_FL_NOARG is set, the parameter will not fail if it does not contain an argument. It will be expected that the corresponding set function can handle a NULL pointer as "val". Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-08-20module: fix sprintf format specifier in param_get_byte()Christoph Jaeger
In param_get_byte(), to which the macro STANDARD_PARAM_DEF(byte, ...) expands, "%c" is used to print an unsigned char. So it gets printed as a character what is not intended here. Use "%hhu" instead. [Rusty: note drivers which would be effected: drivers/net/wireless/cw1200/main.c drivers/ntb/ntb_transport.c:68 drivers/scsi/lpfc/lpfc_attr.c drivers/usb/atm/speedtch.c drivers/usb/gadget/g_ffs.c ] Acked-by: Jon Mason <jon.mason@intel.com> (for ntb) Acked-by: Michal Nazarewicz <mina86@mina86.com> (for g_ffs.c) Signed-off-by: Christoph Jaeger <christophjaeger@linux.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-08-19Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Three small fixlets" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: nohz: fix compile warning in tick_nohz_init() nohz: Do not warn about unstable tsc unless user uses nohz_full sched_clock: Fix integer overflow
2013-08-19kernel: fix new kernel-doc warning in wait.cRandy Dunlap
Fix new kernel-doc warnings in kernel/wait.c: Warning(kernel/wait.c:374): No description found for parameter 'p' Warning(kernel/wait.c:374): Excess function parameter 'word' description in 'wake_up_atomic_t' Warning(kernel/wait.c:374): Excess function parameter 'bit' description in 'wake_up_atomic_t' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-19cgroup: fix cgroup_write_event_control()Tejun Heo
81eeaf0411 ("cgroup: make cftype->[un]register_event() deal with cgroup_subsys_state inst ead of cgroup") updated the cftype event methods to take @css (cgroup_subsys_state) instead of @cgroup; however, it incorrectly used @css passed to cgroup_write_event_control(), which the dummy_css for the cgroup as the file is a cgroup core file. This leads to oops on event registration. Fix it by using the css matching the event target file. Note that cgroup_write_event_control() now disallows cgroup core files from being event sources. This is for simplicity and doesn't matter as cgroup_event will be moved and made specific to memcg. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-19cgroup: fix subsystem file accesses on the root cgroupTejun Heo
105347ba5 ("cgroup: make cgroup_file_open() rcu_read_lock() around cgroup_css() and add cfent->css") added cfent->css to cache the associted cgroup_subsys_state across file operations. A cfent is associated with single css throughout its lifetime and the origimal commit initialized the cache pointer during cgroup_add_file() and verified that it matches the actual one in cgroup_file_open(). While this works fine for !root cgroups, it's broken for root cgroups as files in a root cgroup are created before the css's are associated with the cgroup and thus cgroup_css() call in cgroup_add_file() returns NULL associating all cfents in the root cgroup with NULL css. This makes cgroup_file_open() trigger WARN and fail with -ENODEV for all !core subsystem files in the root cgroups. There's no reason to initialize cfent->css separately from cgroup_add_file(). As the association never changes, cgroup_file_open() can set it unconditionally every time and containing the logic in cgroup_file_open() makes more sense anyway as the only reason it's necessary is file->private_data being already occupied. Fix it by setting cfent->css unconditionally from cgroup_file_open(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-19cgroup: change cgroup_from_id() to css_from_id()Li Zefan
Now we want cgroup core to always provide the css to use to the subsystems, so change this API to css_from_id(). Uninline css_from_id(), because it's getting bigger and cgroup_css() has been unexported. While at it, remove the #ifdef, and shuffle the order of the args. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-08-19generic-ipi/locking: Fix misleading smp_call_function_any() descriptionXie XiuQi
Fix locking description: after commit 8969a5ede0f9e17da4b9437 ("generic-ipi: remove kmalloc()"), wait = 0 can be guaranteed because we don't kmalloc() anymore. Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com> Cc: Sheng Yang <sheng@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Link: http://lkml.kernel.org/r/51F5E6F8.1000801@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-18nohz_full: Add full-system-idle arguments to APIPaul E. McKenney
This commit adds an isidle and jiffies argument to force_qs_rnp(), dyntick_save_progress_counter(), and rcu_implicit_dynticks_qs() to enable RCU's force-quiescent-state process to check for full-system idle. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ] Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-08-18nohz_full: Add full-system idle states and variablesPaul E. McKenney
This commit adds control variables and states for full-system idle. The system will progress through the states in numerical order when the system is fully idle (other than the timekeeping CPU), and reset down to the initial state if any non-timekeeping CPU goes non-idle. The current state is kept in full_sysidle_state. One flavor of RCU will be in charge of driving the state machine, defined by rcu_sysidle_state. This should be the busiest flavor of RCU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-08-18nohz_full: Add per-CPU idle-state trackingPaul E. McKenney
This commit adds the code that updates the rcu_dyntick structure's new fields to track the per-CPU idle state based on interrupts and transitions into and out of the idle loop (NMIs are ignored because NMI handlers cannot cleanly read out the time anyway). This code is similar to the code that maintains RCU's idea of per-CPU idleness, but differs in that RCU treats CPUs running in user mode as idle, where this new code does not. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-08-18nohz_full: Add rcu_dyntick data for scalable detection of all-idle statePaul E. McKenney
This commit adds fields to the rcu_dyntick structure that are used to detect idle CPUs. These new fields differ from the existing ones in that the existing ones consider a CPU executing in user mode to be idle, where the new ones consider CPUs executing in user mode to be busy. The handling of these new fields is otherwise quite similar to that for the exiting fields. This commit also adds the initialization required for these fields. So, why is usermode execution treated differently, with RCU considering it a quiescent state equivalent to idle, while in contrast the new full-system idle state detection considers usermode execution to be non-idle? It turns out that although one of RCU's quiescent states is usermode execution, it is not a full-system idle state. This is because the purpose of the full-system idle state is not RCU, but rather determining when accurate timekeeping can safely be disabled. Whenever accurate timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one CPU must keep the scheduling-clock tick going. If even one CPU is executing in user mode, accurate timekeeping is requires, particularly for architectures where gettimeofday() and friends do not enter the kernel. Only when all CPUs are really and truly idle can accurate timekeeping be disabled, allowing all CPUs to turn off the scheduling clock interrupt, thus greatly improving energy efficiency. This naturally raises the question "Why is this code in RCU rather than in timekeeping?", and the answer is that RCU has the data and infrastructure to efficiently make this determination. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-08-18nohz_full: Add Kconfig parameter for scalable detection of all-idle statePaul E. McKenney
At least one CPU must keep the scheduling-clock tick running for timekeeping purposes whenever there is a non-idle CPU. However, with the new nohz_full adaptive-idle machinery, it is difficult to distinguish between all CPUs really being idle as opposed to all non-idle CPUs being in adaptive-ticks mode. This commit therefore adds a Kconfig parameter as a first step towards enabling a scalable detection of full-system idle state. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> [ paulmck: Update help text per Frederic Weisbecker. ] Reviewed-by: Josh Triplett <josh@joshtriplett.org>