summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2013-04-07cgroup: kill cgroup_[un]lock()Tejun Heo
Now that locking interface is unexported, there's no reason to keep around these thin wrappers. Kill them and use mutex operations directly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07cgroup: unexport locking interface and cgroup_attach_task()Tejun Heo
Now that all external cgroup_lock() users are gone, we can finally unexport the locking interface and prevent future abuse of cgroup_mutex. Make cgroup_[un]lock() and cgroup_lock_live_group() static. Also, cgroup_attach_task() doesn't have any user left and can't be used without locking interface anyway. Make it static too. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07cgroup: relocate cgroup_lock_live_group() and cgroup_attach_task_all()Tejun Heo
cgroup_lock_live_group() and cgroup_attach_task() are scheduled to be made static. Relocate the former and cgroup_attach_task_all() so that we don't need forward declarations. This patch is pure relocation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-07cgroup, cpuset: replace move_member_tasks_to_cpuset() with ↵Tejun Heo
cgroup_transfer_tasks() When a cpuset becomes empty (no CPU or memory), its tasks are transferred with the nearest ancestor with execution resources. This is implemented using cgroup_scan_tasks() with a callback which grabs cgroup_mutex and invokes cgroup_attach_task() on each task. Both cgroup_mutex and cgroup_attach_task() are scheduled to be unexported. Implement cgroup_transfer_tasks() in cgroup proper which is essentially the same as move_member_tasks_to_cpuset() except that it takes cgroups instead of cpusets and @to comes before @from like normal functions with those arguments, and replace move_member_tasks_to_cpuset() with it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2013-04-05PM / sleep: invalidate TEST_CPUS and TEST_CORE support for freeze stateZhang Rui
freeze state is a software suspend state that does not run into low-level platform callbacks which may interact with BIOS. And freeze state does not need to disable the processors. But the current pm_test support misleads users because users can enter freeze state with pm_test set to TEST_CPUS/TEST_CORE, while this pm_test setting never takes actions. So, invalidate TEST_CPUS/TEST_CORE for freeze state in this patch. Then users will get an error instead, when trying to enter freeze state with pm_test mode set to TEST_CPUS/TEST_CORE. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-04-05PM / sleep: add TEST_PLATFORM support for freeze stateZhang Rui
Invoke freeze_enter() after suspend_test(TEST_PLATFORM) being invoked. So when setting /sys/power/pm_test to "platform", it can be used to check if freeze state is working well after all devices are suspended and before processors are blocked, Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-04-05Merge tag 'drm-intel-next-2013-03-23' of ↵Dave Airlie
git://people.freedesktop.org/~danvet/drm-intel into drm-next Daniel writes: Highlights: - Imre's for_each_sg_pages rework (now also with the stolen mem backed case fixed with a hack) plus the drm prime sg list coalescing patch from Rahul Sharma. I have some follow-up cleanups pending, already acked by Andrew Morton. - Some prep-work for the crazy no-pch/display-less platform by Ben. - Some vlv patches, by far not all (Jesse et al). - Clean up the HDMI/SDVO #define confusion (Paulo) - gen2-4 vblank fixes from Ville. - Unclaimed register warning fixes for hsw (Paulo). More still to come ... - Complete pageflips which have been stuck in a gpu hang, should prevent stuck gl compositors (Ville). - pm patches for vt-switchless resume (Jesse). Note that the i915 enabling is not (yet) included, that took a bit longer to settle. PM patches are acked by Rafael Wysocki. - Minor fixlets all over from various people. * tag 'drm-intel-next-2013-03-23' of git://people.freedesktop.org/~danvet/drm-intel: (79 commits) drm/i915: Implement WaSwitchSolVfFArbitrationPriority drm/i915: Set the VIC in AVI infoframe for SDVO drm/i915: Kill a strange comment about DPMS functions drm/i915: Correct sandybrige overclocking drm/i915: Introduce GEN7_FEATURES for device info drm/i915: Move num_pipes to intel info drm/i915: fixup pd vs pt confusion in gen6 ppgtt code style nit: Align function parameter continuation properly. drm/i915: VLV doesn't have HDMI on port C drm/i915: DSPFW and BLC regs are in the display offset range drm/i915: set conservative clock gating values on VLV v2 drm/i915: fix WaDisablePSDDualDispatchEnable on VLV v2 drm/i915: add more VLV IDs drm/i915: use VLV DIP routines on VLV v2 drm/i915: add media well to VLV force wake routines v2 drm/i915: don't use plane pipe select on VLV drm: modify pages_to_sg prime helper to create optimized SG table drm/i915: use for_each_sg_page for setting up the gtt ptes drm/i915: create compact dma scatter lists for gem objects drm/i915: handle walking compact dma scatter lists ...
2013-04-04timekeeping: Shorten seq_count regionThomas Gleixner
Shorten the seqcount write hold region to the actual update of the timekeeper and the related data (e.g vsyscall). On a contemporary x86 system this reduces the maximum latencies on Preempt-RT from 8us to 4us on the non-timekeeping cores. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04timekeeping: Implement a shadow timekeeperThomas Gleixner
Use the shadow timekeeper to do the update_wall_time() adjustments and then copy it over to the real timekeeper. Keep the shadow timekeeper in sync when updating stuff outside of update_wall_time(). This allows us to limit the timekeeper_seq hold time to the update of the real timekeeper and the vsyscall data in the next patch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04timekeeping: Delay update of clock->cycle_lastThomas Gleixner
For calculating the new timekeeper values store the new cycle_last value in the timekeeper and update the clock->cycle_last just when we actually update the new values. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04timekeeping: Store cycle_last value in timekeeper struct as wellThomas Gleixner
For implementing a shadow timekeeper and a split calculation/update region we need to store the cycle_last value in the timekeeper and update the value in the clocksource struct only in the update region. Add the extra storage to the timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04ntp: Remove ntp_lock, using the timekeeping locks to protect ntp stateJohn Stultz
In order to properly handle the NTP state in future changes to the timekeeping lock management, this patch moves the management of all of the ntp state under the timekeeping locks. This allows us to remove the ntp_lock. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04timekeeping: Simplify tai updating from do_adjtimexJohn Stultz
Since we are taking the timekeeping locks, just go ahead and update any tai change directly, rather then dropping the lock and calling a function that will just take it again. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04timekeeping: Hold timekeepering locks in do_adjtimex and hardppsJohn Stultz
In moving the NTP state to be protected by the timekeeping locks, be sure to acquire the timekeeping locks prior to calling ntp functions. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex()John Stultz
Since ADJ_SETOFFSET adjusts the timekeeping state, process it as part of the top level do_adjtimex() function in timekeeping.c. This avoids deadlocks that could occur once we change the ntp locking rules. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04ntp: Rework do_adjtimex to take timespec and tai argumentsJohn Stultz
In order to change the locking rules, we need to provide the timespec and tai values rather then having the ntp logic acquire these values itself. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04ntp: Move timex validation to timekeeping do_adjtimex call.John Stultz
Move logic that does not need the ntp state to be done in the timekeeping do_adjtimex() call. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04ntp: Move do_adjtimex() and hardpps() functions to timekeeping.cJohn Stultz
In preparation for changing the ntp locking rules, move do_adjtimex and hardpps accessor functions to timekeeping.c, but keep the code logic in ntp.c. This patch also introduces a ntp_internal.h file so timekeeping specific interfaces of ntp.c can be more limitedly shared with timekeeping.c. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04ntp: Split out timex validation from do_adjtimexJohn Stultz
Split out the timex validation done in do_adjtimex into a separate function. This will help simplify logic in following patches. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-04-04workqueue: avoid false negative WARN_ON() in destroy_workqueue()Lai Jiangshan
destroy_workqueue() performs several sanity checks before proceeding with destruction of a workqueue. One of the checks verifies that refcnt of each pwq (pool_workqueue) is over 1 as at that point there should be no in-flight work items and the only holder of pwq refs is the workqueue itself. This worked fine as a workqueue used to hold only one reference to its pwqs; however, since 4c16bd327c ("workqueue: implement NUMA affinity for unbound workqueues"), a workqueue may hold multiple references to its default pwq triggering this sanity check spuriously. Fix it by not triggering the pwq->refcnt assertion on default pwqs. An example spurious WARN trigger follows. WARNING: at kernel/workqueue.c:4201 destroy_workqueue+0x6a/0x13e() Hardware name: 4286C12 Modules linked in: sdhci_pci sdhci mmc_core usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video Pid: 361, comm: umount Not tainted 3.9.0-rc5+ #29 Call Trace: [<c04314a7>] warn_slowpath_common+0x7c/0x93 [<c04314e0>] warn_slowpath_null+0x22/0x24 [<c044796a>] destroy_workqueue+0x6a/0x13e [<c056dc01>] ext4_put_super+0x43/0x2c4 [<c04fb7b8>] generic_shutdown_super+0x4b/0xb9 [<c04fb848>] kill_block_super+0x22/0x60 [<c04fb960>] deactivate_locked_super+0x2f/0x56 [<c04fc41b>] deactivate_super+0x2e/0x31 [<c050f1e6>] mntput_no_expire+0x103/0x108 [<c050fdce>] sys_umount+0x2a2/0x2c4 [<c050fe0e>] sys_oldumount+0x1e/0x20 [<c085ba4d>] sysenter_do_call+0x12/0x38 tj: Rewrote description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com>
2013-04-04uprobes: Change write_opcode() to use copy_*page()Oleg Nesterov
Change write_opcode() to use copy_highpage() + copy_to_page() and simplify the code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-04uprobes: Introduce copy_to_page()Oleg Nesterov
Extract the kmap_atomic/memcpy/kunmap_atomic code from xol_get_insn_slot() into the new simple helper, copy_to_page(). It will have more users soon. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-04uprobes: Kill the unnecesary filp != NULL check in __copy_insn()Oleg Nesterov
__copy_insn(filp) can only be called after valid_vma() returns T, vma->vm_file passed as "filp" can not be NULL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-04uprobes: Change __copy_insn() to use copy_from_page()Oleg Nesterov
Change __copy_insn() to use copy_from_page() and simplify the code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-04uprobes: Turn copy_opcode() into copy_from_page()Oleg Nesterov
No functional changes. Rename copy_opcode() into copy_from_page() and add the new "int len" argument to make it more more generic for the new users. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-04uprobes: Add trap variant helperAnanth N Mavinakayanahalli
Some architectures like powerpc have multiple variants of the trap instruction. Introduce an additional helper is_trap_insn() for run-time handling of non-uprobe traps on such architectures. While there, change is_swbp_at_addr() to is_trap_at_addr() for reading clarity. With this change, the uprobe registration path will supercede any trap instruction inserted at the requested location, while taking care of delivering the SIGTRAP for cases where the trap notification came in for an address without a uprobe. See [1] for a more detailed explanation. [1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2013-March/104771.html This change was suggested by Oleg Nesterov. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2013-04-04uprobes: Use file_inode()Oleg Nesterov
Cleanup. Now that we have f_inode/file_inode() we can use it instead of vm_file->f_mapping->host. This should not make any difference for uprobes, but in theory this change is more correct. We use this inode as a key, to compare it with uprobe->inode set by uprobe_register(inode), and the caller uses d_inode. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
2013-04-03cgroup: remove unused parameter in cgroup_task_migrate().Kevin Wilson
This patch removes unused parameter from cgroup_task_migrate(). Signed-off-by: Kevin Wilson <wkevils@gmail.com> Acked-by: Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-03nohz: Print final full dynticks CPUs range on bootFrederic Weisbecker
Given that we apply a few restrictions on the full dynticks CPUs range (keep an online timekeeper oustide the range, then in the future have the range be an RCU nocb CPUs subset), let's print the final resulting range of full dynticks CPUs to the user so that he knows what's really going to run. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-03nohz: Pack nohz Kconfig option in a menu of choicesFrederic Weisbecker
Now the user has the choice between three implementations of the timer tick: * Static periodic tick * Idle dynticks * Full dynticks At least for now, these are mutually exclusive choices, so let's rely on the proper Kconfig feature to display these to the user. A new entry CONFIG_NO_HZ_IDLE is created and the old CONFIG_NO_HZ maps to it for config file backward compatibility. The old name was too general now that we have more granular dynticks implementations. While at it, add some explanation to help the user on his decision between the 3 entries. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-03nohz: Rename CONFIG_NO_HZ to CONFIG_NO_HZ_COMMONFrederic Weisbecker
We are planning to convert the dynticks Kconfig options layout into a choice menu. The user must be able to easily pick any of the following implementations: constant periodic tick, idle dynticks, full dynticks. As this implies a mutual exclusion, the two dynticks implementions need to converge on the selection of a common Kconfig option in order to ease the sharing of a common infrastructure. It would thus seem pretty natural to reuse CONFIG_NO_HZ to that end. It already implements all the idle dynticks code and the full dynticks depends on all that code for now. So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ. On the other hand we want to stay backward compatible: if CONFIG_NO_HZ is set in an older config file, we want to enable CONFIG_NO_HZ_IDLE by default. But we can't afford both at the same time or we run into a circular dependency: 1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select CONFIG_NO_HZ 2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE We might be able to support that from Kconfig/Kbuild but it may not be wise to introduce such a confusing behaviour. So to solve this, create a new CONFIG_NO_HZ_COMMON option which gathers the common code between idle and full dynticks (that common code for now is simply the idle dynticks code) and select it from their referring Kconfig. Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ to it for backward compatibility. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-03Merge branch 'fortglx/3.10/time' of ↵Thomas Gleixner
git://git.linaro.org/people/jstultz/linux into timers/core
2013-04-02nohz: Unhide full dynticks feature from its dependenciesFrederic Weisbecker
The full dynticks feature only shows up when all its Kconfig dependencies are met (RCU nocbs, RCU user mode, ...) This is far from being user friendly as those who want to activate this feature need to look into the Kconfig files and iterate through each dependency then activate these by hand in order to show and select the full dynticks Kconfig option. So process the other way around: show up the Kconfig option if the minimal low level dependencies are met and activate the high level ones when we enable the feature. Note there is one exception in the picture: CONFIG_VIRT_CPU_ACCOUNTING_GEN is part of a Kconfig choice menu and it appears we can't select it from another Kconfig selection when it's under such layout. So for now this particular item stays as a passive dependency. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-01Merge tag 'v3.9-rc5' into wq/for-3.10Tejun Heo
Writeback conversion to workqueue will be based on top of wq/for-3.10 branch to take advantage of custom attrs and NUMA support for unbound workqueues. Mainline currently contains two commits which result in non-trivial merge conflicts with wq/for-3.10 and because block/for-3.10/core is based on v3.9-rc3 which contains one of the conflicting commits, we need a pre-merge-window merge anyway. Let's pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer from workqueue merge conflicts. The two conflicts and their resolutions: * e68035fb65 ("workqueue: convert to idr_alloc()") in mainline changes worker_pool_assign_id() to use idr_alloc() instead of the old idr interface. worker_pool_assign_id() goes through multiple locking changes in wq/for-3.10 causing the following conflict. static int worker_pool_assign_id(struct worker_pool *pool) { int ret; <<<<<<< HEAD lockdep_assert_held(&wq_pool_mutex); do { if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL)) return -ENOMEM; ret = idr_get_new(&worker_pool_idr, pool, &pool->id); } while (ret == -EAGAIN); ======= mutex_lock(&worker_pool_idr_mutex); ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL); if (ret >= 0) pool->id = ret; mutex_unlock(&worker_pool_idr_mutex); >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89 return ret < 0 ? ret : 0; } We want locking from the former and idr_alloc() usage from the latter, which can be combined to the following. static int worker_pool_assign_id(struct worker_pool *pool) { int ret; lockdep_assert_held(&wq_pool_mutex); ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL); if (ret >= 0) { pool->id = ret; return 0; } return ret; } * eb2834285c ("workqueue: fix possible pool stall bug in wq_unbind_fn()") updated wq_unbind_fn() such that it has single larger for_each_std_worker_pool() loop instead of two separate loops with a schedule() call inbetween. wq/for-3.10 renamed pool->assoc_mutex to pool->manager_mutex causing the following conflict (earlier function body and comments omitted for brevity). static void wq_unbind_fn(struct work_struct *work) { ... spin_unlock_irq(&pool->lock); <<<<<<< HEAD mutex_unlock(&pool->manager_mutex); } ======= mutex_unlock(&pool->assoc_mutex); >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89 schedule(); <<<<<<< HEAD for_each_cpu_worker_pool(pool, cpu) ======= >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89 atomic_set(&pool->nr_running, 0); spin_lock_irq(&pool->lock); wake_up_worker(pool); spin_unlock_irq(&pool->lock); } } The resolution is mostly trivial. We want the control flow of the latter with the rename of the former. static void wq_unbind_fn(struct work_struct *work) { ... spin_unlock_irq(&pool->lock); mutex_unlock(&pool->manager_mutex); schedule(); atomic_set(&pool->nr_running, 0); spin_lock_irq(&pool->lock); wake_up_worker(pool); spin_unlock_irq(&pool->lock); } } Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-01workqueue: update sysfs interface to reflect NUMA awareness and a kernel ↵Tejun Heo
param to disable NUMA affinity Unbound workqueues are now NUMA aware. Let's add some control knobs and update sysfs interface accordingly. * Add kernel param workqueue.numa_disable which disables NUMA affinity globally. * Replace sysfs file "pool_id" with "pool_ids" which contain node:pool_id pairs. This change is userland-visible but "pool_id" hasn't seen a release yet, so this is okay. * Add a new sysf files "numa" which can toggle NUMA affinity on individual workqueues. This is implemented as attrs->no_numa whichn is special in that it isn't part of a pool's attributes. It only affects how apply_workqueue_attrs() picks which pools to use. After "pool_ids" change, first_pwq() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: implement NUMA affinity for unbound workqueuesTejun Heo
Currently, an unbound workqueue has single current, or first, pwq (pool_workqueue) to which all new work items are queued. This often isn't optimal on NUMA machines as workers may jump around across node boundaries and work items get assigned to workers without any regard to NUMA affinity. This patch implements NUMA affinity for unbound workqueues. Instead of mapping all entries of numa_pwq_tbl[] to the same pwq, apply_workqueue_attrs() now creates a separate pwq covering the intersecting CPUs for each NUMA node which has online CPUs in @attrs->cpumask. Nodes which don't have intersecting possible CPUs are mapped to pwqs covering whole @attrs->cpumask. As CPUs come up and go down, the pool association is changed accordingly. Changing pool association may involve allocating new pools which may fail. To avoid failing CPU_DOWN, each workqueue always keeps a default pwq which covers whole attrs->cpumask which is used as fallback if pool creation fails during a CPU hotplug operation. This ensures that all work items issued on a NUMA node is executed on the same node as long as the workqueue allows execution on the CPUs of the node. As this maps a workqueue to multiple pwqs and max_active is per-pwq, this change the behavior of max_active. The limit is now per NUMA node instead of global. While this is an actual change, max_active is already per-cpu for per-cpu workqueues and primarily used as safety mechanism rather than for active concurrency control. Concurrency is usually limited from workqueue users by the number of concurrently active work items and this change shouldn't matter much. v2: Fixed pwq freeing in apply_workqueue_attrs() error path. Spotted by Lai. v3: The previous version incorrectly made a workqueue spanning multiple nodes spread work items over all online CPUs when some of its nodes don't have any desired cpus. Reimplemented so that NUMA affinity is properly updated as CPUs go up and down. This problem was spotted by Lai Jiangshan. v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it; however, wq may be freed at any time after dfl_pwq is put making the clearing use-after-free. Clear wq->dfl_pwq before putting it. v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and @pwq_tbl after success. Fixed. Retry loop in wq_update_unbound_numa_attrs() isn't necessary as application of new attrs is excluded via CPU hotplug. Removed. Documentation on CPU affinity guarantee on CPU_DOWN added. All changes are suggested by Lai Jiangshan. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: introduce put_pwq_unlocked()Tejun Heo
Factor out lock pool, put_pwq(), unlock sequence into put_pwq_unlocked(). The two existing places are converted and there will be more with NUMA affinity support. This is to prepare for NUMA affinity support for unbound workqueues and doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: introduce numa_pwq_tbl_install()Tejun Heo
Factor out pool_workqueue linking and installation into numa_pwq_tbl[] from apply_workqueue_attrs() into numa_pwq_tbl_install(). link_pwq() is made safe to call multiple times. numa_pwq_tbl_install() links the pwq, installs it into numa_pwq_tbl[] at the specified node and returns the old entry. @last_pwq is removed from link_pwq() as the return value of the new function can be used instead. This is to prepare for NUMA affinity support for unbound workqueues. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: use NUMA-aware allocation for pool_workqueuesTejun Heo
Use kmem_cache_alloc_node() with @pool->node instead of kmem_cache_zalloc() when allocating a pool_workqueue so that it's allocated on the same node as the associated worker_pool. As there's no no kmem_cache_zalloc_node(), move zeroing to init_pwq(). This was suggested by Lai Jiangshan. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: break init_and_link_pwq() into two functions and introduce ↵Tejun Heo
alloc_unbound_pwq() Break init_and_link_pwq() into init_pwq() and link_pwq() and move unbound-workqueue specific handling into apply_workqueue_attrs(). Also, factor out unbound pool and pool_workqueue allocation into alloc_unbound_pwq(). This reorganization is to prepare for NUMA affinity and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: map an unbound workqueues to multiple per-node pool_workqueuesTejun Heo
Currently, an unbound workqueue has only one "current" pool_workqueue associated with it. It may have multple pool_workqueues but only the first pool_workqueue servies new work items. For NUMA affinity, we want to change this so that there are multiple current pool_workqueues serving different NUMA nodes. Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and points to the pool_workqueue to use for each possible node. This replaces first_pwq() in __queue_work() and workqueue_congested(). numa_pwq_tbl[] is currently initialized to point to the same pool_workqueue as first_pwq() so this patch doesn't make any behavior changes. v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function may be called only with wq->mutex held. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: move hot fields of workqueue_struct to the endTejun Heo
Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align them to the cacheline. These two fields are used in the work item issue path and thus hot. The scheduled NUMA affinity support will add dispatch table at the end of workqueue_struct and relocating these two fields will allow us hitting only single cacheline on hot paths. Note that wq->pwqs isn't moved although it currently is being used in the work item issue path for unbound workqueues. The dispatch table mentioned above will replace its use in the issue path, so it will become cold once NUMA support is implemented. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: make workqueue->name[] fixed lenTejun Heo
Currently workqueue->name[] is of flexible length. We want to use the flexible field for something more useful and there isn't much benefit in allowing arbitrary name length anyway. Make it fixed len capping at 24 bytes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: add workqueue->unbound_attrsTejun Heo
Currently, when exposing attrs of an unbound workqueue via sysfs, the workqueue_attrs of first_pwq() is used as that should equal the current state of the workqueue. The planned NUMA affinity support will make unbound workqueues make use of multiple pool_workqueues for different NUMA nodes and the above assumption will no longer hold. Introduce workqueue->unbound_attrs which records the current attrs in effect and use it for sysfs instead of first_pwq()->attrs. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: determine NUMA node of workers accourding to the allowed cpumaskTejun Heo
When worker tasks are created using kthread_create_on_node(), currently only per-cpu ones have the matching NUMA node specified. All unbound workers are always created with NUMA_NO_NODE. Now that an unbound worker pool may have an arbitrary cpumask associated with it, this isn't optimal. Add pool->node which is determined by the pool's cpumask. If the pool's cpumask is contained inside a NUMA node proper, the pool is associated with that node, and all workers of the pool are created on that node. This currently only makes difference for unbound worker pools with cpumask contained inside single NUMA node, but this will serve as foundation for making all unbound pools NUMA-affine. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: drop 'H' from kworker names of unbound worker poolsTejun Heo
Currently, all workqueue workers which have negative nice value has 'H' postfixed to their names. This is necessary for per-cpu workers as they use the CPU number instead of pool->id to identify the pool and the 'H' postfix is the only thing distinguishing normal and highpri workers. As workers for unbound pools use pool->id, the 'H' postfix is purely informational. TASK_COMM_LEN is 16 and after the static part and delimiters, there are only five characters left for the pool and worker IDs. We're expecting to have more unbound pools with the scheduled NUMA awareness support. Let's drop the non-essential 'H' postfix from unbound kworker name. While at it, restructure kthread_create*() invocation to help future NUMA related changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]Tejun Heo
Unbound workqueues are going to be NUMA-affine. Add wq_numa_tbl_len and wq_numa_possible_cpumask[] in preparation. The former is the highest NUMA node ID + 1 and the latter is masks of possibles CPUs for each NUMA node. This patch only introduces these. Future patches will make use of them. v2: NUMA initialization move into wq_numa_init(). Also, the possible cpumask array is not created if there aren't multiple nodes on the system. wq_numa_enabled bool added. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: move pwq_pool_locking outside of get/put_unbound_pool()Tejun Heo
The scheduled NUMA affinity support for unbound workqueues would need to walk workqueues list and pool related operations on each workqueue. Move wq_pool_mutex locking out of get/put_unbound_pool() to their callers so that pool operations can be performed while walking the workqueues list, which is also protected by wq_pool_mutex. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: fix memory leak in apply_workqueue_attrs()Tejun Heo
apply_workqueue_attrs() wasn't freeing temp attrs variable @new_attrs in its success path. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2013-04-01workqueue: fix unbound workqueue attrs hashing / comparisonTejun Heo
29c91e9912b ("workqueue: implement attribute-based unbound worker_pool management") implemented attrs based worker_pool matching. It tried to avoid false negative when comparing cpumasks with custom hash function; unfortunately, the hash and comparison functions fail to ignore CPUs which are not possible. It incorrectly assumed that bitmap_copy() skips leftover bits in the last word of bitmap and cpumask_equal() ignores impossible CPUs. This patch updates attrs->cpumask handling such that impossible CPUs are properly ignored. * Hash and copy functions no longer do anything special. They expect their callers to clear impossible CPUs. * alloc_workqueue_attrs() initializes the cpumask to cpu_possible_mask instead of setting all bits and explicit cpumask_setall() for unbound_std_wq_attrs[] in init_workqueues() is dropped. * apply_workqueue_attrs() is now responsible for ignoring impossible CPUs. It makes a copy of @attrs and clears impossible CPUs before doing anything else. Signed-off-by: Tejun Heo <tj@kernel.org>