summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2011-10-31writeback: trace event bdi_dirty_ratelimitWu Fengguang
It helps understand how various throttle bandwidths are updated. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-11writeback: fix ppc compile warnings on do_div(long long, unsigned long)Wu Fengguang
Fix powerpc compile warnings mm/page-writeback.c: In function 'bdi_position_ratio': mm/page-writeback.c:622:3: warning: comparison of distinct pointer types lacks a cast [enabled by default] page-writeback.c:635:4: warning: comparison of distinct pointer types lacks a cast [enabled by default] Also fix gcc "uninitialized var" warnings. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: dirty position control - bdi reserve areaWu Fengguang
Keep a minimal pool of dirty pages for each bdi, so that the disk IO queues won't underrun. Also gently increase a small bdi_thresh to avoid it stuck in 0 for some light dirtied bdi. It's particularly useful for JBOD and small memory system. It may result in (pos_ratio > 1) at the setpoint and push the dirty pages high. This is more or less intended because the bdi is in the danger of IO queue underflow. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: control dirty pause timeWu Fengguang
The dirty pause time shall ultimately be controlled by adjusting nr_dirtied_pause, since there is relationship pause = pages_dirtied / task_ratelimit Assuming pages_dirtied ~= nr_dirtied_pause task_ratelimit ~= dirty_ratelimit We get nr_dirtied_pause ~= dirty_ratelimit * desired_pause Here dirty_ratelimit is preferred over task_ratelimit because it's more stable. It's also important to limit possible large transitional errors: - bw is changing quickly - pages_dirtied << nr_dirtied_pause on entering dirty exceeded area - pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a separate fix, but still expect non-trivial errors) So we end up using the above formula inside clamp_val(). The best test case for this code is to run 100 "dd bs=4M" tasks on btrfs and check its pause time distribution. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: limit max dirty pause timeWu Fengguang
Apply two policies to scale down the max pause time for 1) small number of concurrent dirtiers 2) small memory system (comparing to storage bandwidth) MAX_PAUSE=200ms may only be suitable for high end servers with lots of concurrent dirtiers, where the large pause time can reduce much overheads. Otherwise, smaller pause time is desirable whenever possible, so as to get good responsiveness and smooth user experiences. It's actually required for good disk utilization in the case when all the dirty pages can be synced to disk within MAX_PAUSE=200ms. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: IO-less balance_dirty_pages()Wu Fengguang
As proposed by Chris, Dave and Jan, don't start foreground writeback IO inside balance_dirty_pages(). Instead, simply let it idle sleep for some time to throttle the dirtying task. In the mean while, kick off the per-bdi flusher thread to do background writeback IO. RATIONALS ========= - disk seeks on concurrent writeback of multiple inodes (Dave Chinner) If every thread doing writes and being throttled start foreground writeback, it leads to N IO submitters from at least N different inodes at the same time, end up with N different sets of IO being issued with potentially zero locality to each other, resulting in much lower elevator sort/merge efficiency and hence we seek the disk all over the place to service the different sets of IO. OTOH, if there is only one submission thread, it doesn't jump between inodes in the same way when congestion clears - it keeps writing to the same inode, resulting in large related chunks of sequential IOs being issued to the disk. This is more efficient than the above foreground writeback because the elevator works better and the disk seeks less. - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner) With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention". * "CPU usage has dropped by ~55%", "it certainly appears that most of the CPU time saving comes from the removal of contention on the inode_wb_list_lock" (IMHO at least 10% comes from the reduction of cacheline bouncing, because the new code is able to call much less frequently into balance_dirty_pages() and hence access the global page states) * the user space "App overhead" is reduced by 20%, by avoiding the cacheline pollution by the complex writeback code path * "for a ~5% throughput reduction", "the number of write IOs have dropped by ~25%", and the elapsed time reduced from 41:42.17 to 40:53.23. * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and improves IO throughput from 38MB/s to 42MB/s. - IO size too small for fast arrays and too large for slow USB sticks The write_chunk used by current balance_dirty_pages() cannot be directly set to some large value (eg. 128MB) for better IO efficiency. Because it could lead to more than 1 second user perceivable stalls. Even the current 4MB write size may be too large for slow USB sticks. The fact that balance_dirty_pages() starts IO on itself couples the IO size to wait time, which makes it hard to do suitable IO size while keeping the wait time under control. Now it's possible to increase writeback chunk size proportional to the disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram, the larger writeback size dramatically reduces the seek count to 1/10 (far beyond my expectation) and improves the write throughput by 24%. - long block time in balance_dirty_pages() hurts desktop responsiveness Many of us may have the experience: it often takes a couple of seconds or even long time to stop a heavy writing dd/cp/tar command with Ctrl-C or "kill -9". - IO pipeline broken by bumpy write() progress There are a broad class of "loop {read(buf); write(buf);}" applications whose read() pipeline will be under-utilized or even come to a stop if the write()s have long latencies _or_ don't progress in a constant rate. The current threshold based throttling inherently transfers the large low level IO completion fluctuations to bumpy application write()s, and further deteriorates with increasing number of dirtiers and/or bdi's. For example, when doing 50 dd's + 1 remote rsync to an XFS partition, the rsync progresses very bumpy in legacy kernel, and throughput is improved by 67% by this patchset. (plus the larger write chunk size, it will be 93% speedup). The new rate based throttling can support 1000+ dd's with excellent smoothness, low latency and low overheads. For the above reasons, it's much better to do IO-less and low latency pauses in balance_dirty_pages(). Jan Kara, Dave Chinner and me explored the scheme to let balance_dirty_pages() wait for enough writeback IO completions to safeguard the dirty limit. However it's found to have two problems: - in large NUMA systems, the per-cpu counters may have big accounting errors, leading to big throttle wait time and jitters. - NFS may kill large amount of unstable pages with one single COMMIT. Because NFS server serves COMMIT with expensive fsync() IOs, it is desirable to delay and reduce the number of COMMITs. So it's not likely to optimize away such kind of bursty IO completions, and the resulted large (and tiny) stall times in IO completion based throttling. So here is a pause time oriented approach, which tries to control the pause time in each balance_dirty_pages() invocations, by controlling the number of pages dirtied before calling balance_dirty_pages(), for smooth and efficient dirty throttling: - avoid useless (eg. zero pause time) balance_dirty_pages() calls - avoid too small pause time (less than 4ms, which burns CPU power) - avoid too large pause time (more than 200ms, which hurts responsiveness) - avoid big fluctuations of pause times It can control pause times at will. The default policy (in a followup patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms in 1000-dd case. BEHAVIOR CHANGE =============== (1) dirty threshold Users will notice that the applications will get throttled once crossing the global (background + dirty)/2=15% threshold, and then balanced around 17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable memory in 1-dd case. Since the task will be soft throttled earlier than before, it may be perceived by end users as performance "slow down" if his application happens to dirty more than 15% dirtyable memory. (2) smoothness/responsiveness Users will notice a more responsive system during heavy writeback. "killall dd" will take effect instantly. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: per task dirty rate limitWu Fengguang
Add two fields to task_struct. 1) account dirtied pages in the individual tasks, for accuracy 2) per-task balance_dirty_pages() call intervals, for flexibility The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will scale near-sqrt to the safety gap between dirty pages and threshold. The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying pages at exactly the same time, each task will be assigned a large initial nr_dirtied_pause, so that the dirty threshold will be exceeded long before each task reached its nr_dirtied_pause and hence call balance_dirty_pages(). The solution is to watch for the number of pages dirtied on each CPU in between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages (3% dirty threshold), force call balance_dirty_pages() for a chance to set bdi->dirty_exceeded. In normal situations, this safeguarding condition is not expected to trigger at all. On the sqrt in dirty_poll_interval(): It will serve as an initial guess when dirty pages are still in the freerun area. When dirty pages are floating inside the dirty control scope [freerun, limit], a followup patch will use some refined dirty poll interval to get the desired pause time. thresh-dirty (MB) sqrt 1 16 2 22 4 32 8 45 16 64 32 90 64 128 128 181 256 256 512 362 1024 512 The above table means, given 1MB (or 1GB) gap and the dd tasks polling balance_dirty_pages() on every 16 (or 512) pages, the dirty limit won't be exceeded as long as there are less than 16 (or 512) concurrent dd's. So sqrt naturally leads to less overheads and more safe concurrent tasks for large memory servers, which have large (thresh-freerun) gaps. peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Andrea Righi <andrea@betterlinux.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: stabilize bdi->dirty_ratelimitWu Fengguang
There are some imperfections in balanced_dirty_ratelimit. 1) large fluctuations The dirty_rate used for computing balanced_dirty_ratelimit is merely averaged in the past 200ms (very small comparing to the 3s estimation period for write_bw), which makes rather dispersed distribution of balanced_dirty_ratelimit. It's pretty hard to average out the singular points by increasing the estimation period. Considering that the averaging technique will introduce very undesirable time lags, I give it up totally. (btw, the 3s write_bw averaging time lag is much more acceptable because its impact is one-way and therefore won't lead to oscillations.) The more practical way is filtering -- most singular balanced_dirty_ratelimit points can be filtered out by remembering some prev_balanced_rate and prev_prev_balanced_rate. However the more reliable way is to guard balanced_dirty_ratelimit with task_ratelimit. 2) due to truncates and fs redirties, the (write_bw <=> dirty_rate) match could become unbalanced, which may lead to large systematical errors in balanced_dirty_ratelimit. The truncates, due to its possibly bumpy nature, can hardly be compensated smoothly. So let's face it. When some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit high, dirty pages will go higher than the setpoint. task_ratelimit will in turn become lower than dirty_ratelimit. So if we consider both balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit only when they are on the same side of dirty_ratelimit, the systematical errors in balanced_dirty_ratelimit won't be able to bring dirty_ratelimit far away. The balanced_dirty_ratelimit estimation may also be inaccurate near @limit or @freerun, however is less an issue. 3) since we ultimately want to - keep the fluctuations of task ratelimit as small as possible - keep the dirty pages around the setpoint as long time as possible the update policy used for (2) also serves the above goals nicely: if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit), and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit), there is no point to bring up dirty_ratelimit in a hurry only to hurt both the above two goals. So, we make use of task_ratelimit to limit the update of dirty_ratelimit in two ways: 1) avoid changing dirty rate when it's against the position control target (the adjusted rate will slow down the progress of dirty pages going back to setpoint). 2) limit the step size. task_ratelimit is changing values step by step, leaving a consistent trace comparing to the randomly jumping balanced_dirty_ratelimit. task_ratelimit also has the nice smaller errors in stable state and typically larger errors when there are big errors in rate. So it's a pretty good limiting factor for the step size of dirty_ratelimit. Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit. task_ratelimit is merely used as a limiting factor. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: dirty rate controlWu Fengguang
It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N) when there are N dd tasks. On write() syscall, use bdi->dirty_ratelimit ============================================ balance_dirty_pages(pages_dirtied) { task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio(); pause = pages_dirtied / task_ratelimit; sleep(pause); } On every 200ms, update bdi->dirty_ratelimit =========================================== bdi_update_dirty_ratelimit() { task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio(); balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate; bdi->dirty_ratelimit = balanced_dirty_ratelimit } Estimation of balanced bdi->dirty_ratelimit =========================================== balanced task_ratelimit ----------------------- balance_dirty_pages() needs to throttle tasks dirtying pages such that the total amount of dirty pages stays below the specified dirty limit in order to avoid memory deadlocks. Furthermore we desire fairness in that tasks get throttled proportionally to the amount of pages they dirty. IOW we want to throttle tasks such that we match the dirty rate to the writeout bandwidth, this yields a stable amount of dirty pages: dirty_rate == write_bw (1) The fairness requirement gives us: task_ratelimit = balanced_dirty_ratelimit == write_bw / N (2) where N is the number of dd tasks. We don't know N beforehand, but still can estimate balanced_dirty_ratelimit within 200ms. Start by throttling each dd task at rate task_ratelimit = task_ratelimit_0 (3) (any non-zero initial value is OK) After 200ms, we measured dirty_rate = # of pages dirtied by all dd's / 200ms write_bw = # of pages written to the disk / 200ms For the aggressive dd dirtiers, the equality holds dirty_rate == N * task_rate == N * task_ratelimit_0 (4) Or task_ratelimit_0 == dirty_rate / N (5) Now we conclude that the balanced task ratelimit can be estimated by write_bw balanced_dirty_ratelimit = task_ratelimit_0 * ---------- (6) dirty_rate Because with (4) and (5) we can get the desired equality (1): write_bw balanced_dirty_ratelimit == (dirty_rate / N) * ---------- dirty_rate == write_bw / N Then using the balanced task ratelimit we can compute task pause times like: task_pause = task->nr_dirtied / task_ratelimit task_ratelimit with position control ------------------------------------ However, while the above gives us means of matching the dirty rate to the writeout bandwidth, it at best provides us with a stable dirty page count (assuming a static system). In order to control the dirty page count such that it is high enough to provide performance, but does not exceed the specified limit we need another control. The dirty position control works by extending (2) to task_ratelimit = balanced_dirty_ratelimit * pos_ratio (7) where pos_ratio is a negative feedback function that subjects to 1) f(setpoint) = 1.0 2) df/dx < 0 That is, if the dirty pages are ABOVE the setpoint, we throttle each task a bit more HEAVY than balanced_dirty_ratelimit, so that the dirty pages are created less fast than they are cleaned, thus DROP to the setpoints (and the reverse). Based on (7) and the assumption that both dirty_ratelimit and pos_ratio remains CONSTANT for the past 200ms, we get task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio (8) Putting (8) into (6), we get the formula used in bdi_update_dirty_ratelimit(): write_bw balanced_dirty_ratelimit *= pos_ratio * ---------- (9) dirty_rate Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: add bg_threshold parameter to __bdi_update_bandwidth()Wu Fengguang
No behavior change. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: dirty position controlWu Fengguang
bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so that the resulted task rate limit can drive the dirty pages back to the global/bdi setpoints. Old scheme is, | free run area | throttle area ----------------------------------------+----------------------------> thresh^ dirty pages New scheme is, ^ task rate limit | | * | * | * |[free run] * [smooth throttled] | * | * | * ..bdi->dirty_ratelimit..........* | . * | . * | . * | . * | . * +-------------------------------.-----------------------*------------> setpoint^ limit^ dirty pages The slope of the bdi control line should be 1) large enough to pull the dirty pages to setpoint reasonably fast 2) small enough to avoid big fluctuations in the resulted pos_ratio and hence task ratelimit Since the fluctuation range of the bdi dirty pages is typically observed to be within 1-second worth of data, the bdi control line's slope is selected to be a linear function of bdi write bandwidth, so that it can adapt to slow/fast storage devices well. Assume the bdi control line pos_ratio = 1.0 + k * (dirty - bdi_setpoint) where k is the negative slope. If targeting for 12.5% fluctuation range in pos_ratio when dirty pages are fluctuating in range [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2], we get slope k = - 1 / (8 * write_bw) Let pos_ratio(x_intercept) = 0, we get the parameter used in code: x_intercept = bdi_setpoint + 8 * write_bw The global/bdi slopes are nicely complementing each other when the system has only one major bdi (indicated by bdi_thresh ~= thresh): 1) slope of global control line => scaling to the control scope size 2) slope of main bdi control line => scaling to the writeout bandwidth so that - in memory tight systems, (1) becomes strong enough to squeeze dirty pages inside the control scope - in large memory systems where the "gravity" of (1) for pulling the dirty pages to setpoint is too weak, (2) can back (1) up and drive dirty pages to bdi_setpoint ~= setpoint reasonably fast. Unfortunately in JBOD setups, the fluctuation range of bdi threshold is related to memory size due to the interferences between disks. In this case, the bdi slope will be weighted sum of write_bw and bdi_thresh. Given equations span = x_intercept - bdi_setpoint k = df/dx = - 1 / span and the extremum values span = bdi_thresh dx = bdi_thresh we get df = - dx / span = - 1.0 That means, when bdi_dirty deviates bdi_thresh up, pos_ratio and hence task ratelimit will fluctuate by -100%. peter: use 3rd order polynomial for the global control line CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-03writeback: account per-bdi accumulated dirtied pagesWu Fengguang
Introduce the BDI_DIRTIED counter. It will be used for estimating the bdi's dirty bandwidth. CC: Jan Kara <jack@suse.cz> CC: Michael Rubin <mrubin@google.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-09-21Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-block: floppy: use del_timer_sync() in init cleanup blk-cgroup: be able to remove the record of unplugged device block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request mm: Add comment explaining task state setting in bdi_forker_thread() mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread() block: simplify force plug flush code a little bit block: change force plug flush call order block: Fix queue_flag update when rq_affinity goes from 2 to 1 block: separate priority boosting from REQ_META block: remove READ_META and WRITE_META xen-blkback: fixed indentation and comments xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
2011-09-19Merge branch 'slab/urgent' of git://github.com/penberg/linuxLinus Torvalds
* 'slab/urgent' of git://github.com/penberg/linux: slub: add slab with one free object to partial list tail
2011-09-14mm: account skipped entries to avoid looping in find_get_pagesShaohua Li
The found entries by find_get_pages() could be all swap entries. In this case we skip the entries, but make sure the skipped entries are accounted, so we don't keep looping. Using nr_found > nr_skip to simplify code as suggested by Eric. Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14mm: sync vmalloc address space page tables in alloc_vm_area()David Vrabel
Xen backend drivers (e.g., blkback and netback) would sometimes fail to map grant pages into the vmalloc address space allocated with alloc_vm_area(). The GNTTABOP_map_grant_ref would fail because Xen could not find the page (in the L2 table) containing the PTEs it needed to update. (XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000 netback and blkback were making the hypercall from a kernel thread where task->active_mm != &init_mm and alloc_vm_area() was only updating the page tables for init_mm. The usual method of deferring the update to the page tables of other processes (i.e., after taking a fault) doesn't work as a fault cannot occur during the hypercall. This would work on some systems depending on what else was using vmalloc. Fix this by reverting ef691947d8a3 ("vmalloc: remove vmalloc_sync_all() from alloc_vm_area()") and add a comment to explain why it's needed. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Keir Fraser <keir.xen@gmail.com> Cc: <stable@kernel.org> [3.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14memcg: Revert "memcg: add memory.vmscan_stat"Johannes Weiner
Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add memory.vmscan_stat"). The implementation of per-memcg reclaim statistics violates how memcg hierarchies usually behave: hierarchically. The reclaim statistics are accounted to child memcgs and the parent hitting the limit, but not to hierarchy levels in between. Usually, hierarchical statistics are perfectly recursive, with each level representing the sum of itself and all its children. Since this exports statistics to userspace, this may lead to confusion and problems with changing things after the release, so revert it now, we can try again later. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14mm: vmscan: fix force-scanning small targets without swapJohannes Weiner
Without swap, anonymous pages are not scanned. As such, they should not count when considering force-scanning a small target if there is no swap. Otherwise, targets are not force-scanned even when their effective scan number is zero and the other conditions--kswapd/memcg--apply. This fixes 246e87a93934 ("memcg: fix get_scan_count() for small targets"). [akpm@linux-foundation.org: fix comment] Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14numa: fix NUMA compile error when sysfs and procfs are disabledDavid Rientjes
The vmstat_text array is only defined for CONFIG_SYSFS or CONFIG_PROC_FS, yet it is referenced for per-node vmstat with CONFIG_NUMA: drivers/built-in.o: In function `node_read_vmstat': node.c:(.text+0x1106df): undefined reference to `vmstat_text' Introduced in commit fa25c503dfa2 ("mm: per-node vmstat: show proper vmstats"). Define the array for CONFIG_NUMA as well. [akpm@linux-foundation.org: remove unneeded ifdefs] Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Cong Wang <amwang@redhat.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14mm/mempolicy.c: make copy_from_user() provably correctKAMEZAWA Hiroyuki
When compiling mm/mempolicy.c with struct user copy checks the following warning is shown: In file included from arch/x86/include/asm/uaccess.h:572, from include/linux/uaccess.h:5, from include/linux/highmem.h:7, from include/linux/pagemap.h:10, from include/linux/mempolicy.h:70, from mm/mempolicy.c:68: In function `copy_from_user', inlined from `compat_sys_get_mempolicy' at mm/mempolicy.c:1415: arch/x86/include/asm/uaccess_64.h:64: warning: call to `copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct LD mm/built-in.o Fix this by passing correct buffer size value. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-14mm/mempolicy.c: fix pgoff in mbind vma mergeCaspar Zhang
commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") didn't really fix the mbind vma merge problem due to wrong pgoff value passing to vma_merge(), which made vma_merge() always return NULL. Before the patch applied, we are getting a result like: addr = 0x7fa58f00c000 [snip] 7fa58f00c000-7fa58f00d000 rw-p 00000000 00:00 0 7fa58f00d000-7fa58f00e000 rw-p 00000000 00:00 0 7fa58f00e000-7fa58f00f000 rw-p 00000000 00:00 0 here 7fa58f00c000->7fa58f00f000 we get 3 VMAs which are expected to be merged described as described in commit 9d8cebd. Re-testing the patched kernel with the reproducer provided in commit 9d8cebd, we get the correct result: addr = 0x7ffa5aaa2000 [snip] 7ffa5aaa2000-7ffa5aaa6000 rw-p 00000000 00:00 0 7fffd556f000-7fffd5584000 rw-p 00000000 00:00 0 [stack] Signed-off-by: Caspar Zhang <caspar@casparzhang.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-02mm: Add comment explaining task state setting in bdi_forker_thread()Jan Kara
CC: Wu Fengguang <fengguang.wu@intel.com> CC: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-02mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()Jan Kara
bdi_forker_thread() clears BDI_pending bit at the end of the main loop. However clearing of this bit must not be done in some cases which is handled by calling 'continue' from switch statement. That's kind of unusual construct and without a good reason so change the function into more intuitive code flow. CC: Wu Fengguang <fengguang.wu@intel.com> CC: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-27slub: add slab with one free object to partial list tailShaohua Li
The slab has just one free object, adding it to partial list head doesn't make sense. And it can cause lock contentation. For example, 1. CPU takes the slab from partial list 2. fetch an object 3. switch to another slab 4. free an object, then the slab is added to partial list again In this way n->list_lock will be heavily contended. In fact, Alex had a hackbench regression. 3.1-rc1 performance drops about 70% against 3.0. This patch fixes it. Acked-by: Christoph Lameter <cl@linux.com> Reported-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Shaohua Li <shli@kernel.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-25memcg: fix hierarchical oom lockingJohannes Weiner
Commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than counter") tried to oom lock the hierarchy and roll back upon encountering an already locked memcg. The code is confused when it comes to detecting a locked memcg, though, so it would fail and rollback after locking one memcg and encountering an unlocked second one. The result is that oom-locking hierarchies fails unconditionally and that every oom killer invocation simply goes to sleep on the oom waitqueue forever. The tasks practically hang forever without anyone intervening, possibly holding locks that trip up unrelated tasks, too. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25vmscan: clear ZONE_CONGESTED for zone with good watermarkShaohua Li
ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any task. It's possible ZONE_CONGESTED isn't cleared in some cases: 1. the zone is already balanced just entering balance_pgdat() for order-0 because concurrent tasks free memory. In this case, later check will skip the zone as it's balanced so the flag isn't cleared. 2. high order balance fallbacks to order-0. quote from Mel: At the end of balance_pgdat(), kswapd uses the following logic; If reclaiming at high order { for each zone { if all_unreclaimable skip if watermark is not met order = 0 loop again /* watermark is met */ clear congested } } i.e. it clears ZONE_CONGESTED if it the zone is balanced. if not, it restarts balancing at order-0. However, if the higher zones are balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as that only happens after a zone is shrunk. This can mean that wait_iff_congested() stalls unnecessarily. This patch makes kswapd clear ZONE_CONGESTED during its initial highmem->dma scan for zones that are already balanced. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25mm: fix a vmscan warningShaohua Li
I get the below warning: BUG: using smp_processor_id() in preemptible [00000000] code: bash/746 caller is native_sched_clock+0x37/0x6e Pid: 746, comm: bash Tainted: G W 3.0.0+ #254 Call Trace: [<ffffffff813435c6>] debug_smp_processor_id+0xc2/0xdc [<ffffffff8104158d>] native_sched_clock+0x37/0x6e [<ffffffff81116219>] try_to_free_mem_cgroup_pages+0x7d/0x270 [<ffffffff8114f1f8>] mem_cgroup_force_empty+0x24b/0x27a [<ffffffff8114ff21>] ? sys_close+0x38/0x138 [<ffffffff8114ff21>] ? sys_close+0x38/0x138 [<ffffffff8114f257>] mem_cgroup_force_empty_write+0x17/0x19 [<ffffffff810c72fb>] cgroup_file_write+0xa8/0xba [<ffffffff811522d2>] vfs_write+0xb3/0x138 [<ffffffff8115241a>] sys_write+0x4a/0x71 [<ffffffff8114ffd9>] ? sys_close+0xf0/0x138 [<ffffffff8176deab>] system_call_fastpath+0x16/0x1b sched_clock() can't be used with preempt enabled. And we don't need fast approach to get clock here, so let's use ktime API. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25memcg: pin execution to current cpu while draining stockJohannes Weiner
Commit d1a05b6973c7 ("memcg do not try to drain per-cpu caches without pages") added a drain_local_stock() call to a preemptible section. The draining task looks up the cpu-local stock twice to set the draining-flag, then to drain the stock and clear the flag again. If the task is migrated to a different CPU in between, noone will clear the flag on the first stock and it will be forever undrainable. Its charge can not be recovered and the cgroup can not be deleted anymore. Properly pin the task to the executing CPU while draining stocks. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25Merge branch 'urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback * 'urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback: squeeze max-pause area and drop pass-good area
2011-08-19squeeze max-pause area and drop pass-good areaWu Fengguang
Revert the pass-good area introduced in ffd1f609ab10 ("writeback: introduce max-pause and pass-good dirty limits") and make the max-pause area smaller and safe. This fixes ~30% performance regression in the ext3 data=writeback fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are 12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes. Using deadline scheduler also has a regression, but not that big as CFQ, so this suggests we have some write starvation. The test logs show that - the disks are sometimes under utilized - global dirty pages sometimes rush high to the pass-good area for several hundred seconds, while in the mean time some bdi dirty pages drop to very low value (bdi_dirty << bdi_thresh). Then suddenly the global dirty pages dropped under global dirty threshold and bdi_dirty rush very high (for example, 2 times higher than bdi_thresh). During which time balance_dirty_pages() is not called at all. So the problems are 1) The random writes progress so slow that they break the assumption of the max-pause logic that "8 pages per 200ms is typically more than enough to curb heavy dirtiers". 2) The max-pause logic ignored task_bdi_thresh and thus opens the possibility for some bdi's to over dirty pages, leading to (bdi_dirty >> bdi_thresh) and then (bdi_thresh >> bdi_dirty) for others. 3) The higher max-pause/pass-good thresholds somehow leads to the bad swing of dirty pages. The fix is to allow the task to slightly dirty over task_bdi_thresh, but no way to exceed bdi_dirty and/or global dirty_thresh. Tests show that it fixed the JBOD regression completely (both behavior and performance), while still being able to cut down large pause times in balance_dirty_pages() for single-disk cases. Reported-by: Li Shaohua <shaohua.li@intel.com> Tested-by: Li Shaohua <shaohua.li@intel.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-08-17mm: make HASHED_PAGE_VIRTUAL page_address' struct page argument const.Ian Campbell
Followup to 33dd4e0ec911 "mm: make some struct page's const" which missed the HASHED_PAGE_VIRTUAL case. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-14mm: fix wrong vmap address calculations with odd NR_CPUS valuesClemens Ladisch
Commit db64fe02258f ("mm: rewrite vmap layer") introduced code that does address calculations under the assumption that VMAP_BLOCK_SIZE is a power of two. However, this might not be true if CONFIG_NR_CPUS is not set to a power of two. Wrong vmap_block index/offset values could lead to memory corruption. However, this has never been observed in practice (or never been diagnosed correctly); what caught this was the BUG_ON in vb_alloc() that checks for inconsistent vmap_block indices. To fix this, ensure that VMAP_BLOCK_SIZE always is a power of two. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=31572 Reported-by: Pavel Kysilka <goldenfish@linuxsoft.cz> Reported-by: Matias A. Fonzo <selk@dragora.org> Signed-off-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: 2.6.28+ <stable@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-09Revert "memcg: get rid of percpu_charge_mutex lock"Michal Hocko
This reverts commit 8521fc50d433507a7cdc96bec280f9e5888a54cc. The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE bit operations is sufficient but that is not true. Johannes Weiner has reported a crash during parallel memory cgroup removal: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70 Oops: 0000 [#1] PREEMPT SMP Pid: 19677, comm: rmdir Tainted: G W 3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3 RIP: 0010:[<ffffffff81083b70>] css_is_ancestor+0x20/0x70 RSP: 0018:ffff880077b09c88 EFLAGS: 00010202 Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310) Call Trace: [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40 [<ffffffff810feccf>] drain_all_stock+0x11f/0x170 [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0 [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20 [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500 [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0 [<ffffffff81114e7b>] do_rmdir+0xfb/0x110 [<ffffffff81114ea6>] sys_rmdir+0x16/0x20 [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b We are crashing because we try to dereference cached memcg when we are checking whether we should wait for draining on the cache. The cache is already cleaned up, though. There is also a theoretical chance that the cached memcg gets freed between we test for the FLUSHING_CACHED_CHARGE and dereference it in mem_cgroup_same_or_subtree: CPU0 CPU1 CPU2 mem=stock->cached stock->cached=NULL clear_bit test_and_set_bit test_bit() ... <preempted> mem_cgroup_destroy use after free The percpu_charge_mutex protected from this race because sync draining is exclusive. It is safer to revert now and come up with a more parallel implementation later. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-09slub: Fix partial count comparison confusionChristoph Lameter
deactivate_slab() has the comparison if more than the minimum number of partial pages are in the partial list wrong. An effect of this may be that empty pages are not freed from deactivate_slab(). The result could be an OOM due to growth of the partial slabs per node. Frees mostly occur from __slab_free which is okay so this would only affect use cases where a lot of switching around of per cpu slabs occur. Switching per cpu slabs occurs with high frequency if debugging options are enabled. Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-09slub: fix check_bytes() for slub debuggingAkinobu Mita
The check_bytes() function is used by slub debugging. It returns a pointer to the first unmatching byte for a character in the given memory area. If the character for matching byte is greater than 0x80, check_bytes() doesn't work. Becuase 64-bit pattern is generated as below. value64 = value | value << 8 | value << 16 | value << 24; value64 = value64 | value64 << 32; The integer promotions are performed and sign-extended as the type of value is u8. The upper 32 bits of value64 is 0xffffffff in the first line, and the second line has no effect. This fixes the 64-bit pattern generation. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Matt Mackall <mpm@selenic.com> Reviewed-by: Marcin Slusarz <marcin.slusarz@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-09slub: Fix full list corruption if debugging is onChristoph Lameter
When a slab is freed by __slab_free() and the slab can only contain a single object ever then it was full (and therefore not on the partial lists but on the full list in the debug case) before we reached slab_empty. This caused the following full list corruption when SLUB debugging was enabled: [ 5913.233035] ------------[ cut here ]------------ [ 5913.233097] WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98() [ 5913.233101] Hardware name: Adamo 13 [ 5913.233105] list_del corruption. prev->next should be ffffea000434fd20, but was ffffea0004199520 [ 5913.233108] Modules linked in: nfs fscache fuse ebtable_nat ebtables ppdev parport_pc lp parport ipt_MASQUERADE iptable_nat nf_nat nfsd lockd nfs_acl auth_rpcgss xt_CHECKSUM sunrpc iptable_mangle bridge stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables rfcomm bnep arc4 iwlagn snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel btusb mac80211 snd_hda_codec bluetooth snd_hwdep snd_seq snd_seq_device snd_pcm usb_debug dell_wmi sparse_keymap cdc_ether usbnet cdc_acm uvcvideo cdc_wdm mii cfg80211 snd_timer dell_laptop videodev dcdbas snd microcode v4l2_compat_ioctl32 soundcore joydev tg3 pcspkr snd_page_alloc iTCO_wdt i2c_i801 rfkill iTCO_vendor_support wmi virtio_net kvm_intel kvm ipv6 xts gf128mul dm_crypt i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan] [ 5913.233213] Pid: 0, comm: swapper Not tainted 3.0.0+ #127 [ 5913.233213] Call Trace: [ 5913.233213] <IRQ> [<ffffffff8105df18>] warn_slowpath_common+0x83/0x9b [ 5913.233213] [<ffffffff8105dfd3>] warn_slowpath_fmt+0x46/0x48 [ 5913.233213] [<ffffffff8127e7c1>] __list_del_entry+0x8d/0x98 [ 5913.233213] [<ffffffff8127e7da>] list_del+0xe/0x2d [ 5913.233213] [<ffffffff814e0430>] __slab_free+0x1db/0x235 [ 5913.233213] [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff81133085>] kmem_cache_free+0x88/0x102 [ 5913.233213] [<ffffffff811706ab>] bvec_free_bs+0x35/0x37 [ 5913.233213] [<ffffffff811706e1>] bio_free+0x34/0x64 [ 5913.233213] [<ffffffff813dc390>] dm_bio_destructor+0x12/0x14 [ 5913.233213] [<ffffffff8116fef6>] bio_put+0x2b/0x2d [ 5913.233213] [<ffffffff813dccab>] clone_endio+0x9e/0xb4 [ 5913.233213] [<ffffffff8116f7dd>] bio_endio+0x2d/0x2f [ 5913.233213] [<ffffffffa00148da>] crypt_dec_pending+0x5c/0x8b [dm_crypt] [ 5913.233213] [<ffffffffa00150a9>] crypt_endio+0x78/0x81 [dm_crypt] [ Full discussion here: https://lkml.org/lkml/2011/8/4/375 ] Make sure that we remove such a slab also from the full lists. Reported-and-tested-by: Dave Jones <davej@redhat.com> Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-08-04Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: slab, lockdep: Annotate the locks before using them lockdep: Clear whole lockdep_map on initialization slab, lockdep: Annotate slab -> rcu -> debug_object -> slab lockdep: Fix up warning lockdep: Fix trace_hardirqs_on_caller() futex: Fix regression with read only mappings
2011-08-04slab, lockdep: Annotate the locks before using themPeter Zijlstra
Fernando found we hit the regular OFF_SLAB 'recursion' before we annotate the locks, cure this. The relevant portion of the stack-trace: > [ 0.000000] [<c085e24f>] rt_spin_lock+0x50/0x56 > [ 0.000000] [<c04fb406>] __cache_free+0x43/0xc3 > [ 0.000000] [<c04fb23f>] kmem_cache_free+0x6c/0xdc > [ 0.000000] [<c04fb2fe>] slab_destroy+0x4f/0x53 > [ 0.000000] [<c04fb396>] free_block+0x94/0xc1 > [ 0.000000] [<c04fc551>] do_tune_cpucache+0x10b/0x2bb > [ 0.000000] [<c04fc8dc>] enable_cpucache+0x7b/0xa7 > [ 0.000000] [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61 > [ 0.000000] [<c0bba687>] start_kernel+0x24c/0x363 > [ 0.000000] [<c0bba0ba>] i386_start_kernel+0xa9/0xaf Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-04slab, lockdep: Annotate slab -> rcu -> debug_object -> slabPeter Zijlstra
Lockdep thinks there's lock recursion through: kmem_cache_free() cache_flusharray() spin_lock(&l3->list_lock) <----------------. free_block() | slab_destroy() | call_rcu() | debug_object_activate() | debug_object_init() | __debug_object_init() | kmem_cache_alloc() | cache_alloc_refill() | spin_lock(&l3->list_lock) --' Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no actual possibility of recursing. Luckily debug objects marks it slab with SLAB_DEBUG_OBJECTS so we can identify the thing. Mark all SLAB_DEBUG_OBJECTS (all one!) slab caches with a special lockdep key so that lockdep sees its a different cachep. Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble. Reported-and-tested-by: Sebastian Siewior <sebastian@breakpoint.cc> [ fixes to the initial patch ] Reported-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-03Merge branch 'apei-release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'apei-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: ACPI, APEI, EINJ Param support is disabled by default APEI GHES: 32-bit buildfix ACPI: APEI build fix ACPI, APEI, GHES: Add hardware memory error recovery support HWPoison: add memory_failure_queue() ACPI, APEI, GHES, Error records content based throttle ACPI, APEI, GHES, printk support for recoverable error via NMI lib, Make gen_pool memory allocator lockless lib, Add lock-less NULL terminated single list Add Kconfig option ARCH_HAVE_NMI_SAFE_CMPXCHG ACPI, APEI, Add WHEA _OSC support ACPI, APEI, Add APEI bit support in generic _OSC call ACPI, APEI, GHES, Support disable GHES at boot time ACPI, APEI, GHES, Prevent GHES to be built as module ACPI, APEI, Use apei_exec_run_optional in APEI EINJ and ERST ACPI, APEI, Add apei_exec_run_optional ACPI, APEI, GHES, Do not ratelimit fatal error printk before panic ACPI, APEI, ERST, Fix erst-dbg long record reading issue ACPI, APEI, ERST, Prevent erst_dbg from loading if ERST is disabled
2011-08-03mm: clarify the radix_tree exceptional casesHugh Dickins
Make the radix_tree exceptional cases, mostly in filemap.c, clearer. It's hard to devise a suitable snappy name that illuminates the use by shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree generality. And akpm points out that /* radix_tree_deref_retry(page) */ comments look like calls that have been commented out for unknown reason. Skirt the naming difficulty by rearranging these blocks to handle the transient radix_tree_deref_retry(page) case first; then just explain the remaining shmem/tmpfs swap case in a comment. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03tmpfs radix_tree: locate_item to speed up swapoffHugh Dickins
We have already acknowledged that swapoff of a tmpfs file is slower than it was before conversion to the generic radix_tree: a little slower there will be acceptable, if the hotter paths are faster. But it was a shock to find swapoff of a 500MB file 20 times slower on my laptop, taking 10 minutes; and at that rate it significantly slows down my testing. Now, most of that turned out to be overhead from PROVE_LOCKING and PROVE_RCU: without those it was only 4 times slower than before; and more realistic tests on other machines don't fare as badly. I've tried a number of things to improve it, including tagging the swap entries, then doing lookup by tag: I'd expected that to halve the time, but in practice it's erratic, and often counter-productive. The only change I've so far found to make a consistent improvement, is to short-circuit the way we go back and forth, gang lookup packing entries into the array supplied, then shmem scanning that array for the target entry. Scanning in place doubles the speed, so it's now only twice as slow as before (or three times slower when the PROVEs are on). So, add radix_tree_locate_item() as an expedient, once-off, single-caller hack to do the lookup directly in place. #ifdef it on CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited applicability as save space in other configurations. And, sadly, #include sched.h for cond_resched(). Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03mm: a few small updates for radix-swapHugh Dickins
Remove PageSwapBacked (!page_is_file_cache) cases from add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now go through shmem_add_to_page_cache(). Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(), and add a comment on swap entries to invalidate_mapping_pages(). And mincore_page() uses find_get_page() on what might be shmem or a tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03tmpfs: use kmemdup for short symlinksHugh Dickins
But we've not yet removed the old swp_entry_t i_direct[16] from shmem_inode_info. That's because it was still being shared with the inline symlink. Remove it now (saving 64 or 128 bytes from shmem inode size), and use kmemdup() for short symlinks, say, those up to 128 bytes. I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode() rather than shmem_evict_inode(), where we usually do such freeing? I guess it doesn't matter, and I'm not into NUMA mpol testing right now. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03tmpfs: convert shmem_writepage and enable swapHugh Dickins
Convert shmem_writepage() to use shmem_delete_from_page_cache() to use shmem_radix_tree_replace() to substitute swap entry for page pointer atomically in the radix tree. As with shmem_add_to_page_cache(), it's not entirely satisfactory to be copying such code from delete_from_swap_cache, but again judged easier to sell than making its other callers go through the extras. Remove the toy implementation's shmem_put_swap() and shmem_get_swap(), now unreferenced, and the hack to disable swap: it's now good to go. The way things have worked out, info->lock no longer helps to guard the shmem_swaplist: we increment swapped under shmem_swaplist_mutex only. That global mutex exclusion between shmem_writepage() and shmem_unuse() is not pretty, and we ought to find another way; but it's been forced on us by recent race discoveries, not a consequence of this patchset. And what has become of the WARN_ON_ONCE(1) free_swap_and_cache() if a swap entry was found already present? That's no longer possible, the (unknown) one inserting this page into filecache would hit the swap entry occupying that slot. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03tmpfs: convert mem_cgroup shmem to radix-swapHugh Dickins
Remove mem_cgroup_shmem_charge_fallback(): it was only required when we had to move swappage to filecache with GFP_NOWAIT. Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by moving its call out from shmem_add_to_page_cache() to two of thats three callers. But leave it doing mem_cgroup_uncharge_cache_page() on error: although asymmetrical, it's easier for all 3 callers to handle. These two changes would also be appropriate if anyone were to start using shmem_read_mapping_page_gfp() with GFP_NOWAIT. Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test radix_tree_exceptional_entry() to get what it needs for itself. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03tmpfs: convert shmem_getpage_gfp to radix-swapHugh Dickins
Convert shmem_getpage_gfp(), the engine-room of shmem, to expect page or swap entry returned from radix tree by find_lock_page(). Whereas the repetitive old method proceeded mainly under info->lock, dropping and repeating whenever one of the conditions needed was not met, now we can proceed without it, leaving shmem_add_to_page_cache() to check for a race. This way there is no need to preallocate a page, no need for an early radix_tree_preload(), no need for mem_cgroup_shmem_charge_fallback(). Move the error unwinding down to the bottom instead of repeating it throughout. ENOSPC handling is a little different from before: there is no longer any race between find_lock_page() and finding swap, but we can arrive at ENOSPC before calling shmem_recalc_inode(), which might occasionally discover freed space. Be stricter to check i_size before returning. info->lock is used for little but alloced, swapped, i_blocks updates. Move i_blocks updates out from under the max_blocks check, so even an unlimited size=0 mount can show accurate du. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03tmpfs: convert shmem_unuse_inode to radix-swapHugh Dickins
Convert shmem_unuse_inode() to use a lockless gang lookup of the radix tree, searching for matching swap. This is somewhat slower than the old method: because of repeated radix tree descents, because of copying entries up, but probably most because the old method noted and skipped once a vector page was cleared of swap. Perhaps we can devise a use of radix tree tagging to achieve that later. shmem_add_to_page_cache() uses shmem_radix_tree_replace() to compensate for the lockless lookup by checking that the expected entry is in place, under lock. It is not very satisfactory to be copying this much from add_to_page_cache_locked(), but I think easier to sell than insisting that every caller of add_to_page_cache*() go through the extras. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03tmpfs: convert shmem_truncate_range to radix-swapHugh Dickins
Disable the toy swapping implementation in shmem_writepage() - it's hard to support two schemes at once - and convert shmem_truncate_range() to a lockless gang lookup of swap entries along with pages, freeing both. Since the second loop tightens its noose until all entries of either kind have been squeezed out (and we shall make sure that there's not an instant when neither is visible), there is no longer a need for yet another pass below. shmem_radix_tree_replace() compensates for the lockless lookup by checking that the expected entry is in place, under lock, before replacing it. Here it just deletes, but will be used in later patches to substitute swap entry for page or page for swap entry. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03tmpfs: copy truncate_inode_pages_rangeHugh Dickins
Bring truncate.c's code for truncate_inode_pages_range() inline into shmem_truncate_range(), replacing its first call (there's a followup call below, but leave that one, it will disappear next). Don't play with it yet, apart from leaving out the cleancache flush, and (importantly) the nrpages == 0 skip, and moving shmem_setattr()'s partial page preparation into its partial page handling. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>