summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2013-11-13mm/nobootmem.c: have __free_pages_memory() free in larger chunks.Robin Holt
On large memory machines it can take a few minutes to get through free_all_bootmem(). Currently, when free_all_bootmem() calls __free_pages_memory(), the number of contiguous pages that __free_pages_memory() passes to the buddy allocator is limited to BITS_PER_LONG. BITS_PER_LONG was originally chosen to keep things similar to mm/nobootmem.c. But it is more efficient to limit it to MAX_ORDER. base new change 8TB 202s 172s 30s 16TB 401s 351s 50s That is around 1%-3% improvement on total boot time. This patch was spun off from the boot time rfc Robin and I had been working on. Signed-off-by: Robin Holt <robin.m.holt@gmail.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Robin Holt <robinmholt@linux.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mike Travis <travis@sgi.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm: add a helper function to check may oom conditionQiang Huang
Use helper function to check if we need to deal with oom condition. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm/memory_hotplug.c: use pfn_to_nid() instead of page_to_nid(pfn_to_page())Xishi Qiu
Use "pfn_to_nid(pfn)" instead of "page_to_nid(pfn_to_page(pfn))". Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm/memory_hotplug.c: rename the function is_memblock_offlined_cb()Xishi Qiu
A is_memblock_offlined() return or 1 means memory block is offlined, but is_memblock_offlined_cb() returning 1 means memory block is not offlined, this will confuse somebody, so rename the function. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm: use populated_zone() instead of if(zone->present_pages)Xishi Qiu
Use "if (zone->present_pages)" instead of "if (zone->present_pages)". Simplify the code, no functional change. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm: use pgdat_end_pfn() to simplify the code in othersXishi Qiu
Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn + pgdat->node_spanned_pages". Simplify the code, no functional change. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm/huge_memory.c: fix stale comments of transparent_hugepage_flagsJianguo Wu
Since commit 13ece886d99c ("thp: transparent hugepage config choice"), transparent hugepage support is disabled by default, and TRANSPARENT_HUGEPAGE_ALWAYS is configured when TRANSPARENT_HUGEPAGE=y. And since commit d39d33c332c6 ("thp: enable direct defrag"), defrag is enable for all transparent hugepage page faults by default, not only in MADV_HUGEPAGE regions. Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm: remove obsolete comments about page table lockNaoya Horiguchi
The callers of free_pgd_range() and hugetlb_free_pgd_range() don't hold page table locks. The comments seems to be obsolete, so let's remove them. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm/compaction.c: update comment about zone lock in isolate_freepages_blockJerome Marchand
Since commit f40d1e42bb98 ("mm: compaction: acquire the zone->lock as late as possible"), isolate_freepages_block() takes the zone->lock itself. The function description however still states that the zone->lock must be held. This patch removes this outdated statement. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm/vmalloc: use NUMA_NO_NODEJianguo Wu
Use more appropriate "if (node == NUMA_NO_NODE)" instead of "if (node < 0)" Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13ksm: remove redundant __GFP_ZERO from kcallocJoe Perches
kcalloc returns zeroed memory. There's no need to use this flag. Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm/readahead.c:do_readhead(): don't check for ->readpageAndrew Morton
The callee force_page_cache_readahead() already does this and unlike do_readahead(), force_page_cache_readahead() remembers to check for ->readpages() as well. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-12Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "The main changes in this cycle are: - (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik van Riel, Peter Zijlstra et al. Yay! - optimize preemption counter handling: merge the NEED_RESCHED flag into the preempt_count variable, by Peter Zijlstra. - wait.h fixes and code reorganization from Peter Zijlstra - cfs_bandwidth fixes from Ben Segall - SMP load-balancer cleanups from Peter Zijstra - idle balancer improvements from Jason Low - other fixes and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits) ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED stop_machine: Fix race between stop_two_cpus() and stop_cpus() sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus sched: Fix asymmetric scheduling for POWER7 sched: Move completion code from core.c to completion.c sched: Move wait code from core.c to wait.c sched: Move wait.c into kernel/sched/ sched/wait: Fix __wait_event_interruptible_lock_irq_timeout() sched: Avoid throttle_cfs_rq() racing with period_timer stopping sched: Guarantee new group-entities always have weight sched: Fix hrtimer_cancel()/rq->lock deadlock sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining sched: Fix race on toggling cfs_bandwidth_used sched: Remove extra put_online_cpus() inside sched_setaffinity() sched/rt: Fix task_tick_rt() comment sched/wait: Fix build breakage sched/wait: Introduce prepare_to_wait_event() sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too sched: Remove get_online_cpus() usage sched: Fix race in migrate_swap_stop() ...
2013-11-11mm, slub: fix the typo in mm/slub.cZhi Yong Wu
Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-11-11slub: Handle NULL parameter in kmem_cache_flagsChristoph Lameter
Andreas Herrmann writes: When I've used slub_debug kernel option (e.g. "slub_debug=,skbuff_fclone_cache" or similar) on a debug session I've seen a panic like: Highbank #setenv bootargs console=ttyAMA0 root=/dev/sda2 kgdboc.kgdboc=ttyAMA0,115200 slub_debug=,kmalloc-4096 earlyprintk=ttyAMA0 ... Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = c0004000 [00000000] *pgd=00000000 Internal error: Oops: 5 [#1] SMP ARM Modules linked in: CPU: 0 PID: 0 Comm: swapper Tainted: G W 3.12.0-00048-gbe408cd #314 task: c0898360 ti: c088a000 task.ti: c088a000 PC is at strncmp+0x1c/0x84 LR is at kmem_cache_flags.isra.46.part.47+0x44/0x60 pc : [<c02c6da0>] lr : [<c0110a3c>] psr: 200001d3 sp : c088bea8 ip : c088beb8 fp : c088beb4 r10: 00000000 r9 : 413fc090 r8 : 00000001 r7 : 00000000 r6 : c2984a08 r5 : c0966e78 r4 : 00000000 r3 : 0000006b r2 : 0000000c r1 : 00000000 r0 : c2984a08 Flags: nzCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment kernel Control: 10c5387d Table: 0000404a DAC: 00000015 Process swapper (pid: 0, stack limit = 0xc088a248) Stack: (0xc088bea8 to 0xc088c000) bea0: c088bed4 c088beb8 c0110a3c c02c6d90 c0966e78 00000040 bec0: ef001f00 00000040 c088bf14 c088bed8 c0112070 c0110a04 00000005 c010fac8 bee0: c088bf5c c088bef0 c010fac8 ef001f00 00000040 00000000 00000040 00000001 bf00: 413fc090 00000000 c088bf34 c088bf18 c0839190 c0112040 00000000 ef001f00 bf20: 00000000 00000000 c088bf54 c088bf38 c0839200 c083914c 00000006 c0961c4c bf40: c0961c28 00000000 c088bf7c c088bf58 c08392ac c08391c0 c08a2ed8 c0966e78 bf60: c086b874 c08a3f50 c0961c28 00000001 c088bfb4 c088bf80 c083b258 c0839248 bf80: 2f800000 0f000000 c08935b4 ffffffff c08cd400 ffffffff c08cd400 c0868408 bfa0: c29849c0 00000000 c088bff4 c088bfb8 c0824974 c083b1e4 ffffffff ffffffff bfc0: c08245c0 00000000 00000000 c0868408 00000000 10c5387d c0892bcc c0868404 bfe0: c0899440 0000406a 00000000 c088bff8 00008074 c0824824 00000000 00000000 [<c02c6da0>] (strncmp+0x1c/0x84) from [<c0110a3c>] (kmem_cache_flags.isra.46.part.47+0x44/0x60) [<c0110a3c>] (kmem_cache_flags.isra.46.part.47+0x44/0x60) from [<c0112070>] (__kmem_cache_create+0x3c/0x410) [<c0112070>] (__kmem_cache_create+0x3c/0x410) from [<c0839190>] (create_boot_cache+0x50/0x74) [<c0839190>] (create_boot_cache+0x50/0x74) from [<c0839200>] (create_kmalloc_cache+0x4c/0x88) [<c0839200>] (create_kmalloc_cache+0x4c/0x88) from [<c08392ac>] (create_kmalloc_caches+0x70/0x114) [<c08392ac>] (create_kmalloc_caches+0x70/0x114) from [<c083b258>] (kmem_cache_init+0x80/0xe0) [<c083b258>] (kmem_cache_init+0x80/0xe0) from [<c0824974>] (start_kernel+0x15c/0x318) [<c0824974>] (start_kernel+0x15c/0x318) from [<00008074>] (0x8074) Code: e3520000 01a00002 089da800 e5d03000 (e5d1c000) ---[ end trace 1b75b31a2719ed1d ]--- Kernel panic - not syncing: Fatal exception Problem is that slub_debug option is not parsed before create_boot_cache is called. Solve this by changing slub_debug to early_param. Kernels 3.11, 3.10 are also affected. I am not sure about older kernels. Christoph Lameter explains: kmem_cache_flags may be called with NULL parameter during early boot. Skip the test in that case. Cc: stable@vger.kernel.org # 3.10 and 3.11 Reported-by: Andreas Herrmann <andreas.herrmann@calxeda.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-11-11Merge branch 'slab/struct-page' into slab/nextPekka Enberg
2013-11-08bdi: test bdi_init failureMikulas Patocka
There were two places where return value from bdi_init was not tested. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-06seqcount: Add lockdep functionality to seqcount/seqlock structuresJohn Stultz
Currently seqlocks and seqcounts don't support lockdep. After running across a seqcount related deadlock in the timekeeping code, I used a less-refined and more focused variant of this patch to narrow down the cause of the issue. This is a first-pass attempt to properly enable lockdep functionality on seqlocks and seqcounts. Since seqcounts are used in the vdso gettimeofday code, I've provided non-lockdep accessors for those needs. I've also handled one case where there were nested seqlock writers and there may be more edge cases. Comments and feedback would be appreciated! Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Link: http://lkml.kernel.org/r/1381186321-4906-3-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06Merge branch 'sched/core' into core/locking, to prepare the kernel/locking/ ↵Ingo Molnar
file move Conflicts: kernel/Makefile There are conflicts in kernel/Makefile due to file moving in the scheduler tree - resolve them. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/emulex/benet/be.h drivers/net/netconsole.c net/bridge/br_private.h Three mostly trivial conflicts. The net/bridge/br_private.h conflict was a function signature (argument addition) change overlapping with the extern removals from Joe Perches. In drivers/net/netconsole.c we had one change adjusting a printk message whilst another changed "printk(KERN_INFO" into "pr_info(". Lastly, the emulex change was a new inline function addition overlapping with Joe Perches's extern removals. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-01memcg: remove incorrect underflow checkGreg Thelen
When a memcg is deleted mem_cgroup_reparent_charges() moves charged memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" there's bad pointer read. The goal was to check for counter underflow. The counter is a per cpu counter and there are two problems with the code: (1) per cpu access function isn't used, instead a naked pointer is used which easily causes oops. (2) the check doesn't sum all cpus Test: $ cd /sys/fs/cgroup/memory $ mkdir x $ echo 3 > /proc/sys/vm/drop_caches $ (echo $BASHPID >> x/tasks && exec cat) & [1] 7154 $ grep ^mapped x/memory.stat mapped_file 53248 $ echo 7154 > tasks $ rmdir x <OOPS> The fix is to remove the check. It's currently dangerous and isn't worth fixing it to use something expensive, such as percpu_counter_sum(), for each reparented page. __this_cpu_read() isn't enough to fix this because there's no guarantees of the current cpus count. The only guarantees is that the sum of all per-cpu counter is >= nr_pages. Fixes: 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting") Reported-and-tested-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-01Merge branch 'linus' into sched/coreIngo Molnar
Resolve cherry-picking conflicts: Conflicts: mm/huge_memory.c mm/memory.c mm/mprotect.c See this upstream merge commit for more details: 52469b4fcd4f Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-31Merge branch 'akpm' (fixes from Andrew Morton)Linus Torvalds
Merge four more fixes from Andrew Morton. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: lib/scatterlist.c: don't flush_kernel_dcache_page on slab page mm: memcg: fix test for child groups mm: memcg: lockdep annotation for memcg OOM lock mm: memcg: use proper memcg in limit bypass
2013-10-31mm: memcg: fix test for child groupsJohannes Weiner
When memcg code needs to know whether any given memcg has children, it uses the cgroup child iteration primitives and returns true/false depending on whether the iteration loop is executed at least once or not. Because a cgroup's list of children is RCU protected, these primitives require the RCU read-lock to be held, which is not the case for all memcg callers. This results in the following splat when e.g. enabling hierarchy mode: WARNING: CPU: 3 PID: 1 at kernel/cgroup.c:3043 css_next_child+0xa3/0x160() CPU: 3 PID: 1 Comm: systemd Not tainted 3.12.0-rc5-00117-g83f11a9-dirty #18 Hardware name: LENOVO 3680B56/3680B56, BIOS 6QET69WW (1.39 ) 04/26/2012 Call Trace: dump_stack+0x54/0x74 warn_slowpath_common+0x78/0xa0 warn_slowpath_null+0x1a/0x20 css_next_child+0xa3/0x160 mem_cgroup_hierarchy_write+0x5b/0xa0 cgroup_file_write+0x108/0x2a0 vfs_write+0xbd/0x1e0 SyS_write+0x4c/0xa0 system_call_fastpath+0x16/0x1b In the memcg case, we only care about children when we are attempting to modify inheritable attributes interactively. Racing with deletion could mean a spurious -EBUSY, no problem. Racing with addition is handled just fine as well through the memcg_create_mutex: if the child group is not on the list after the mutex is acquired, it won't be initialized from the parent's attributes until after the unlock. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-31mm: memcg: lockdep annotation for memcg OOM lockJohannes Weiner
The memcg OOM lock is a mutex-type lock that is open-coded due to memcg's special needs. Add annotations for lockdep coverage. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-31mm: memcg: use proper memcg in limit bypassJohannes Weiner
Commit 84235de394d9 ("fs: buffer: move allocation failure loop into the allocator") allowed __GFP_NOFAIL allocations to bypass the limit if they fail to reclaim enough memory for the charge. But because the main test case was on a 3.2-based system, the patch missed the fact that on newer kernels the charge function needs to return root_mem_cgroup when bypassing the limit, and not NULL. This will corrupt whatever memory is at NULL + percpu pointer offset. Fix this quickly before problems are reported. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-31Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull NUMA balancing memory corruption fixes from Ingo Molnar: "So these fixes are definitely not something I'd like to sit on, but as I said to Mel at the KS the timing is quite tight, with Linus planning v3.12-final within a week. Fedora-19 is affected: comet:~> grep NUMA_BALANCING /boot/config-3.11.3-201.fc19.x86_64 CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y CONFIG_NUMA_BALANCING=y AFAICS Ubuntu will be affected as well, once it updates the kernel: hubble:~> grep NUMA_BALANCING /boot/config-3.8.0-32-generic CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y CONFIG_NUMA_BALANCING=y These 6 commits are a minimalized set of cherry-picks needed to fix the memory corruption bugs. All commits are fixes, except "mm: numa: Sanitize task_numa_fault() callsites" which is a cleanup that made two followup fixes simpler. I've done targeted testing with just this SHA1 to try to make sure there are no cherry-picking artifacts. The original non-cherry-picked set of fixes were exposed to linux-next for a couple of weeks" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mm: Account for a THP NUMA hinting update as one PTE update mm: Close races between THP migration and PMD numa clearing mm: numa: Sanitize task_numa_fault() callsites mm: Prevent parallel splits during THP migration mm: Wait for THP migrations to complete during NUMA hinting faults mm: numa: Do not account for a hinting fault if we raced
2013-10-30Merge branch 'akpm' (fixes from Andrew Morton)Linus Torvalds
Merge three fixes from Andrew Morton. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting percpu: fix this_cpu_sub() subtrahend casting for unsigneds mm/pagewalk.c: fix walk_page_range() access of wrong PTEs
2013-10-30memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend castingGreg Thelen
As of commit 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting") memcg counter errors are possible when moving charged memory to a different memcg. Charge movement occurs when processing writes to memory.force_empty, moving tasks to a memcg with memcg.move_charge_at_immigrate=1, or memcg deletion. An example showing error after memory.force_empty: $ cd /sys/fs/cgroup/memory $ mkdir x $ rm /data/tmp/file $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) & [1] 13600 $ grep ^mapped x/memory.stat mapped_file 1048576 $ echo 13600 > tasks $ echo 1 > x/memory.force_empty $ grep ^mapped x/memory.stat mapped_file 4503599627370496 mapped_file should end with 0. 4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages 1048576 == 0x10,0000 == 0x100 pages This issue only affects the source memcg on 64 bit machines; the destination memcg counters are correct. So the rmdir case is not too important because such counters are soon disappearing with the entire memcg. But the memcg.force_empty and memory.move_charge_at_immigrate=1 cases are larger problems as the bogus counters are visible for the (possibly long) remaining life of the source memcg. The problem is due to memcg use of __this_cpu_from(.., -nr_pages), which is subtly wrong because it subtracts the unsigned int nr_pages (either -1 or -512 for THP) from a signed long percpu counter. When nr_pages=-1, -nr_pages=0xffffffff. On 64 bit machines stat->count[idx] is signed 64 bit. So memcg's attempt to simply decrement a count (e.g. from 1 to 0) boils down to: long count = 1 unsigned int nr_pages = 1 count += -nr_pages /* -nr_pages == 0xffff,ffff */ count is now 0x1,0000,0000 instead of 0 The fix is to subtract the unsigned page count rather than adding its negation. This only works once "percpu: fix this_cpu_sub() subtrahend casting for unsigneds" is applied to fix this_cpu_sub(). Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-30mm/pagewalk.c: fix walk_page_range() access of wrong PTEsChen LinX
When walk_page_range walk a memory map's page tables, it'll skip VM_PFNMAP area, then variable 'next' will to assign to vma->vm_end, it maybe larger than 'end'. In next loop, 'addr' will be larger than 'next'. Then in /proc/XXXX/pagemap file reading procedure, the 'addr' will growing forever in pagemap_pte_range, pte_to_pagemap_entry will access the wrong pte. BUG: Bad page map in process procrank pte:8437526f pmd:785de067 addr:9108d000 vm_flags:00200073 anon_vma:f0d99020 mapping: (null) index:9108d CPU: 1 PID: 4974 Comm: procrank Tainted: G B W O 3.10.1+ #1 Call Trace: dump_stack+0x16/0x18 print_bad_pte+0x114/0x1b0 vm_normal_page+0x56/0x60 pagemap_pte_range+0x17a/0x1d0 walk_page_range+0x19e/0x2c0 pagemap_read+0x16e/0x200 vfs_read+0x84/0x150 SyS_read+0x4a/0x80 syscall_call+0x7/0xb Signed-off-by: Liu ShuoX <shuox.liu@intel.com> Signed-off-by: Chen LinX <linx.z.chen@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> [3.10.x+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-30mm: list_lru: fix almost infinite loop causing effective livelockRussell King
I've seen a fair number of issues with kswapd and other processes appearing to get stuck in v3.12-rc. Using sysrq-p many times seems to indicate that it gets stuck somewhere in list_lru_walk_node(), called from prune_icache_sb() and super_cache_scan(). I never seem to be able to trigger a calltrace for functions above that point. So I decided to add the following to super_cache_scan(): @@ -81,10 +81,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink, inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid); dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid); total_objects = dentries + inodes + fs_objects + 1; +printk("%s:%u: %s: dentries %lu inodes %lu total %lu\n", current->comm, current->pid, __func__, dentries, inodes, total_objects); /* proportion the scan between the caches */ dentries = mult_frac(sc->nr_to_scan, dentries, total_objects); inodes = mult_frac(sc->nr_to_scan, inodes, total_objects); +printk("%s:%u: %s: dentries %lu inodes %lu\n", current->comm, current->pid, __func__, dentries, inodes); +BUG_ON(dentries == 0); +BUG_ON(inodes == 0); /* * prune the dcache first as the icache is pinned by it, then @@ -99,7 +103,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink, freed += sb->s_op->free_cached_objects(sb, fs_objects, sc->nid); } - +printk("%s:%u: %s: dentries %lu inodes %lu freed %lu\n", current->comm, current->pid, __func__, dentries, inodes, freed); drop_super(sb); return freed; } and shortly thereafter, having applied some pressure, I got this: update-apt-xapi:1616: super_cache_scan: dentries 25632 inodes 2 total 25635 update-apt-xapi:1616: super_cache_scan: dentries 1023 inodes 0 ------------[ cut here ]------------ Kernel BUG at c0101994 [verbose debug info unavailable] Internal error: Oops - BUG: 0 [#3] SMP ARM Modules linked in: fuse rfcomm bnep bluetooth hid_cypress CPU: 0 PID: 1616 Comm: update-apt-xapi Tainted: G D 3.12.0-rc7+ #154 task: daea1200 ti: c3bf8000 task.ti: c3bf8000 PC is at super_cache_scan+0x1c0/0x278 LR is at trace_hardirqs_on+0x14/0x18 Process update-apt-xapi (pid: 1616, stack limit = 0xc3bf8240) ... Backtrace: (super_cache_scan) from [<c00cd69c>] (shrink_slab+0x254/0x4c8) (shrink_slab) from [<c00d09a0>] (try_to_free_pages+0x3a0/0x5e0) (try_to_free_pages) from [<c00c59cc>] (__alloc_pages_nodemask+0x5) (__alloc_pages_nodemask) from [<c00e07c0>] (__pte_alloc+0x2c/0x13) (__pte_alloc) from [<c00e3a70>] (handle_mm_fault+0x84c/0x914) (handle_mm_fault) from [<c001a4cc>] (do_page_fault+0x1f0/0x3bc) (do_page_fault) from [<c001a7b0>] (do_translation_fault+0xac/0xb8) (do_translation_fault) from [<c000840c>] (do_DataAbort+0x38/0xa0) (do_DataAbort) from [<c00133f8>] (__dabt_usr+0x38/0x40) Notice that we had a very low number of inodes, which were reduced to zero my mult_frac(). Now, prune_icache_sb() calls list_lru_walk_node() passing that number of inodes (0) into that as the number of objects to scan: long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan, int nid) { LIST_HEAD(freeable); long freed; freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate, &freeable, &nr_to_scan); which does: unsigned long list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate, void *cb_arg, unsigned long *nr_to_walk) { struct list_lru_node *nlru = &lru->node[nid]; struct list_head *item, *n; unsigned long isolated = 0; spin_lock(&nlru->lock); restart: list_for_each_safe(item, n, &nlru->list) { enum lru_status ret; /* * decrement nr_to_walk first so that we don't livelock if we * get stuck on large numbesr of LRU_RETRY items */ if (--(*nr_to_walk) == 0) break; So, if *nr_to_walk was zero when this function was entered, that means we're wanting to operate on (~0UL)+1 objects - which might as well be infinite. Clearly this is not correct behaviour. If we think about the behaviour of this function when *nr_to_walk is 1, then clearly it's wrong - we decrement first and then test for zero - which results in us doing nothing at all. A post-decrement would give the desired behaviour - we'd try to walk one object and one object only if *nr_to_walk were one. It also gives the correct behaviour for zero - we exit at this point. Fixes: 5cedf721a7cd ("list_lru: fix broken LRU_RETRY behaviour") Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> [ Modified to make sure we never underflow the count: this function gets called in a loop, so the 0 -> ~0ul transition is dangerous - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-30slab: replace non-existing 'struct freelist *' with 'void *'Joonsoo Kim
There is no 'strcut freelist', but codes use pointer to 'struct freelist'. Although compiler doesn't complain anything about this wrong usage and codes work fine, but fixing it is better. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
2013-10-30slab: fix to calm down kmemleak warningJoonsoo Kim
After using struct page as slab management, we should not call kmemleak_scan_area(), since struct page isn't the tracking object of kmemleak. Without this patch and if CONFIG_DEBUG_KMEMLEAK is enabled, so many kmemleak warnings are printed. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
2013-10-29mm: Account for a THP NUMA hinting update as one PTE updateMel Gorman
A THP PMD update is accounted for as 512 pages updated in vmstat. This is large difference when estimating the cost of automatic NUMA balancing and can be misleading when comparing results that had collapsed versus split THP. This patch addresses the accounting issue. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-29mm: Close races between THP migration and PMD numa clearingMel Gorman
THP migration uses the page lock to guard against parallel allocations but there are cases like this still open Task A Task B --------------------- --------------------- do_huge_pmd_numa_page do_huge_pmd_numa_page lock_page mpol_misplaced == -1 unlock_page goto clear_pmdnuma lock_page mpol_misplaced == 2 migrate_misplaced_transhuge pmd = pmd_mknonnuma set_pmd_at During hours of testing, one crashed with weird errors and while I have no direct evidence, I suspect something like the race above happened. This patch extends the page lock to being held until the pmd_numa is cleared to prevent migration starting in parallel while the pmd_numa is being cleared. It also flushes the old pmd entry and orders pagetable insertion before rmap insertion. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-29mm: numa: Sanitize task_numa_fault() callsitesMel Gorman
There are three callers of task_numa_fault(): - do_huge_pmd_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_pmd_numa_page(): Accounts not at all when the page isn't migrated, otherwise accounts against the node we migrated towards. This seems wrong to me; all three sites should have the same sementaics, furthermore we should accounts against where the page really is, we already know where the task is. So modify all three sites to always account; we did after all receive the fault; and always account to where the page is after migration, regardless of success. They all still differ on when they clear the PTE/PMD; ideally that would get sorted too. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-29mm: Prevent parallel splits during THP migrationMel Gorman
THP migrations are serialised by the page lock but on its own that does not prevent THP splits. If the page is split during THP migration then the pmd_same checks will prevent page table corruption but the unlock page and other fix-ups potentially will cause corruption. This patch takes the anon_vma lock to prevent parallel splits during migration. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-29mm: Wait for THP migrations to complete during NUMA hinting faultsMel Gorman
The locking for migrating THP is unusual. While normal page migration prevents parallel accesses using a migration PTE, THP migration relies on a combination of the page_table_lock, the page lock and the existance of the NUMA hinting PTE to guarantee safety but there is a bug in the scheme. If a THP page is currently being migrated and another thread traps a fault on the same page it checks if the page is misplaced. If it is not, then pmd_numa is cleared. The problem is that it checks if the page is misplaced without holding the page lock meaning that the racing thread can be migrating the THP when the second thread clears the NUMA bit and faults a stale page. This patch checks if the page is potentially being migrated and stalls using the lock_page if it is potentially being migrated before checking if the page is misplaced or not. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-6-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-29mm: numa: Do not account for a hinting fault if we racedMel Gorman
If another task handled a hinting fault in parallel then do not double account for it. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-5-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-24file->f_op is never NULL...Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-24slub: proper kmemleak tracking if CONFIG_SLUB_DEBUG disabledRoman Bobniev
Move all kmemleak calls into hook functions, and make it so that all hooks (both inside and outside of #ifdef CONFIG_SLUB_DEBUG) call the appropriate kmemleak routines. This allows for kmemleak to be configured independently of slub debug features. It also fixes a bug where kmemleak was only partially enabled in some configurations. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Roman Bobniev <Roman.Bobniev@sonymobile.com> Signed-off-by: Tim Bird <tim.bird@sonymobile.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
2013-10-24slab: rename slab_bufctl to slab_freelistJoonsoo Kim
Now, bufctl is not proper name to this array. So change it. Acked-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
2013-10-24slab: remove useless statement for checking pfmemallocJoonsoo Kim
Now, virt_to_page(page->s_mem) is same as the page, because slab use this structure for management. So remove useless statement. Acked-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
2013-10-24slab: use struct page for slab managementJoonsoo Kim
Now, there are a few field in struct slab, so we can overload these over struct page. This will save some memory and reduce cache footprint. After this change, slabp_cache and slab_size no longer related to a struct slab, so rename them as freelist_cache and freelist_size. These changes are just mechanical ones and there is no functional change. Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
2013-10-24slab: replace free and inuse in struct slab with newly introduced activeJoonsoo Kim
Now, free in struct slab is same meaning as inuse. So, remove both and replace them with active. Acked-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
2013-10-24slab: remove SLAB_LIMITJoonsoo Kim
It's useless now, so remove it. Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
2013-10-24slab: remove kmem_bufctl_tJoonsoo Kim
Now, we changed the management method of free objects of the slab and there is no need to use special value, BUFCTL_END, BUFCTL_FREE and BUFCTL_ACTIVE. So remove them. Acked-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
2013-10-24slab: change the management method of free objects of the slabJoonsoo Kim
Current free objects management method of the slab is weird, because it touch random position of the array of kmem_bufctl_t when we try to get free object. See following example. struct slab's free = 6 kmem_bufctl_t array: 1 END 5 7 0 4 3 2 To get free objects, we access this array with following pattern. 6 -> 3 -> 7 -> 2 -> 5 -> 4 -> 0 -> 1 -> END If we have many objects, this array would be larger and be not in the same cache line. It is not good for performance. We can do same thing through more easy way, like as the stack. Only thing we have to do is to maintain stack top to free object. I use free field of struct slab for this purpose. After that, if we need to get an object, we can get it at stack top and manipulate top pointer. That's all. This method already used in array_cache management. Following is an access pattern when we use this method. struct slab's free = 0 kmem_bufctl_t array: 6 3 7 2 5 4 0 1 To get free objects, we access this array with following pattern. 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 This may help cache line footprint if slab has many objects, and, in addition, this makes code much much simpler. Acked-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
2013-10-24slab: use __GFP_COMP flag for allocating slab pagesJoonsoo Kim
If we use 'struct page' of first page as 'struct slab', there is no advantage not to use __GFP_COMP. So use __GFP_COMP flag for all the cases. Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>
2013-10-24slab: use well-defined macro, virt_to_slab()Joonsoo Kim
This is trivial change, just use well-defined macro. Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi>