path: root/mm

2016-10-07  mm, compaction: require only min watermarks for non-costly orders  (Vlastimil Babka)

The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction. Then __isolate_free_page() uses the low watermark check to
decide if a particular free page can be isolated. In the latter case,
using the low watermark is needlessly pessimistic, as the free page
isolations are only temporary.

For __compaction_suitable() the higher watermark makes sense for
high-order allocations where more freepages increase the chance of
success, and we can typically fail with some order-0 fallback when the
system is struggling to reach that watermark. But for low-order
allocations, forming the page should not be that hard. So using the low
watermark here might just prevent compaction from even trying, and
eventually lead to the OOM killer even if we are above min watermarks.

So after this patch, we use the min watermark for non-costly orders in
__compaction_suitable(), and for all orders in __isolate_free_page().

[vbabka@suse.cz: clarify __isolate_free_page() comment]
Link: http://lkml.kernel.org/r/7ae4baec-4eca-e70b-2a69-94bea4fb19fa@suse.cz
Link: http://lkml.kernel.org/r/20160810091226.6709-11-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

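A minimal sketch of the resulting watermark choice in
__compaction_suitable(); the surrounding variable names and the
4.8-era __zone_watermark_ok() signature are assumptions, not the
commit's verbatim diff:

    /* Non-costly orders only need the min watermark; costly orders
     * still get the extra headroom of the low watermark. */
    unsigned long watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                low_wmark_pages(zone) : min_wmark_pages(zone);

    watermark += compact_gap(order);
    if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
                             ALLOC_CMA, wmark_target))
        return COMPACT_SKIPPED;
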
2016-10-07  mm, compaction: use proper alloc_flags in __compaction_suitable()  (Vlastimil Babka)

The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction. This check uses the direct compactor's alloc_flags, but
that's wrong, since these flags are not applicable for freepage
isolation. For example, alloc_flags may indicate access to memory
reserves, making compaction proceed, and then fail the watermark check
during the isolation.

A similar problem exists for ALLOC_CMA, which may be part of
alloc_flags, but not during freepage isolation. In this case however it
makes sense to use ALLOC_CMA both in __compaction_suitable() and
__isolate_free_page(), since there's actually nothing preventing the
freepage scanner from isolating from CMA pageblocks, with the
assumption that a page that could be migrated once by compaction can be
migrated also later by CMA allocation. Thus we should count pages in
CMA pageblocks when considering compaction suitability and when
isolating freepages.

To sum up, this patch should remove some false positives from
__compaction_suitable(), and allow compaction to proceed when free
pages required for compaction reside in the CMA pageblocks.

Link: http://lkml.kernel.org/r/20160810091226.6709-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-10-07  mm, compaction: create compact_gap wrapper  (Vlastimil Babka)

Compaction uses a watermark gap of (2UL << order) pages at various
places and it's not immediately obvious why. Abstract it through a
compact_gap() wrapper to create a single place with a thorough
explanation.

[vbabka@suse.cz: clarify the comment of compact_gap()]
Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

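Given the (2UL << order) figure quoted above, the wrapper plausibly
reduces to something like this (the comment wording is ours, not the
commit's):

    /*
     * Compaction needs room both for a migration target page and for
     * assembling the order-aligned block itself, hence roughly twice
     * the requested order's worth of pages as a safety gap.
     */
    static inline unsigned long compact_gap(unsigned int order)
    {
        return 2UL << order;
    }
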
2016-10-07  mm, compaction: use correct watermark when checking compaction success  (Vlastimil Babka)

The __compact_finished() function uses the low watermark in a check
that has to pass if direct compaction is to finish and the allocation
is to succeed. This is too pessimistic, as the allocation will
typically use the min watermark. It may happen that during compaction,
we drop below the low watermark (due to parallel activity), but still
form the target high-order page. By checking against the low watermark,
we might needlessly continue compaction.

Similarly, __compaction_suitable() uses the low watermark in a check of
whether allocation can succeed without compaction. Again, this is
unnecessarily pessimistic.

After this patch, these checks will use the direct compactor's
alloc_flags to determine the watermark, which is effectively the min
watermark.

Link: http://lkml.kernel.org/r/20160810091226.6709-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-10-07  mm, compaction: add the ultimate direct compaction priority  (Vlastimil Babka)

During the reclaim/compaction loop, it's desirable to get a final
answer from unsuccessful compaction so we can either fail the
allocation or invoke the OOM killer. However, heuristics such as
deferred compaction or pageblock skip bits can cause compaction to skip
parts or whole zones and lead to premature OOMs, failures or excessive
reclaim/compaction retries.

To remedy this, we introduce a new direct compaction priority called
COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:

 - ignore deferred compaction status for a zone
 - ignore pageblock skip hints
 - ignore cached scanner positions and scan the whole zone

The new priority should get eventually picked up by
should_compact_retry() and this should improve success rates for costly
allocations using __GFP_REPEAT, such as hugetlbfs allocations, and
reduce some corner-case OOMs for non-costly allocations.

Link: http://lkml.kernel.org/r/20160810091226.6709-6-vbabka@suse.cz
[vbabka@suse.cz: use the MIN_COMPACT_PRIORITY alias]
Link: http://lkml.kernel.org/r/d443b884-87e7-1c93-8684-3a3a35759fb1@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

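A sketch of how such a priority scale could look. COMPACT_PRIO_SYNC_FULL
and the MIN_COMPACT_PRIORITY alias are named by the entry; the remaining
levels are assumptions based on the async/sync modes it mentions:

    enum compact_priority {
        COMPACT_PRIO_SYNC_FULL,     /* the new "ultimate" level */
        MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
        COMPACT_PRIO_SYNC_LIGHT,    /* sync, still honours heuristics */
        DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
        COMPACT_PRIO_ASYNC,         /* cheapest, used for first attempts */
        INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC,
    };
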
2016-10-07  mm, compaction: don't recheck watermarks after COMPACT_SUCCESS  (Vlastimil Babka)

Joonsoo has reminded me that in a later patch changing watermark checks
throughout compaction I forgot to update checks in
try_to_compact_pages() and kcompactd_do_work(). Closer inspection
however shows that they are redundant now in the success case, because
compact_zone() now reliably reports this with COMPACT_SUCCESS. So
effectively the checks just repeat (a subset of) checks that have just
passed. So instead of checking watermarks again, just test the return
value.

Note it's also possible that compaction would declare failure e.g.
because its find_suitable_fallback() is more strict than a simple
watermark check, and then the watermark check we are removing would
still succeed. After this patch this is not possible and it's arguably
better, because for long-term fragmentation avoidance we should rather
try a different zone than allocate with the unsuitable fallback. If
compaction of all zones fails and the allocation is important enough,
it will retry and succeed anyway.

Also remove the stray "bool success" variable from kcompactd_do_work().

Link: http://lkml.kernel.org/r/20160810091226.6709-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-10-07  mm, compaction: rename COMPACT_PARTIAL to COMPACT_SUCCESS  (Vlastimil Babka)

COMPACT_PARTIAL has historically meant that compaction returned after
doing some work without fully compacting a zone. It however didn't
distinguish if compaction terminated because it succeeded in creating
the requested high-order page. This has changed recently and now we
only return COMPACT_PARTIAL when compaction thinks it succeeded, or the
high-order watermark check in compaction_suitable() passes and no
compaction needs to be done.

So at this point we can make the return value clearer by renaming it to
COMPACT_SUCCESS. The next patch will remove some redundant tests for
success where compaction just returned COMPACT_SUCCESS.

Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-10-07  mm, compaction: cleanup unused functions  (Vlastimil Babka)

Since kswapd compaction moved to kcompactd, compact_pgdat() is not
called anymore, so we remove it. The only caller of __compact_pgdat()
is compact_node(), so we merge them and remove code that was only
reachable from kswapd.

Link: http://lkml.kernel.org/r/20160810091226.6709-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-10-07  mm, compaction: make whole_zone flag ignore cached scanner positions  (Vlastimil Babka)

Patch series "make direct compaction more deterministic".

This is mostly a followup to Michal's oom detection rework, which
highlighted the need for direct compaction to provide better feedback
in the reclaim/compaction loop, so that it can reliably recognize when
compaction cannot make further progress, and allocation should invoke
the OOM killer or fail. We've discussed this at LSF/MM [1] where I
proposed expanding the async/sync migration mode used in compaction to
more general "priorities". This patchset adds one new priority that
just overrides all the heuristics and makes compaction fully scan all
zones. I don't currently think that we need more fine-grained
priorities, but we'll see. Other than that there's some smaller fixes
and cleanups, mainly related to the THP-specific hacks.

I've tested this with stress-highalloc in GFP_KERNEL order-4 and
THP-like order-9 scenarios. There's some improvement for compaction
stats for the order-4, which is likely due to the better watermarks
handling. In the previous version I reported mostly noise wrt
compaction stats, and decreased direct reclaim - now the reclaim is
without difference. I believe this is due to the less aggressive
compaction priority increase in patch 6.

"before" is a mmotm tree prior to the 4.7 release plus the first part
of the series that was sent and merged separately.

                                  before      after
  order-4:
  Compaction stalls                27216      30759
  Compaction success               19598      25475
  Compaction failures               7617       5283
  Page migrate success            370510     464919
  Page migrate failure             25712      27987
  Compaction pages isolated       849601    1041581
  Compaction migrate scanned   143146541  101084990
  Compaction free scanned      208355124  144863510
  Compaction cost                   1403       1210

  order-9:
  Compaction stalls                 7311       7401
  Compaction success                1634       1683
  Compaction failures               5677       5718
  Page migrate success            194657     183988
  Page migrate failure              4753       4170
  Compaction pages isolated       498790     456130
  Compaction migrate scanned      565371     524174
  Compaction free scanned        4230296    4250744
  Compaction cost                    215        203

[1] https://lwn.net/Articles/684611/

This patch (of 11):

A recent patch has added a whole_zone flag that compaction sets when
scanning starts from the zone boundary, in order to report that the
zone has been fully scanned in one attempt. For allocations that want
to try really hard or cannot fail, we will want to introduce a mode
where scanning the whole zone is guaranteed regardless of the cached
positions.

This patch reuses the whole_zone flag in a way that if it's already
passed true to compaction, the cached scanner positions are ignored.
Employing this flag during the reclaim/compaction loop will be done in
the next patch. This patch however converts compaction invoked from
userspace via procfs to use this flag. Before this patch, the cached
positions were first reset to zone boundaries and then read back from
struct zone, so there was a window where a parallel compaction could
replace the reset values, making the manual compaction less effective.
Using the flag instead of performing the reset is more robust.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-10-07  mm/oom_kill.c: fix task_will_free_mem() comment  (Michal Hocko)

Attempt to demystify the task_will_free_mem() loop.

Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-10-07  mm: memcontrol: add sanity checks for memcg->id.ref on get/put  (Vladimir Davydov)

Link: http://lkml.kernel.org/r/1c5ddb1c171dbdfc3262252769d6138a29b35b70.1470219853.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-10-07  mm/vmalloc.c: fix align value calculation error  (zijun_hu)

fls_long() causes a doubled alignment requirement in
__get_vm_area_node() if the size parameter is a power of 2 and
VM_IOREMAP is set in the flags parameter, for example:

  size=0x10000 -> fls_long(0x10000)=17 -> align=0x20000

get_count_order_long() is implemented and can be used instead of
fls_long() to fix the bug, for example:

  size=0x10000 -> get_count_order_long(0x10000)=16 -> align=0x10000

[akpm@linux-foundation.org: s/get_order_long()/get_count_order_long()/]
[zijun_hu@zoho.com: fixes]
Link: http://lkml.kernel.org/r/57AABC8B.1040409@zoho.com
[akpm@linux-foundation.org: locate get_count_order_long() next to get_count_order()]
[akpm@linux-foundation.org: move get_count_order[_long] definitions to pick up fls_long()]
[zijun_hu@htc.com: move out get_count_order[_long]() from __KERNEL__ scope]
Link: http://lkml.kernel.org/r/57B2C4CE.80303@zoho.com
Link: http://lkml.kernel.org/r/fc045ecf-20fa-0722-b3ac-9a6140488fad@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

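A sketch of the alignment computation in question; the clamp bounds are
an assumption about __get_vm_area_node()'s shape at the time, while the
example values come straight from the entry:

    unsigned long align = 1;

    if (flags & VM_IOREMAP)
        /* fls_long(0x10000) == 17 would double the alignment;
         * get_count_order_long(0x10000) == 16 keeps it exact */
        align = 1ul << clamp_t(int, get_count_order_long(size),
                               PAGE_SHIFT, IOREMAP_MAX_ORDER);
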
2016-10-07  mm: oom: deduplicate victim selection code for memcg and global oom  (Vladimir Davydov)

When selecting an oom victim, we use the same heuristic for both memory
cgroup and global oom. The only difference is the scope of tasks to
select the victim from. So we could just export an iterator over all
memcg tasks and keep all oom related logic in oom_kill.c, but instead
we duplicate pieces of it in memcontrol.c reusing some initially
private functions of oom_kill.c in order to not duplicate all of it.
That looks ugly and error prone, because any modification of
select_bad_process should also be propagated to
mem_cgroup_out_of_memory.

Let's rework this as follows: keep all oom heuristic related code
private to oom_kill.c and make oom_kill.c use exported memcg functions
when it's really necessary (like in case of iterating over memcg
tasks).

Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-10-07  Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds)

Pull VFS splice updates from Al Viro:
 "There's a bunch of branches this cycle, both mine and from other
  folks and I'd rather send pull requests separately.

  This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
  (and introduction of such). Gets rid of a lot of code in fs/splice.c
  and elsewhere; there will be followups, but these are for the next
  cycle... Some pipe/splice-related cleanups from Miklos in the same
  branch as well"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  pipe: fix comment in pipe_buf_operations
  pipe: add pipe_buf_steal() helper
  pipe: add pipe_buf_confirm() helper
  pipe: add pipe_buf_release() helper
  pipe: add pipe_buf_get() helper
  relay: simplify relay_file_read()
  switch default_file_splice_read() to use of pipe-backed iov_iter
  switch generic_file_splice_read() to use of ->read_iter()
  new iov_iter flavour: pipe-backed
  fuse_dev_splice_read(): switch to add_to_pipe()
  skb_splice_bits(): get rid of callback
  new helper: add_to_pipe()
  splice: lift pipe_lock out of splice_to_pipe()
  splice: switch get_iovec_page_array() to iov_iter
  splice_to_pipe(): don't open-code wakeup_pipe_readers()
  consistent treatment of EFAULT on O_DIRECT read/write

2016-10-06  Merge tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs  (Linus Torvalds)

Pull xfs and iomap updates from Dave Chinner:
 "The main things in this update are the iomap-based DAX
  infrastructure, an XFS delalloc rework, and a chunk of fixes to how
  log recovery schedules writeback to prevent spurious corruption
  detections when recovery of certain items was not required.

  The other main chunk of code is some preparation for the upcoming
  reflink functionality. Most of it is generic and cleanups that stand
  alone, but they were ready and reviewed so are in this pull request.

  Speaking of reflink, I'm currently planning to send you another pull
  request next week containing all the new reflink functionality. I'm
  working through a similar process to the last cycle, where I sent the
  reverse mapping code in a separate request because of how large it
  was. The reflink code merge is even bigger than reverse mapping, so
  I'll be doing the same thing again....

  Summary for this update:

   - change of XFS mailing list to linux-xfs@vger.kernel.org
   - iomap-based DAX infrastructure w/ XFS and ext2 support
   - small iomap fixes and additions
   - more efficient XFS delayed allocation infrastructure based on iomap
   - a rework of log recovery writeback scheduling to ensure we don't
     fail recovery when trying to replay items that are already on disk
   - some preparation patches for upcoming reflink support
   - configurable error handling fixes and documentation
   - aio access time update race fixes for XFS and
     generic_file_read_iter"

* tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (40 commits)
  fs: update atime before I/O in generic_file_read_iter
  xfs: update atime before I/O in xfs_file_dio_aio_read
  ext2: fix possible integer truncation in ext2_iomap_begin
  xfs: log recovery tracepoints to track current lsn and buffer submission
  xfs: update metadata LSN in buffers during log recovery
  xfs: don't warn on buffers not being recovered due to LSN
  xfs: pass current lsn to log recovery buffer validation
  xfs: rework log recovery to submit buffers on LSN boundaries
  xfs: quiesce the filesystem after recovery on readonly mount
  xfs: remote attribute blocks aren't really userdata
  ext2: use iomap to implement DAX
  ext2: stop passing buffer_head to ext2_get_blocks
  xfs: use iomap to implement DAX
  xfs: refactor xfs_setfilesize
  xfs: take the ilock shared if possible in xfs_file_iomap_begin
  xfs: fix locking for DAX writes
  dax: provide an iomap based fault handler
  dax: provide an iomap based dax read/write path
  dax: don't pass buffer_head to copy_user_dax
  dax: don't pass buffer_head to dax_insert_mapping
  ...

2016-10-05  switch generic_file_splice_read() to use of ->read_iter()  (Al Viro)

... and kill the ->splice_read() instances that can be switched to it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2016-10-05  mm: filemap: fix mapping->nrpages double accounting in fuse  (Johannes Weiner)

Commit 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker
caused by replace_page_cache_page()") switched replace_page_cache()
from raw radix tree operations to page_cache_tree_insert(), but didn't
take into account that the latter function, unlike the raw radix tree
op, handles mapping->nrpages. As a result, that counter is bumped for
each page replacement rather than kept balanced.

The mapping->nrpages counter is used to skip needless radix tree walks
when invalidating, truncating, or syncing inodes without pages, as well
as for statistics reported to userspace. Since the error is positive,
we'll do more page cache tree walks than necessary; we won't miss a
necessary one. And we'll report more buffer pages to userspace than
there are. The error is limited to fuse inodes.

Fixes: 22f2ac51b6d6 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-10-05  mm: filemap: don't plant shadow entries without radix tree node  (Johannes Weiner)

When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:

  kernel BUG at ./include/linux/swap.h:276!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE
   nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns
   nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 soundcore wmi
   acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad
   pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace
   sunrpc dm_crypt
  CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
  Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
  task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
  RIP: page_cache_tree_insert+0xf1/0x100
  Call Trace:
    __add_to_page_cache_locked+0x12e/0x270
    add_to_page_cache_lru+0x4e/0xe0
    mpage_readpages+0x112/0x1d0
    blkdev_readpages+0x1d/0x20
    __do_page_cache_readahead+0x1ad/0x290
    force_page_cache_readahead+0xaa/0x100
    page_cache_sync_readahead+0x3f/0x50
    generic_file_read_iter+0x5af/0x740
    blkdev_read_iter+0x35/0x40
    __vfs_read+0xe1/0x130
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x13/0x8f
  Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83
   c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b
   e8 88 68 ef ff 0f 1f 84 00
  RIP page_cache_tree_insert+0xf1/0x100

This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node->count is split in half
to count shadows in the upper bits and pages in the lower bits.

Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node->count. When there is a shadow entry
directly in root->rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and bump its
node->count, i.e. increase the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node->count underflows
and triggers the warning. Afterwards, without node->count reaching 0
again, the radix tree node is leaked.

Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

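A sketch of the split node->count layout described above; the mask and
shift names match the BUG_ON quoted in the workingset fix further down
this log, but the helper names here are our own:

    /* low bits count pages, high bits count shadow entries */
    static unsigned int node_pages(struct radix_tree_node *node)
    {
        return node->count & RADIX_TREE_COUNT_MASK;
    }

    static unsigned int node_shadows(struct radix_tree_node *node)
    {
        return node->count >> RADIX_TREE_COUNT_SHIFT;
    }
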
2016-10-05  mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()  (zijun_hu)

In order to ensure the percpu group areas within a chunk aren't
distributed too sparsely, pcpu_embed_first_chunk() goes to the error
handling path when a chunk spans over 3/4 of the VMALLOC area. However,
during the error handling, it forgets to free the memory allocated for
all percpu groups, by going to label @out_free instead of
@out_free_areas.

This will cause a memory leak if that rare scenario really happens. To
fix the issue, we check the chunk's spanned area immediately after
completing memory allocation for all percpu groups, and go to label
@out_free_areas to free the memory and then return if the check fails.

To verify the approach, we dump all memory allocated, then enforce the
jump, then dump all memory freed; the result is okay after checking
that we free all memory we allocate in this function.

BTW, the approach was chosen after thinking over the scenarios below:
 - we don't go to label @out_free directly to fix this issue, since we
   might free several allocated memory blocks twice
 - the aim of jumping after pcpu_setup_first_chunk() is to bypass the
   freeing of usable memory rather than to handle an error; moreover,
   the function does not return an error code in any case, it either
   panics due to BUG_ON() or returns 0.

Signed-off-by: zijun_hu <zijun_hu@htc.com>
Tested-by: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

2016-10-05  mm/percpu.c: correct max_distance calculation for pcpu_embed_first_chunk()  (zijun_hu)

pcpu_embed_first_chunk() calculates the range a percpu chunk spans into
@max_distance and uses it to ensure that a chunk is not too big
compared to the total vmalloc area. However, during the calculation, it
used an incorrect top address by adding a unit size to the highest
group's base address. This can make the calculated max_distance
slightly smaller than the actual distance, although given the scale of
the values involved the error is very unlikely to have an actual
impact. Fix this issue by adding the group's size instead of a unit
size.

BTW, the type of the variable max_distance is changed from size_t to
unsigned long based on the considerations below:
 - type unsigned long usually has the same width as IP core registers
   and can be applied here very well
 - it makes @max_distance type-consistent with the operands calculated
   against it, such as @ai->groups[i].base_offset and the macro
   VMALLOC_TOTAL
 - type unsigned long is more universal than size_t; size_t is defined
   as unsigned int or unsigned long among the various arches

Signed-off-by: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

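A hedged sketch of the corrected span calculation; @highest_group and
the 3/4 check's placement are assumptions, only base_offset and the
unit-size-vs-group-size distinction come from the entry itself:

    /* top = start of the highest group plus that whole group's size,
     * not just a single unit's size */
    max_distance = ai->groups[highest_group].base_offset;
    max_distance += ai->unit_size * ai->groups[highest_group].nr_units;

    /* chunk must not span too much of the vmalloc area */
    if (max_distance > VMALLOC_TOTAL * 3 / 4)
        goto out_free_areas;
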
2016-10-03  Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull CPU hotplug updates from Thomas Gleixner:
 "Yet another batch of cpu hotplug core updates and conversions:

   - Provide core infrastructure for multi instance drivers so the
     drivers do not have to keep custom lists.

   - Convert custom lists to the new infrastructure. The block-mq
     custom list conversion comes through the block tree and makes the
     diffstat tip over to more lines removed than added.

   - Handle unbalanced hotplug enable/disable calls more gracefully.

   - Remove the obsolete CPU_STARTING/DYING notifier support.

   - Convert another batch of notifier users.

  The relayfs changes which conflicted with the conversion have been
  shipped to me by Andrew.

  The remaining lot is targeted for 4.10 so that we finally can remove
  the rest of the notifiers"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  cpufreq: Fix up conversion to hotplug state machine
  blk/mq: Reserve hotplug states for block multiqueue
  x86/apic/uv: Convert to hotplug state machine
  s390/mm/pfault: Convert to hotplug state machine
  mips/loongson/smp: Convert to hotplug state machine
  mips/octeon/smp: Convert to hotplug state machine
  fault-injection/cpu: Convert to hotplug state machine
  padata: Convert to hotplug state machine
  cpufreq: Convert to hotplug state machine
  ACPI/processor: Convert to hotplug state machine
  virtio scsi: Convert to hotplug state machine
  oprofile/timer: Convert to hotplug state machine
  block/softirq: Convert to hotplug state machine
  lib/irq_poll: Convert to hotplug state machine
  x86/microcode: Convert to hotplug state machine
  sh/SH-X3 SMP: Convert to hotplug state machine
  ia64/mca: Convert to hotplug state machine
  ARM/OMAP/wakeupgen: Convert to hotplug state machine
  ARM/shmobile: Convert to hotplug state machine
  arm64/FP/SIMD: Convert to hotplug state machine
  ...

2016-10-03  Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull x86 vdso updates from Ingo Molnar:
 "The main changes in this cycle centered around adding support for
  32-bit compatible C/R of the vDSO on 64-bit kernels, by Dmitry
  Safonov"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Use CONFIG_X86_X32_ABI to enable vdso prctl
  x86/vdso: Only define map_vdso_randomized() if CONFIG_X86_64
  x86/vdso: Only define prctl_map_vdso() if CONFIG_CHECKPOINT_RESTORE
  x86/signal: Add SA_{X32,IA32}_ABI sa_flags
  x86/ptrace: Down with test_thread_flag(TIF_IA32)
  x86/coredump: Use pr_reg size, rather that TIF_IA32 flag
  x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*
  x86/vdso: Replace calculate_addr in map_vdso() with addr
  x86/vdso: Unmap vdso blob on vvar mapping failure

2016-10-03  Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull scheduler changes from Ingo Molnar:
 "The main changes are:

   - irqtime accounting cleanups and enhancements. (Frederic Weisbecker)
   - schedstat debugging enhancements, make it more broadly runtime
     available. (Josh Poimboeuf)
   - More work on asymmetric topology/capacity scheduling. (Morten
     Rasmussen)
   - sched/wait fixes and cleanups. (Oleg Nesterov)
   - PELT (per entity load tracking) improvements. (Peter Zijlstra)
   - Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)
   - sched/numa enhancements/fixes (Rik van Riel)
   - sched/cputime scalability improvements (Stanislaw Gruszka)
   - Load calculation arithmetics fixes. (Dietmar Eggemann)
   - sched/deadline enhancements (Tommaso Cucinotta)
   - Fix utilization accounting when switching to the SCHED_NORMAL
     policy. (Vincent Guittot)
   - ... plus misc cleanups and enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched/irqtime: Consolidate irqtime flushing code
  sched/irqtime: Consolidate accounting synchronization with u64_stats API
  u64_stats: Introduce IRQs disabled helpers
  sched/irqtime: Remove needless IRQs disablement on kcpustat update
  sched/irqtime: No need for preempt-safe accessors
  sched/fair: Fix min_vruntime tracking
  sched/debug: Add SCHED_WARN_ON()
  sched/core: Fix set_user_nice()
  sched/fair: Introduce set_curr_task() helper
  sched/core, ia64: Rename set_curr_task()
  sched/core: Fix incorrect utilization accounting when switching to fair class
  sched/core: Optimize SCHED_SMT
  sched/core: Rewrite and improve select_idle_siblings()
  sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  sched/core: Introduce 'struct sched_domain_shared'
  sched/core: Restructure destroy_sched_domain()
  sched/core: Remove unused @cpu argument from destroy_sched_domain*()
  sched/wait: Introduce init_wait_entry()
  sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
  sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
  ...

2016-10-03  Merge tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm  (Linus Torvalds)

Pull power management updates from Rafael Wysocki:
 "Traditionally, cpufreq is the area with the greatest number of
  changes, but there are fewer of them than last time. There also is
  some activity in the generic power domains and the devfreq
  frameworks, a couple of system suspend and hibernation fixes and some
  assorted changes in other places.

  One new feature is the cpufreq change to allow the scheduler to pass
  hints to the governors' utilization update callbacks and some code
  rework based on that. Another one is the support for domain removal
  in the generic power domains framework. Also it is now possible to
  use hibernation with PAGE_POISONING_ZERO enabled and devfreq supports
  the RockChip DFI controller and the rk3399 DMC.

  The rest of the changes is mostly fixes and cleanups in a number of
  places.

  Specifics:

   - Add a mechanism for passing hints from the scheduler to cpufreq
     governors via their utilization update callbacks and use it to
     introduce "IOwait boosting" into the schedutil governor and
     intel_pstate that will make them boost performance if the enqueued
     task was previously waiting on I/O (Rafael Wysocki).

   - Fix a schedutil governor problem that causes it to overestimate
     utilization if SMT is in use (Steve Muckle).

   - Update defconfigs trying to use the schedutil governor as a module
     which is not possible any more (Javier Martinez Canillas).

   - Update the intel_pstate's pstate_sample tracepoint to take "IOwait
     boosting" into account (Srinivas Pandruvada).

   - Fix a problem in the cpufreq core causing it to mishandle the
     initialization of CPUs registered after the cpufreq driver (Viresh
     Kumar, Rafael Wysocki).

   - Make the cpufreq-dt driver support per-policy governor tunables,
     clean it up and update its Kconfig description (Viresh Kumar).

   - Add support for more ARM platforms to the cpufreq-dt driver
     (Chanwoo Choi, Dave Gerlach, Geert Uytterhoeven).

   - Make the cpufreq CPPC driver report frequencies in KHz to avoid
     user space compatibility issues (Al Stone, Hoan Tran).

   - Clean up a few cpufreq drivers (st, kirkwood, SCPI) a bit (Colin
     Ian King, Markus Elfring).

   - Constify some local structures in the intel_pstate driver (Julia
     Lawall).

   - Add a Documentation/cpu-freq/ entry to MAINTAINERS (Jean Delvare).

   - Add support for PM domain removal to the generic power domains
     (genpd) framework, add new DT helper functions to it and make it
     always enable debugfs support if available (Jon Hunter, Tomeu
     Vizoso).

   - Clean up the generic power domains (genpd) framework and make it
     avoid measuring power-on and power-off latencies during
     system-wide PM transitions (Ulf Hansson).

   - Add support for the RockChip DFI controller and the rk3399 DMC to
     the devfreq framework (Lin Huang, Axel Lin, Arnd Bergmann).

   - Add COMPILE_TEST to the devfreq framework (Krzysztof Kozlowski,
     Stephen Rothwell).

   - Fix a minor issue in the exynos-ppmu devfreq driver and fix up
     devfreq Kconfig indentation style (Wei Yongjun, Jisheng Zhang).

   - Fix the system suspend interface to make suspend-to-idle work if
     platform suspend operations have not been registered (Sudeep
     Holla).

   - Make it possible to use hibernation with PAGE_POISONING_ZERO
     enabled (Anisse Astier).

   - Increase the default timeout of the system suspend/resume watchdog
     and make it depend on EXPERT (Chen Yu).

   - Make the operating performance points (OPP) framework avoid using
     OPPs that aren't supported by the platform and fix a build warning
     in it (Dave Gerlach, Arnd Bergmann).

   - Fix the ARM cpuidle driver's return value (Christophe Jaillet).

   - Make the SmartReflex AVS (Adaptive Voltage Scaling) driver use
     more common logging style (Joe Perches)"

* tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (58 commits)
  PM / OPP: Don't support OPP if it provides supported-hw but platform does not
  cpufreq: st: add missing \n to end of dev_err message
  cpufreq: kirkwood: add missing \n to end of dev_err messages
  PM / Domains: Rename pm_genpd_sync_poweron|poweroff()
  PM / Domains: Don't measure latency of ->power_on|off() during system PM
  PM / Domains: Remove redundant system PM callbacks
  PM / Domains: Simplify detaching a device from its genpd
  PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
  PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
  PM / OPP: avoid maybe-uninitialized warning
  PM / Domains: Allow holes in genpd_data.domains array
  cpufreq: CPPC: Avoid overflow when calculating desired_perf
  cpufreq: ti: Use generic platdev driver
  cpufreq: intel_pstate: Add io_boost trace
  partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
  cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm
  cpufreq: schedutil: Add iowait boosting
  cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
  PM / Domains: Add support for removing nested PM domains by provider
  PM / Domains: Add support for removing PM domains
  ...

2016-10-03  Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux  (Linus Torvalds)

Pull arm64 updates from Will Deacon:
 "It's a bit all over the place this time with no "killer feature" to
  speak of. Support for mismatched cache line sizes should help people
  seeing whacky JIT failures on some SoCs, and the big.LITTLE perf
  updates have been a long time coming, but a lot of the changes here
  are cleanups.

  We stray outside arch/arm64 in a few areas: the arch/arm/ arch_timer
  workaround is acked by Russell, the DT/OF bits are acked by Rob, the
  arch_timer clocksource changes acked by Marc, CPU hotplug by tglx and
  jump_label by Peter (all CC'd).

  Summary:

   - Support for execute-only page permissions
   - Support for hibernate and DEBUG_PAGEALLOC
   - Support for heterogeneous systems with mismatched cache line sizes
   - Errata workarounds (A53 843419 update and QorIQ A-008585 timer bug)
   - arm64 PMU perf updates, including cpumasks for heterogeneous systems
   - Set UTS_MACHINE for building rpm packages
   - Yet another head.S tidy-up
   - Some cleanups and refactoring, particularly in the NUMA code
   - Lots of random, non-critical fixes across the board"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (100 commits)
  arm64: tlbflush.h: add __tlbi() macro
  arm64: Kconfig: remove SMP dependence for NUMA
  arm64: Kconfig: select OF/ACPI_NUMA under NUMA config
  arm64: fix dump_backtrace/unwind_frame with NULL tsk
  arm/arm64: arch_timer: Use archdata to indicate vdso suitability
  arm64: arch_timer: Work around QorIQ Erratum A-008585
  arm64: arch_timer: Add device tree binding for A-008585 erratum
  arm64: Correctly bounds check virt_addr_valid
  arm64: migrate exception table users off module.h and onto extable.h
  arm64: pmu: Hoist pmu platform device name
  arm64: pmu: Probe default hw/cache counters
  arm64: pmu: add fallback probe table
  MAINTAINERS: Update ARM PMU PROFILING AND DEBUGGING entry
  arm64: Improve kprobes test for atomic sequence
  arm64/kvm: use alternative auto-nop
  arm64: use alternative auto-nop
  arm64: alternative: add auto-nop infrastructure
  arm64: lse: convert lse alternatives NOP padding to use __nops
  arm64: barriers: introduce nops and __nops macros for NOP sequences
  arm64: sysreg: replace open-coded mrs_s/msr_s with {read,write}_sysreg_s
  ...

2016-10-03  fs: update atime before I/O in generic_file_read_iter  (Christoph Hellwig)

After the call to ->direct_IO the final reference to the file might
have been dropped by aio_complete already, and the call to
file_accessed might cause a use after free. Instead update the access
time before the I/O, similar to how we update the time stamps before
writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

2016-10-02  Merge branches 'pm-devfreq' and 'pm-sleep'  (Rafael J. Wysocki)

* pm-devfreq:
  PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
  PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
  partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
  PM / devfreq: rockchip: add devfreq driver for rk3399 dmc
  Documentation: bindings: add dt documentation for rk3399 dmc
  PM / devfreq: event: support rockchip dfi controller
  Documentation: bindings: add dt documentation for dfi controller
  PM / devfreq: event: remove duplicate devfreq_event_get_drvdata()
  PM / devfreq: fix Kconfig indent style
  PM / devfreq: Add COMPILE_TEST for build coverage
  PM / devfreq: exynos-ppmu: remove unneeded of_node_put()

* pm-sleep:
  PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO
  PM / sleep: enable suspend-to-idle even without registered suspend_ops
  PM / sleep: Increase default DPM watchdog timeout to 120

2016-09-30  mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()  (Johannes Weiner)

Antonio reports the following crash when using fuse under memory
pressure:

  kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: all of them
  CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
  Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
  task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
  RIP: shadow_lru_isolate+0x181/0x190
  Call Trace:
    __list_lru_walk_one.isra.3+0x8f/0x130
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x34/0x50
    shrink_slab.part.40+0x1ed/0x3d0
    shrink_zone+0x2ca/0x2e0
    kswapd+0x51e/0x990
    kthread+0xd8/0xf0
    ret_from_fork+0x3f/0x70

which corresponds to the following sanity check in the shadow node
tracking:

  BUG_ON(node->count & RADIX_TREE_COUNT_MASK);

The workingset code tracks radix tree nodes that exclusively contain
shadow entries of evicted pages in them, and this (somewhat obscure)
line checks whether there are real pages left that would interfere with
reclaim of the radix tree node under memory pressure.

While discussing ways how fuse might sneak pages into the radix tree
past the workingset code, Miklos pointed to replace_page_cache_page(),
and indeed there is a problem there: it properly accounts for the old
page being removed - __delete_from_page_cache() does that - but then
does a raw radix_tree_insert(), not accounting for the replacement
page. Eventually the page count bits in node->count underflow while
leaving the node incorrectly linked to the shadow node LRU.

To address this, make sure replace_page_cache_page() uses the tracked
page insertion code, page_cache_tree_insert(). This fixes the page
accounting and makes sure page-containing nodes are properly unlinked
from the shadow node LRU again.

Also, make the sanity checks a bit less obscure by using the helpers
for checking the number of pages and shadows in a radix tree node.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
Debugged-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>	[3.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-09-30  Merge branch 'linus' into sched/core, to pick up fixes  (Ingo Molnar)

Signed-off-by: Ingo Molnar <mingo@kernel.org>

2016-09-28  mem-hotplug: use nodes that contain memory as mask in new_node_page()  (Li Zhong)

9bb627be47a5 ("mem-hotplug: don't clear the only node in
new_node_page()") prevents allocating from an empty nodemask, but as
David points out, it is still wrong. As node_online_map may include
memoryless nodes, only allocating from these nodes is meaningless.

This patch uses the node_states[N_MEMORY] mask to prevent the above
case.

Fixes: 9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1474447117.28370.6.camel@TP420
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-09-28  mm,ksm: fix endless looping in allocating memory when ksm enable  (zhong jiang)

I hit the following hung task when running an OOM LTP test case with a
4.1 kernel:

  Call trace:
  [<ffffffc000086a88>] __switch_to+0x74/0x8c
  [<ffffffc000a1bae0>] __schedule+0x23c/0x7bc
  [<ffffffc000a1c09c>] schedule+0x3c/0x94
  [<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350
  [<ffffffc000a1e32c>] down_write+0x64/0x80
  [<ffffffc00021f794>] __ksm_exit+0x90/0x19c
  [<ffffffc0000be650>] mmput+0x118/0x11c
  [<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74
  [<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4
  [<ffffffc0000d0f34>] get_signal+0x444/0x5e0
  [<ffffffc000089fcc>] do_signal+0x1d8/0x450
  [<ffffffc00008a35c>] do_notify_resume+0x70/0x78

The oom victim cannot terminate because it needs to take mmap_sem for
write while the lock is held by ksmd for read, which loops in the page
allocator:

  ksm_do_scan
    scan_get_next_rmap_item
      down_read
      get_next_rmap_item
        alloc_rmap_item		# ksmd will loop permanently

There is no way forward because the oom victim cannot release any
memory in a 4.1 based kernel. Since 4.6 we have the oom reaper, which
would solve this problem because it would release the memory
asynchronously. Nevertheless we can relax the alloc_rmap_item()
requirements and use __GFP_NORETRY, because the allocation failure is
acceptable: ksm_do_scan would just retry later after the lock got
dropped. Such a patch is also easy to backport to older stable kernels
which do not have the oom reaper.

While we are at it, add __GFP_NOWARN so the admin doesn't have to be
alarmed by the allocation failure.

Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Suggested-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

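A sketch of the relaxed allocation; kmem_cache_zalloc() and the cache
name are assumptions about alloc_rmap_item()'s internals:

    static inline struct rmap_item *alloc_rmap_item(void)
    {
        /* __GFP_NORETRY: fail instead of looping in the allocator
         * while ksmd holds mmap_sem for read - the scan retries later.
         * __GFP_NOWARN: the failure is expected, don't alarm the admin. */
        return kmem_cache_zalloc(rmap_item_cache,
                                 GFP_KERNEL | __GFP_NORETRY |
                                 __GFP_NOWARN);
    }
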
2016-09-27  fs: Replace CURRENT_TIME with current_time() for inode timestamps  (Deepa Dinamani)

The CURRENT_TIME macro is not appropriate for filesystems as it doesn't
use the right granularity for filesystem timestamps. Use current_time()
instead. CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions vfs
timestamps to use 64 bit time and hence make them y2038 safe. As part
of the effort current_time() will be extended to do range checks.
Hence, it is necessary for all file system timestamps to use
current_time(). Also, current_time() will be transitioned along with
vfs to be y2038 safe.

Note that whenever a single call to current_time() is used to change
timestamps in different inodes, it is because they share the same time
granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

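A typical before/after shape of this conversion, as a sketch (the
inode-update line is generic, not taken from a specific filesystem
touched by the patch):

    /* before: fixed coarse granularity, not y2038 safe */
    inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;

    /* after: granularity chosen per superblock, range-checkable */
    inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
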
2016-09-27fs: rename "rename2" i_op to "rename"Miklos Szeredi
Generated patch: sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2` sed -i "s/\brename2\b/rename/g" `git grep -wl rename2` Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-25  mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing  (Lorenzo Stoakes)

The NUMA balancing logic uses an arch-specific PROT_NONE page table
flag defined by pte_protnone() or pmd_protnone() to mark PTEs or huge
page PMDs respectively as requiring balancing upon a subsequent page
fault. User-defined PROT_NONE memory regions which also have this flag
set will not normally invoke the NUMA balancing code as do_page_fault()
will send a segfault to the process before handle_mm_fault() is even
called.

However if access_remote_vm() is invoked to access a PROT_NONE region
of memory, handle_mm_fault() is called via faultin_page() and
__get_user_pages() without any access checks being performed, meaning
the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
region.

A simple means of triggering this problem is to access PROT_NONE mmap'd
memory using /proc/self/mem, which reliably results in the NUMA
handling functions being invoked when CONFIG_NUMA_BALANCING is set.

This issue was reported in bugzilla (issue 99101), which includes some
simple repro code.

There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
added at commit c0e7cad to avoid accidentally provoking strange
behaviour by attempting to apply NUMA balancing to pages that are in
fact PROT_NONE. The BUG_ON()'s are consistently triggered by the repro.

This patch moves the PROT_NONE check into mm/memory.c rather than
invoking BUG_ON(), as faulting in these pages via faultin_page() is a
valid reason for reaching the NUMA check with the PROT_NONE page table
flag set and is therefore not always a bug.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101
Reported-by: Trevor Saunders <tbsaunde@tbsaunde.org>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

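A hedged sketch of the moved check, written against the older
handle_pte_fault() argument list for readability; reading "VMA flags"
as the access bits is our interpretation of the subject line:

    /* Only treat a PROT_NONE pte as a NUMA hint when the VMA itself is
     * accessible; a real PROT_NONE mapping reached via faultin_page()
     * must not enter the NUMA balancing paths. */
    if (pte_protnone(entry) &&
        (vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
        return do_numa_page(mm, vma, address, entry, pte, pmd);
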
2016-09-24  Merge branch 'hughd-fixes' (patches from Hugh Dickins)  (Linus Torvalds)

Merge VM fixes from Hugh Dickins:
 "I get the impression that Andrew is away or busy at the moment, so
  I'm going to send you three independent uncontroversial little mm
  fixes directly - though none is strictly a 4.8 regression fix.

   - shmem: fix tmpfs to handle the huge= option properly from Toshi
     Kani is a one-liner to fix a major embarrassment in 4.8's
     hugepages on tmpfs feature: although Hillf pointed it out in June,
     somehow both Kirill and I repeatedly dropped the ball on this one.
     You might wonder if the feature got tested at all with that bug
     in: yes, it did, but for wider testing coverage, Kirill and I had
     each relied too much on an override which bypasses that condition.

   - huge tmpfs: fix Committed_AS leak just a run-of-the-mill
     accounting fix in the same feature.

   - mm: delete unnecessary and unsafe init_tlb_ubc() is an unrelated
     fix to 4.3's TLB flush batching in reclaim: the bug would be rare,
     and none of us will be shamed if this one misses 4.8; but it got
     such a quick ack from Mel today that I'm inclined to offer it
     along with the first two"

* emailed patches from Hugh Dickins <hughd@google.com>:
  mm: delete unnecessary and unsafe init_tlb_ubc()
  huge tmpfs: fix Committed_AS leak
  shmem: fix tmpfs to handle the huge= option properly

2016-09-24  mm: delete unnecessary and unsafe init_tlb_ubc()  (Hugh Dickins)

init_tlb_ubc() looked unnecessary to me: tlb_ubc is statically
initialized with zeroes in the init_task, and copied from parent to
child while it is quiescent in arch_dup_task_struct(); so I went to
delete it.

But inserted temporary debug WARN_ONs in place of init_tlb_ubc() to
check that it was always empty at that point, and found them firing:
because memcg reclaim can recurse into global reclaim (when allocating
biosets for swapout in my case), and arrive back at the init_tlb_ubc()
in shrink_node_memcg().

Resetting tlb_ubc.flush_required at that point is wrong: if the upper
level needs a deferred TLB flush, but the lower level turns out not to,
we miss a TLB flush. But fortunately, that's the only part of the
protocol that does not nest: with the initialization removed, cpumask
collects bits from upper and lower levels, and flushes TLB when needed.

Fixes: 72b252aed506 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-09-24  huge tmpfs: fix Committed_AS leak  (Hugh Dickins)

Under swapping load on huge tmpfs, /proc/meminfo's Committed_AS grows
bigger and bigger: just a cosmetic issue for most users, but disabling
for those who run without overcommit (/proc/sys/vm/overcommit_memory 2).

shmem_uncharge() was forgetting to unaccount __vm_enough_memory's
charge, and shmem_charge() was forgetting it on the filesystem-full
error path.

Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-09-24  shmem: fix tmpfs to handle the huge= option properly  (Toshi Kani)

shmem_get_unmapped_area() checks SHMEM_SB(sb)->huge incorrectly, which
leads to a reversed effect of the "huge=" mount option. Fix the check
in shmem_get_unmapped_area().

Note, the default value of SHMEM_SB(sb)->huge remains SHMEM_HUGE_NEVER.
Users will need to specify the "huge=" option to enable huge page
mappings.

Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-09-22  Merge branch 'linus' into sched/core, to pick up fixes  (Ingo Molnar)

Signed-off-by: Ingo Molnar <mingo@kernel.org>

2016-09-22  fs: Give dentry to inode_change_ok() instead of inode  (Jan Kara)

inode_change_ok() will be responsible for clearing capabilities and IMA
extended attributes and as such will need the dentry. Give it as an
argument to inode_change_ok() instead of an inode. Also rename
inode_change_ok() to setattr_prepare() to better reflect that it also
performs some modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>

2016-09-20  mm: usercopy: Check for module addresses  (Laura Abbott)

While running a compile on arm64, I hit a memory exposure:

  usercopy: kernel memory exposure attempt detected from fffffc0000f3b1a8 (buffer_head) (1 bytes)
  ------------[ cut here ]------------
  kernel BUG at mm/usercopy.c:75!
  Internal error: Oops - BUG: 0 [#1] SMP
  Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6
   xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc
   ebtable_nat ip6table_security ip6table_raw ip6table_nat
   nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
   iptable_security iptable_raw iptable_nat nf_conntrack_ipv4
   nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
   ebtable_filter ebtables ip6table_filter ip6_tables vfat fat
   xgene_edac xgene_enet edac_core i2c_xgene_slimpro i2c_core at803x
   realtek xgene_dma mdio_xgene gpio_dwapb gpio_xgene_sb xgene_rng
   mailbox_xgene_slimpro nfsd auth_rpcgss nfs_acl lockd grace sunrpc
   xfs libcrc32c sdhci_of_arasan sdhci_pltfm sdhci mmc_core
   xhci_plat_hcd gpio_keys
  CPU: 0 PID: 19744 Comm: updatedb Tainted: G        W 4.8.0-rc3-threadinfo+ #1
  Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang Board, BIOS 3.06.12 Aug 12 2016
  task: fffffe03df944c00 task.stack: fffffe00d128c000
  PC is at __check_object_size+0x70/0x3f0
  LR is at __check_object_size+0x70/0x3f0
  ...
  [<fffffc00082b4280>] __check_object_size+0x70/0x3f0
  [<fffffc00082cdc30>] filldir64+0x158/0x1a0
  [<fffffc0000f327e8>] __fat_readdir+0x4a0/0x558 [fat]
  [<fffffc0000f328d4>] fat_readdir+0x34/0x40 [fat]
  [<fffffc00082cd8f8>] iterate_dir+0x190/0x1e0
  [<fffffc00082cde58>] SyS_getdents64+0x88/0x120
  [<fffffc0008082c70>] el0_svc_naked+0x24/0x28

fffffc0000f3b1a8 is a module address. Modules may have compiled-in
strings which could get copied to userspace. In this instance, it looks
like ".", which matches with a size of 1 byte. Extend the
is_vmalloc_addr() check to be is_vmalloc_or_module_addr() to cover all
possible cases.

Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>

2016-09-19mm: memcontrol: make per-cpu charge cache IRQ-safe for socket accountingJohannes Weiner
During cgroup2 rollout into production, we started encountering css refcount underflows and css access crashes in the memory controller. Splitting the heavily shared css reference counter into logical users narrowed the imbalance down to the cgroup2 socket memory accounting.

The problem turns out to be the per-cpu charge cache. Cgroup1 had a separate socket counter, but the new cgroup2 socket accounting goes through the common charge path that uses a shared per-cpu cache for all memory that is being tracked.

Those caches are safe against scheduling preemption, but not against interrupts - such as the newly added packet receive path. When cache draining is interrupted by network RX taking pages out of the cache, the resuming drain operation will put references of in-use pages, thus causing the imbalance.

Disable IRQs during all per-cpu charge cache operations.

Fixes: f7e1cb6ec51b ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
Link: http://lkml.kernel.org/r/20160914194846.11153-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org> [4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
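The fix pattern, as a sketch (drain_stock() and memcg_stock are the real names in mm/memcontrol.c, but the exact call sites and locking context differ):

    unsigned long flags;

    /* Network RX refills and consumes the same per-cpu cache from
     * softirq context, so exclude interrupts around every access.
     */
    local_irq_save(flags);
    drain_stock(this_cpu_ptr(&memcg_stock));
    local_irq_restore(flags);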
2016-09-19mm: fix the page_swap_info() BUG_ON checkSantosh Shilimkar
Commit 62c230bc1790 ("mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages") replaced the swap_aops dirty hook __set_page_dirty_no_writeback() with swap_set_page_dirty(). For normal cases without the special SWP flags, the code path falls back to __set_page_dirty_no_writeback(), so the behaviour is expected to be the same as before.

But swap_set_page_dirty() makes use of the page_swap_info() helper to get the swap_info_struct and check for flags like SWP_FILE, SWP_BLKDEV etc. as those features require. This helper has BUG_ON(!PageSwapCache(page)), which is racy and safe only for the set_page_dirty_lock() path. For the set_page_dirty() path, which is often called from irq context, kswapd can toggle the flag behind our back while the call is executing, when the system is low on memory and heavy swapping is ongoing. This ends up in an undesired kernel panic.

This patch just moves the check out of the helper and into its users as appropriate, to fix the kernel panic on the described path. A couple of the helper's users already take care of the SwapCache condition, so I skipped them.

Link: http://lkml.kernel.org/r/1473460718-31013-1-git-send-email-santosh.shilimkar@oracle.com
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joe Perches <joe@perches.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rik van Riel <riel@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jens Axboe <axboe@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org> [4.7.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
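A sketch of the relocated assertion (simplified; the real change touches several callers):

    struct swap_info_struct *sis;

    /* Assert in the caller, where the page lock keeps the
     * SwapCache flag stable, instead of inside page_swap_info().
     */
    VM_BUG_ON_PAGE(!PageSwapCache(page), page);
    sis = page_swap_info(page);    /* the helper no longer asserts */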
2016-09-19mm: avoid endless recursion in dump_page()Kirill A. Shutemov
dump_page() uses page_mapcount() to get the mapcount of the page. page_mapcount() has VM_BUG_ON_PAGE(PageSlab(page)), as mapcount doesn't make sense for slab pages and the field in struct page is reused for other information there. This leads to recursion if dump_page() is called for a slub page and DEBUG_VM is enabled:

  dump_page() -> page_mapcount() -> VM_BUG_ON_PAGE() -> dump_page() -> ...

Let's avoid calling page_mapcount() for slab pages in dump_page().

Link: http://lkml.kernel.org/r/20160908082137.131076-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
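The guard, sketched with an abbreviated format string (the real pr_emerg() in mm/debug.c prints more fields):

    pr_emerg("page:%p count:%d mapcount:%d\n",
             page, page_ref_count(page),
             PageSlab(page) ? 0 : page_mapcount(page));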
2016-09-19mm, thp: fix leaking mapped pte in __collapse_huge_page_swapin()Ebru Akagunduz
Currently, khugepaged does not permit swapin unless there are enough young pages in a THP. The problem is that when a THP does not have enough young pages, khugepaged bails out of __collapse_huge_page_swapin() and leaks mapped ptes. This patch prevents the leak of mapped ptes.

Link: http://lkml.kernel.org/r/1472820276-7831-1-git-send-email-ebru.akagunduz@gmail.com
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
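One plausible shape of the bailout fix, as a heavily hedged sketch (the real function in mm/khugepaged.c tracks more state than shown here):

    /* Bail out when too few ptes are young -- but unmap the pte
     * we still hold first, instead of leaking the mapping.
     */
    if (referenced < HPAGE_PMD_NR / 2) {
        if (pte)
            pte_unmap(pte);
        return false;
    }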
2016-09-19khugepaged: fix use-after-free in collapse_huge_page()Kirill A. Shutemov
hugepage_vma_revalidate() tries to re-check whether we should still try to collapse small pages into a huge one after re-acquiring mmap_sem.

The problem Dmitry Vyukov reported[1] is that the vma found by hugepage_vma_revalidate() can be suitable for huge pages, yet not be the same vma we had before dropping mmap_sem. And dereferencing the original vma can lead to fun results.

Let's use the vma that hugepage_vma_revalidate() found instead of assuming it's the same as the one we had before the lock was dropped.

[1] http://lkml.kernel.org/r/CACT4Y+Z3gigBvhca9kRJFcjX0G70V_nRhbwKBU+yGoESBDKi9Q@mail.gmail.com

Link: http://lkml.kernel.org/r/20160907122559.GA6542@black.fi.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
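A sketch of the interface change (simplified; the real function in mm/khugepaged.c also re-checks hugepage suitability, and SCAN_VMA_NULL is its existing failure code):

    static int hugepage_vma_revalidate(struct mm_struct *mm,
                                       unsigned long address,
                                       struct vm_area_struct **vmap)
    {
        struct vm_area_struct *vma = find_vma(mm, address);

        if (!vma || vma->vm_start > address)
            return SCAN_VMA_NULL;

        *vmap = vma;    /* callers use this, not the stale pointer */
        return 0;
    }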
2016-09-19mem-hotplug: don't clear the only node in new_node_page()Li Zhong
Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline") introduced new_node_page() for memory hotplug.

In new_node_page(), the nid is cleared before calling __alloc_pages_nodemask(). But if it is the only node in the system and the first-round allocation fails, we will not be able to get memory from an empty nodemask, and will trigger an OOM.

The patch checks whether it is the last node in the system, and if it is, doesn't clear the nid in the nodemask.

Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1473044391.4250.19.camel@TP420
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reported-by: John Allen <jallen@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
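The guard, as an illustrative sketch and not the literal diff (node_clear() and nodes_weight() are the standard nodemask.h helpers):

    /* Prefer other nodes for the relocated pages, but keep the
     * source node in the fallback mask when it is the only one.
     */
    if (nodes_weight(nmask) > 1)
        node_clear(nid, nmask);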
2016-09-14x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*Dmitry Safonov
Add an API to change the vdso blob type with arch_prctl. As this is useful only for the needs of CRIU, expose this interface under CONFIG_CHECKPOINT_RESTORE.

Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: 0x7f454c46@gmail.com
Cc: oleg@redhat.com
Cc: linux-mm@kvack.org
Cc: gorcunov@openvz.org
Cc: xemul@virtuozzo.com
Link: http://lkml.kernel.org/r/20160905133308.28234-4-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
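Hypothetical userspace usage, assuming an x86_64 build where <asm/prctl.h> provides the ARCH_MAP_VDSO_* constants (a sketch, not part of the patch itself):

    #include <asm/prctl.h>      /* ARCH_MAP_VDSO_64 */
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Ask the kernel to map the 64-bit vdso blob at addr, as a
     * restorer like CRIU would when recreating a task's layout.
     */
    static long map_vdso_64(unsigned long addr)
    {
        return syscall(SYS_arch_prctl, ARCH_MAP_VDSO_64, addr);
    }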
2016-09-13sched/numa, mm: Revert to checking pmd/pte_write instead of VMA flagsRik van Riel
Commit:

  4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")

changed NUMA balancing from _PAGE_NUMA to using PROT_NONE, and was quickly found to introduce a regression with NUMA grouping.

It was followed up by these commits:

  53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
  bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags")
  b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")

The first two of those commits try alternate approaches to NUMA grouping, which apparently do not work as well as looking at the PTE write permissions.

The third patch preserves the PTE write permissions across a NUMA protection fault. However, it forgets to revert the condition for whether or not to group tasks together back to what it was before v3.19, even though the information is now preserved in the page tables once again.

This patch brings the NUMA grouping heuristic back to what it was before commit 4d9424669946, which the changelogs of subsequent commits suggest worked best. We have all the information again. We should probably use it.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aarcange@redhat.com
Cc: linux-mm@kvack.org
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/20160908213053.07c992a9@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
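A sketch of the restored check in the pte hinting-fault path (the pmd path uses pmd_write() analogously; flag and helper names are the real ones):

    /* was: if (!(vma->vm_flags & VM_WRITE)) */
    if (!pte_write(pte))
        flags |= TNF_NO_GROUP;    /* read-only pte: don't group tasks */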
2016-09-13PM / Hibernate: allow hibernation with PAGE_POISONING_ZEROAnisse Astier
PAGE_POISONING_ZERO disables zeroing new pages on alloc; instead, they are poisoned (zeroed) as they are freed and become available. In the hibernate use case, free pages will appear in the system without being cleared, left there by the loading kernel.

This patch makes sure free pages are cleared on resume when PAGE_POISONING_ZERO is enabled. We clear the pages just after resume because we can't do it later: going through any device resume code might allocate some memory and invalidate the free pages bitmap.

Thus we don't need to disable hibernation when PAGE_POISONING_ZERO is enabled.

Signed-off-by: Anisse Astier <anisse@astier.eu>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
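The per-page step, as an illustrative sketch (the walk over the free-pages bitmap in kernel/power/snapshot.c is elided; IS_ENABLED() and clear_highpage() are the standard helpers):

    /* Re-zero a page from the free-pages bitmap on resume, so the
     * PAGE_POISONING_ZERO "free pages are zero" invariant holds again.
     */
    if (IS_ENABLED(CONFIG_PAGE_POISONING_ZERO))
        clear_highpage(page);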