summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2020-12-15mm: hugetlb: fix type of delta parameter and related local variables in ↵Liu Xiang
gather_surplus_pages() On 64-bit machine, delta variable in hugetlb_acct_memory() may be larger than 0xffffffff, but gather_surplus_pages() can only use the low 32-bit value now. So we need to fix type of delta parameter and related local variables in gather_surplus_pages(). Link: https://lkml.kernel.org/r/1605793733-3573-1-git-send-email-liu.xiang@zlingsmart.com Reported-by: Ma Chenggong <ma.chenggong@zlingsmart.com> Signed-off-by: Liu Xiang <liu.xiang@zlingsmart.com> Signed-off-by: Pan Jiagen <pan.jiagen@zlingsmart.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Liu Xiang <liuxiang_1999@126.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15khugepaged: add parameter explanations for kernel-doc markupAlex Shi
Add missed parameter explanation for some kernel-doc warnings: mm/khugepaged.c:102: warning: Function parameter or member 'nr_pte_mapped_thp' not described in 'mm_slot' mm/khugepaged.c:102: warning: Function parameter or member 'pte_mapped_thp' not described in 'mm_slot' mm/khugepaged.c:1424: warning: Function parameter or member 'mm' not described in 'collapse_pte_mapped_thp' mm/khugepaged.c:1424: warning: Function parameter or member 'addr' not described in 'collapse_pte_mapped_thp' mm/khugepaged.c:1626: warning: Function parameter or member 'mm' not described in 'collapse_file' mm/khugepaged.c:1626: warning: Function parameter or member 'file' not described in 'collapse_file' mm/khugepaged.c:1626: warning: Function parameter or member 'start' not described in 'collapse_file' mm/khugepaged.c:1626: warning: Function parameter or member 'hpage' not described in 'collapse_file' mm/khugepaged.c:1626: warning: Function parameter or member 'node' not described in 'collapse_file' Link: https://lkml.kernel.org/r/1605597325-25284-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/hugetlb.c: just use put_page_testzero() instead of page_count()Hui Su
We test the page reference count is zero or not here, it can be a bug here if page refercence count is not zero. So we can just use put_page_testzero() instead of page_count(). Link: https://lkml.kernel.org/r/20201007170949.GA6416@rlk Signed-off-by: Hui Su <sh_def@163.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm,hwpoison: return -EBUSY when migration failsOscar Salvador
Currently, we return -EIO when we fail to migrate the page. Migrations' failures are rather transient as they can happen due to several reasons, e.g: high page refcount bump, mapping->migrate_page failing etc. All meaning that at that time the page could not be migrated, but that has nothing to do with an EIO error. Let us return -EBUSY instead, as we do in case we failed to isolate the page. While are it, let us remove the "ret" print as its value does not change. Link: https://lkml.kernel.org/r/20201209092818.30417-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm,memory_failure: always pin the page in madvise_inject_errorOscar Salvador
madvise_inject_error() uses get_user_pages_fast to translate the address we specified to a page. After [1], we drop the extra reference count for memory_failure() path. That commit says that memory_failure wanted to keep the pin in order to take the page out of circulation. The truth is that we need to keep the page pinned, otherwise the page might be re-used after the put_page() and we can end up messing with someone else's memory. E.g: CPU0 process X CPU1 madvise_inject_error get_user_pages put_page page gets reclaimed process Y allocates the page memory_failure // We mess with process Y memory madvise() is meant to operate on a self address space, so messing with pages that do not belong to us seems the wrong thing to do. To avoid that, let us keep the page pinned for memory_failure as well. Pages for DAX mappings will release this extra refcount in memory_failure_dev_pagemap. [1] ("23e7b5c2e271: mm, madvise_inject_error: Let memory_failure() optionally take a page reference") Link: https://lkml.kernel.org/r/20201207094818.8518-1-osalvador@suse.de Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference") Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm,hwpoison: remove drain_all_pages from shake_pageOscar Salvador
get_hwpoison_page already drains pcplists, previously disabling them when trying to grab a refcount. We do not need shake_page to take care of it anymore. Link: https://lkml.kernel.org/r/20201204102558.31607-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Qian Cai <qcai@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm,hwpoison: disable pcplists before grabbing a refcountOscar Salvador
Currently, we have a sort of retry mechanism to make sure pages in pcp-lists are spilled to the buddy system, so we can handle those. We can save us this extra checks with the new disable-pcplist mechanism that is available with [1]. zone_pcplist_disable makes sure to 1) disable pcplists, so any page that is freed up from that point onwards will end up in the buddy system and 2) drain pcplists, so those pages that already in pcplists are spilled to buddy. With that, we can make a common entry point for grabbing a refcount from both soft_offline and memory_failure paths that is guarded by zone_pcplist_disable/zone_pcplist_enable. [1] https://patchwork.kernel.org/project/linux-mm/cover/20201111092812.11329-1-vbabka@suse.cz/ Link: https://lkml.kernel.org/r/20201204102558.31607-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Qian Cai <qcai@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm,hwpoison: refactor get_any_pageOscar Salvador
Patch series "HWPoison: Refactor get page interface", v2. This patch (of 3): When we want to grab a refcount via get_any_page, we call __get_any_page that calls get_hwpoison_page to get the actual refcount. get_any_page() is only there because we have a sort of retry mechanism in case the page we met is unknown to us or if we raced with an allocation. Also __get_any_page() prints some messages about the page type in case the page was a free page or the page type was unknown, but if anything, we only need to print a message in case the pagetype was unknown, as that is reporting an error down the chain. Let us merge get_any_page() and __get_any_page(), and let the message be printed in soft_offline_page. While we are it, we can also remove the 'pfn' parameter as it is no longer used. Link: https://lkml.kernel.org/r/20201204102558.31607-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20201204102558.31607-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Acked-by: Vlastimil Babka <Vbabka@suse.cz> Cc: Qian Cai <qcai@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm,hwpoison: drop unneeded pcplist drainingOscar Salvador
memory_failure and soft_offline_path paths now drain pcplists by calling get_hwpoison_page. memory_failure flags the page as HWPoison before, so that page cannot longer go into a pcplist, and soft_offline_page only flags a page as HWPoison if 1) we took the page off a buddy freelist 2) the page was in-use and we migrated it 3) was a clean pagecache. Because of that, a page cannot longer be poisoned and be in a pcplist. Link: https://lkml.kernel.org/r/20201013144447.6706-5-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm,hwpoison: take free pages off the buddy freelistsOscar Salvador
The crux of the matter is that historically we left poisoned pages in the buddy system because we have some checks in place when allocating a page that are gatekeeper for poisoned pages. Unfortunately, we do have other users (e.g: compaction [1]) that scan buddy freelists and try to get a page from there without checking whether the page is HWPoison. As I stated already, I think it is fundamentally wrong to keep HWPoison pages within the buddy systems, checks in place or not. Let us fix this the same way we did for soft_offline [2], taking the page off the buddy freelist so it is completely unreachable. Note that this is fairly simple to trigger, as we only need to poison free buddy pages (madvise MADV_HWPOISON) and then run some sort of memory stress system. Just for a matter of reference, I put a dump_page() in compaction_alloc() to trigger for HWPoison patches: page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db flags: 0xfffffc0800000(hwpoison) raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000 raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: compaction_alloc CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G E 5.9.0-rc2-mm1-1-default+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0x6d/0x8b compaction_alloc+0xb2/0xc0 migrate_pages+0x2a6/0x12a0 compact_zone+0x5eb/0x11c0 proactive_compact_node+0x89/0xf0 kcompactd+0x2d0/0x3a0 kthread+0x118/0x130 ret_from_fork+0x22/0x30 After that, if e.g: a process faults in the page, it will get killed unexpectedly. Fix it by containing the page immediatelly. Besides that, two more changes can be noticed: * MF_DELAYED no longer suits as we are fixing the issue by containing the page immediately, so it does no longer rely on the allocation-time checks to stop HWPoison to be handed over. gain unless it is unpoisoned, so we fixed the situation. Because of that, let us use MF_RECOVERED from now on. * The second block that handles PageBuddy pages is no longer needed: We call shake_page and then check whether the page is Buddy because shake_page calls drain_all_pages, which sends pcp-pages back to the buddy freelists, so we could have a chance to handle free pages. Currently, get_hwpoison_page already calls drain_all_pages, and we call get_hwpoison_page right before coming here, so we should be on the safe side. [1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u [2] https://patchwork.kernel.org/cover/11792607/ [osalvador@suse.de: take the poisoned subpage off the buddy frelists] Link: https://lkml.kernel.org/r/20201013144447.6706-4-osalvador@suse.de Link: https://lkml.kernel.org/r/20201013144447.6706-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount pageOscar Salvador
Patch series "HWpoison: further fixes and cleanups", v5. This patchset includes some more fixes and a cleanup. Patch#2 and patch#3 are both fixes for taking a HWpoison page off a buddy freelist, since having them there has proved to be bad (see [1] and pathch#2's commit log). Patch#3 does the same for hugetlb pages. [1] https://lkml.org/lkml/2020/9/22/565 This patch (of 4): A page with 0-refcount and !PageBuddy could perfectly be a pcppage. Currently, we bail out with an error if we encounter such a page, meaning that we do not handle pcppages neither from hard-offline nor from soft-offline path. Fix this by draining pcplists whenever we find this kind of page and retry the check again. It might be that pcplists have been spilled into the buddy allocator and so we can handle it. Link: https://lkml.kernel.org/r/20201013144447.6706-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20201013144447.6706-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/page_alloc: speed up the iteration of max_orderMuchun Song
When we free a page whose order is very close to MAX_ORDER and greater than pageblock_order, it wastes some CPU cycles to increase max_order to MAX_ORDER one by one and check the pageblock migratetype of that page repeatedly especially when MAX_ORDER is much larger than pageblock_order. We also should not be checking migratetype of buddy when "order == MAX_ORDER - 1" as the buddy pfn may be invalid, so adjust the condition. With the new check, we don't need the max_order check anymore, so we replace it. Also adjust max_order initialization so that it's lower by one than previously, which makes the code hopefully more clear. Link: https://lkml.kernel.org/r/20201204155109.55451-1-songmuchun@bytedance.com Fixes: d9dddbf55667 ("mm/page_alloc: prevent merging between isolated and other pageblocks") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: page_alloc: refactor setup_per_zone_lowmem_reserve()Lorenzo Stoakes
setup_per_zone_lowmem_reserve() iterates through each zone setting zone->lowmem_reserve[j] = 0 (where j is the zone's index) then iterates backwards through all preceding zones, setting lower_zone->lowmem_reserve[j] = sum(managed pages of higher zones) / lowmem_reserve_ratio[idx] for each (where idx is the lower zone's index). If the lower zone has no managed pages or its ratio is 0 then all of its lowmem_reserve[] entries are effectively zeroed. As these arrays are only assigned here and all lowmem_reserve[] entries for index < this zone's index are implicitly assumed to be 0 (as these are specifically output in show_free_areas() and zoneinfo_show_print() for example) there is no need to additionally zero index == this zone's index too. This patch avoids zeroing unnecessarily. Rather than iterating through zones and setting lowmem_reserve[j] for each lower zone this patch reverse the process and populates each zone's lowmem_reserve[] values in ascending order. This clarifies what is going on especially in the case of zero managed pages or ratio which is now explicitly shown to clear these values. Link: https://lkml.kernel.org/r/20201129162758.115907-1-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15init/main: fix broken buffer_init when DEFERRED_STRUCT_PAGE_INIT setLin Feng
In the booting phase if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set, we have following callchain: start_kernel ... mm_init mem_init memblock_free_all reset_all_zones_managed_pages free_low_memory_core_early ... buffer_init nr_free_buffer_pages zone->managed_pages ... rest_init kernel_init kernel_init_freeable page_alloc_init_late kthread_run(deferred_init_memmap, NODE_DATA(nid), "pgdatinit%d", nid); wait_for_completion(&pgdat_init_all_done_comp); ... files_maxfiles_init It's clear that buffer_init depends on zone->managed_pages, but it's reset in reset_all_zones_managed_pages after that pages are readded into zone->managed_pages, but when buffer_init runs this process is half done and most of them will finally be added till deferred_init_memmap done. In large memory couting of nr_free_buffer_pages drifts too much, also drifting from kernels to kernels on same hardware. Fix is simple, it delays buffer_init run till deferred_init_memmap all done. But as corrected by this patch, max_buffer_heads becomes very large, the value is roughly as many as 4 times of totalram_pages, formula: max_buffer_heads = nrpages * (10%) * (PAGE_SIZE / sizeof(struct buffer_head)); Say in a 64GB memory box we have 16777216 pages, then max_buffer_heads turns out to be roughly 67,108,864. In common cases, should a buffer_head be mapped to one page/block(4KB)? So max_buffer_heads never exceeds totalram_pages. IMO it's likely to make buffer_heads_over_limit bool value alwasy false, then make codes 'if (buffer_heads_over_limit)' test in vmscan unnecessary. So this patch will change the original behavior related to buffer_heads_over_limit in vmscan since we used a half done value of zone->managed_pages before, or should we use a smaller factor(<10%) in previous formula. akpm: I think this is OK - the max_buffer_heads code is only needed on highmem machines, to prevent ZONE_NORMAL from being consumed by large amounts of buffer_heads attached to highmem pagecache. This problem will not occur on 64-bit machines, so this feature's non-functionality on such machines is a feature, not a bug. Link: https://lkml.kernel.org/r/20201123110500.103523-1-linf@wangsu.com Signed-off-by: Lin Feng <linf@wangsu.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/page_alloc: clear all pages in post_alloc_hook() with init_on_alloc=1David Hildenbrand
commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") resulted with init_on_alloc=1 in all pages leaving the buddy via alloc_pages() and friends to be initialized/cleared/zeroed on allocation. However, the same logic is currently not applied to alloc_contig_pages(): allocated pages leaving the buddy aren't cleared with init_on_alloc=1 and init_on_free=0. Let's also properly clear pages on that allocation path. To achieve that, let's move clearing into post_alloc_hook(). This will not only affect alloc_contig_pages() allocations but also any pages used as migration target in compaction code via compaction_alloc(). While this sounds sub-optimal, it's the very same handling as when allocating migration targets via alloc_migration_target() - pages will get properly cleared with init_on_free=1. In case we ever want to optimize migration in that regard, we should tackle all such migration users - if we believe migration code can be fully trusted. With this change, we will see double clearing of pages in some cases. One example are gigantic pages (either allocated via CMA, or allocated dynamically via alloc_contig_pages()) - which is the right thing to do (and to be optimized outside of the buddy in the callers) as discussed in: https://lkml.kernel.org/r/20201019182853.7467-1-gpiccoli@canonical.com This change implies that with init_on_alloc=1 - All CMA allocations will be cleared - Gigantic pages allocated via alloc_contig_pages() will be cleared - virtio-mem memory to be unplugged will be cleared. While this is suboptimal, it's similar to memory balloon drivers handling, where all pages to be inflated will get cleared as well. - Pages isolated for compaction will be cleared Link: https://lkml.kernel.org/r/20201120180452.19071-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Kees Cook <keescook@chromium.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/page_alloc: mark some symbols with static keywordZou Wei
Fix the following sparse warnings: mm/page_alloc.c:3040:6: warning: symbol '__drain_all_pages' was not declared. Should it be static? mm/page_alloc.c:6349:6: warning: symbol '__zone_set_pageset_high_and_batch' was not declared. Should it be static? Link: https://lkml.kernel.org/r/1605517365-65858-1-git-send-email-zou_wei@huawei.com Signed-off-by: Zou Wei <zou_wei@huawei.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/page_alloc: add __free_pages() documentationMatthew Wilcox (Oracle)
Provide some guidance towards when this might not be the right interface to use. Link: https://lkml.kernel.org/r/20201027025523.3235-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm, page_alloc: disable pcplists during memory offlineVlastimil Babka
Memory offlining relies on page isolation to guarantee a forward progress because pages cannot be reused while they are isolated. But the page isolation itself doesn't prevent from races while freed pages are stored on pcp lists and thus can be reused. This can be worked around by repeated draining of pcplists, as done by commit 968318261221 ("mm/memory_hotplug: drain per-cpu pages again during memory offline"). David and Michal would prefer that this race was closed in a way that callers of page isolation who need stronger guarantees don't need to repeatedly drain. David suggested disabling pcplists usage completely during page isolation, instead of repeatedly draining them. To achieve this without adding special cases in alloc/free fastpath, we can use the same approach as boot pagesets - when pcp->high is 0, any pcplist addition will be immediately flushed. The race can thus be closed by setting pcp->high to 0 and draining pcplists once, before calling start_isolate_page_range(). The draining will serialize after processes that already disabled interrupts and read the old value of pcp->high in free_unref_page_commit(), and processes that have not yet disabled interrupts, will observe pcp->high == 0 when they are rescheduled, and skip pcplists. This guarantees no stray pages on pcplists in zones where isolation happens. This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that page isolation users can call before start_isolate_page_range() and after unisolating (or offlining) the isolated pages. Also, drain_all_pages() is optimized to only execute on cpus where pcplists are not empty. The check can however race with a free to pcplist that has not yet increased the pcp->count from 0 to 1. Thus make the drain optionally skip the racy check and drain on all cpus, and use this option in zone_pcp_disable(). As we have to avoid external updates to high and batch while pcplists are disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in zone_pcp_enable(). This also synchronizes multiple users of zone_pcp_disable()/enable(). Currently the only user of this functionality is offline_pages(). [vbabka@suse.cz: add comment, per David] Link: https://lkml.kernel.org/r/527480ef-ed72-e1c1-52a0-1c5b0113df45@suse.cz Link: https://lkml.kernel.org/r/20201111092812.11329-8-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm, page_alloc: move draining pcplists to page isolation usersVlastimil Babka
Currently, pcplists are drained during set_migratetype_isolate() which means once per pageblock processed start_isolate_page_range(). This is somewhat wasteful. Moreover, the callers might need different guarantees, and the draining is currently prone to races and does not guarantee that no page from isolated pageblock will end up on the pcplist after the drain. Better guarantees are added by later patches and require explicit actions by page isolation users that need them. Thus it makes sense to move the current imperfect draining to the callers also as a preparation step. Link: https://lkml.kernel.org/r/20201111092812.11329-7-vbabka@suse.cz Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm, page_alloc: cache pageset high and batch in struct zoneVlastimil Babka
All per-cpu pagesets for a zone use the same high and batch values, that are duplicated there just for performance (locality) reasons. This patch adds the same variables also to struct zone as a shared copy. This will be useful later for making possible to disable pcplists temporarily by setting high value to 0, while remembering the values for restoring them later. But we can also immediately benefit from not updating pagesets of all possible cpus in case the newly recalculated values (after sysctl change or memory online/offline) are actually unchanged from the previous ones. Link: https://lkml.kernel.org/r/20201111092812.11329-6-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm, page_alloc: simplify pageset_update()Vlastimil Babka
pageset_update() attempts to update pcplist's high and batch values in a way that readers don't observe batch > high. It uses smp_wmb() to order the updates in a way to achieve this. However, without proper pairing read barriers in readers this guarantee doesn't hold, and there are no such barriers in e.g. free_unref_page_commit(). Commit 88e8ac11d2ea ("mm, page_alloc: fix core hung in free_pcppages_bulk()") already showed this is problematic, and solved this by ultimately only trusing pcp->count of the current cpu with interrupts disabled. The update dance with unpaired write barriers thus makes no sense. Replace them with plain WRITE_ONCE to prevent store tearing, and document that the values can change asynchronously and should not be trusted for correctness. All current readers appear to be OK after 88e8ac11d2ea. Convert them to READ_ONCE to prevent unnecessary read tearing, but mainly to alert anybody making future changes to the code that special care is needed. Link: https://lkml.kernel.org/r/20201111092812.11329-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm, page_alloc: remove setup_pageset()Vlastimil Babka
We initialize boot-time pagesets with setup_pageset(), which sets high and batch values that effectively disable pcplists. We can remove this wrapper if we just set these values for all pagesets in pageset_init(). Non-boot pagesets then subsequently update them to the proper values. No functional change. Link: https://lkml.kernel.org/r/20201111092812.11329-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm, page_alloc: calculate pageset high and batch once per zoneVlastimil Babka
We currently call pageset_set_high_and_batch() for each possible cpu, which repeats the same calculations of high and batch values. Instead call the function just once per zone, and make it apply the calculated values to all per-cpu pagesets of the zone. This also allows removing the zone_pageset_init() and __zone_pcp_update() wrappers. No functional change. Link: https://lkml.kernel.org/r/20201111092812.11329-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm, page_alloc: clean up pageset high and batch updateVlastimil Babka
Patch series "disable pcplists during memory offline", v3. As per the discussions [1] [2] this is an attempt to implement David's suggestion that page isolation should disable pcplists to avoid races with page freeing in progress. This is done without extra checks in fast paths, as explained in Patch 9. The repeated draining done by [2] is then no longer needed. Previous version (RFC) is at [3]. The RFC tried to hide pcplists disabling/enabling into page isolation, but it wasn't completely possible, as memory offline does not unisolation. Michal suggested an explicit API in [4] so that's the current implementation and it seems indeed nicer. Once we accept that page isolation users need to do explicit actions around it depending on the needed guarantees, we can also IMHO accept that the current pcplist draining can be also done by the callers, which is more effective. After all, there are only two users of page isolation. So patch 6 does effectively the same thing as Pavel proposed in [5], and patch 7 implement stronger guarantees only for memory offline. If CMA decides to opt-in to the stronger guarantee, it can be added later. Patches 1-5 are preparatory cleanups for pcplist disabling. Patchset was briefly tested in QEMU so that memory online/offline works, but I haven't done a stress test that would prove the race fixed by [2] is eliminated. Note that patch 7 could be avoided if we instead adjusted page freeing in shown in [6], but I believe the current implementation of disabling pcplists is not too much complex, so I would prefer this instead of adding new checks and longer irq-disabled section into page freeing hotpaths. [1] https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com/ [2] https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatashin@soleen.com/ [3] https://lore.kernel.org/linux-mm/20200907163628.26495-1-vbabka@suse.cz/ [4] https://lore.kernel.org/linux-mm/20200909113647.GG7348@dhcp22.suse.cz/ [5] https://lore.kernel.org/linux-mm/20200904151448.100489-3-pasha.tatashin@soleen.com/ [6] https://lore.kernel.org/linux-mm/3d3b53db-aeaa-ff24-260b-36427fac9b1c@suse.cz/ [7] https://lore.kernel.org/linux-mm/20200922143712.12048-1-vbabka@suse.cz/ [8] https://lore.kernel.org/linux-mm/20201008114201.18824-1-vbabka@suse.cz/ This patch (of 7): The updates to pcplists' high and batch values are handled by multiple functions that make the calculations hard to follow. Consolidate everything to pageset_set_high_and_batch() and remove pageset_set_batch() and pageset_set_high() wrappers. The only special case using one of the removed wrappers was: build_all_zonelists_init() setup_pageset() pageset_set_batch() which was hardcoding batch as 0, so we can just open-code a call to pageset_update() with constant parameters instead. No functional change. Link: https://lkml.kernel.org/r/20201111092812.11329-1-vbabka@suse.cz Link: https://lkml.kernel.org/r/20201111092812.11329-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: introduce debug_pagealloc_{map,unmap}_pages() helpersMike Rapoport
Patch series "arch, mm: improve robustness of direct map manipulation", v7. During recent discussion about KVM protected memory, David raised a concern about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC scope [1]. Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is possible that __kernel_map_pages() would fail, but since this function is void, the failure will go unnoticed. Moreover, there's lack of consistency of __kernel_map_pages() semantics across architectures as some guard this function with #ifdef DEBUG_PAGEALLOC, some refuse to update the direct map if page allocation debugging is disabled at run time and some allow modifying the direct map regardless of DEBUG_PAGEALLOC settings. This set straightens this out by restoring dependency of __kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites accordingly. Since currently the only user of __kernel_map_pages() outside DEBUG_PAGEALLOC is hibernation, it is updated to make direct map accesses there more explicit. [1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a40348269d@redhat.com This patch (of 4): When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel direct mapping after free_pages(). The pages than need to be mapped back before they could be used. Theese mapping operations use __kernel_map_pages() guarded with with debug_pagealloc_enabled(). The only place that calls __kernel_map_pages() without checking whether DEBUG_PAGEALLOC is enabled is the hibernation code that presumes availability of this function when ARCH_HAS_SET_DIRECT_MAP is set. Still, on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is not enabled but set_direct_map_invalid_noflush() may render some pages not present in the direct map and hibernation code won't be able to save such pages. To make page allocation debugging and hibernation interaction more robust, the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be made more explicit. Start with combining the guard condition and the call to __kernel_map_pages() into debug_pagealloc_map_pages() and debug_pagealloc_unmap_pages() functions to emphasize that __kernel_map_pages() should not be called without DEBUG_PAGEALLOC and use these new functions to map/unmap pages when page allocation debugging is enabled. Link: https://lkml.kernel.org/r/20201109192128.960-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20201109192128.960-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Len Brown <len.brown@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15arm, arm64: move free_unused_memmap() to generic mmMike Rapoport
ARM and ARM64 free unused parts of the memory map just before the initialization of the page allocator. To allow holes in the memory map both architectures overload pfn_valid() and define HAVE_ARCH_PFN_VALID. Allowing holes in the memory map for FLATMEM may be useful for small machines, such as ARC and m68k and will enable those architectures to cease using DISCONTIGMEM and still support more than one memory bank. Move the functions that free unused memory map to generic mm and enable them in case HAVE_ARCH_PFN_VALID=y. Link: https://lkml.kernel.org/r/20201101170454.9567-10-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Meelis Roos <mroos@linux.ee> Cc: Michael Schmitz <schmitzmic@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODELMike Rapoport
ARM is the only architecture that defines CONFIG_ARCH_HAS_HOLES_MEMORYMODEL which in turn enables memmap_valid_within() function that is intended to verify existence of struct page associated with a pfn when there are holes in the memory map. However, the ARCH_HAS_HOLES_MEMORYMODEL also enables HAVE_ARCH_PFN_VALID and arch-specific pfn_valid() implementation that also deals with the holes in the memory map. The only two users of memmap_valid_within() call this function after a call to pfn_valid() so the memmap_valid_within() check becomes redundant. Remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and memmap_valid_within() and rely entirely on ARM's implementation of pfn_valid() that is now enabled unconditionally. Link: https://lkml.kernel.org/r/20201101170454.9567-9-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Meelis Roos <mroos@linux.ee> Cc: Michael Schmitz <schmitzmic@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15ia64: remove custom __early_pfn_to_nid()Mike Rapoport
The ia64 implementation of __early_pfn_to_nid() essentially relies on the same data as the generic implementation. The correspondence between memory ranges and nodes is set in memblock during early memory initialization in register_active_ranges() function. The initialization of sparsemem that requires early_pfn_to_nid() happens later and it can use the memblock information like the other architectures. Link: https://lkml.kernel.org/r/20201101170454.9567-3-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matt Turner <mattst88@gmail.com> Cc: Meelis Roos <mroos@linux.ee> Cc: Michael Schmitz <schmitzmic@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15kasan: print workqueue stackWalter Wu
The aux_stack[2] is reused to record the call_rcu() call stack and enqueuing work call stacks. So that we need to change the auxiliary stack title for common title, print them in KASAN report. Link: https://lkml.kernel.org/r/20201203022715.30635-1-walter-zh.wu@mediatek.com Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com> Suggested-by: Marco Elver <elver@google.com> Acked-by: Marco Elver <elver@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/vmalloc.c: fix kasan shadow poisoning sizeVincenzo Frascino
The size of vm area can be affected by the presence or not of the guard page. In particular when VM_NO_GUARD is present, the actual accessible size has to be considered like the real size minus the guard page. Currently kasan does not keep into account this information during the poison operation and in particular tries to poison the guard page as well. This approach, even if incorrect, does not cause an issue because the tags for the guard page are written in the shadow memory. With the future introduction of the Tag-Based KASAN, being the guard page inaccessible by nature, the write tag operation on this page triggers a fault. Fix kasan shadow poisoning size invoking get_vm_area_size() instead of accessing directly the field in the data structure to detect the correct value. Link: https://lkml.kernel.org/r/20201027160213.32904-1-vincenzo.frascino@arm.com Fixes: d98c9e83b5e7c ("kasan: fix crashes on access to memory mapped by vm_map_ram()") Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/vmalloc: Fix unlock order in s_stop()Waiman Long
When multiple locks are acquired, they should be released in reverse order. For s_start() and s_stop() in mm/vmalloc.c, that is not the case. s_start: mutex_lock(&vmap_purge_lock); spin_lock(&vmap_area_lock); s_stop : mutex_unlock(&vmap_purge_lock); spin_unlock(&vmap_area_lock); This unlock sequence, though allowed, is not optimal. If a waiter is present, mutex_unlock() will need to go through the slowpath of waking up the waiter with preemption disabled. Fix that by releasing the spinlock first before the mutex. Link: https://lkml.kernel.org/r/20201213180843.16938-1-longman@redhat.com Fixes: e36176be1c39 ("mm/vmalloc: rework vmap_area_lock") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/vmalloc.c: remove unnecessary return statementBaolin Wang
Remove unnecessary return statement for void function. Link: https://lkml.kernel.org/r/ca23f89259c80c3562700ae6e227b2815a195853.1606891153.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/vmalloc: add 'align' parameter explanation for pvm_determine_end_from_reverseAlex Shi
Kernel-doc markup has a issue on pvm_determine_end_from_reverse: mm/vmalloc.c:3145: warning: Function parameter or member 'align' not described in 'pvm_determine_end_from_reverse' Add a explanation for it to remove the warning. Link: https://lkml.kernel.org/r/1605605088-30668-3-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/vmalloc: rework the drain logicUladzislau Rezki (Sony)
A current "lazy drain" model suffers from at least two issues. First one is related to the unsorted list of vmap areas, thus in order to identify the [min:max] range of areas to be drained, it requires a full list scan. What is a time consuming if the list is too long. Second one and as a next step is about merging all fragments with a free space. What is also a time consuming because it has to iterate over entire list which holds outstanding lazy areas. See below the "preemptirqsoff" tracer that illustrates a high latency. It is ~24676us. Our workloads like audio and video are effected by such long latency: <snip> tracer: preemptirqsoff preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+ -------------------------------------------------------------------- latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8) ----------------- | task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16) ----------------- => started at: __purge_vmap_area_lazy => ended at: __purge_vmap_area_lazy _------=> CPU# / _-----=> irqs-off | / _----=> need-resched || / _---=> hardirq/softirq ||| / _--=> preempt-depth |||| / delay cmd pid ||||| time | caller \ / ||||| \ | / crtc_com-261 1...1 1us*: _raw_spin_lock <-__purge_vmap_area_lazy [...] crtc_com-261 1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy crtc_com-261 1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy crtc_com-261 1...1 24683us : <stack trace> => free_vmap_area_noflush => remove_vm_area => __vunmap => vfree => drm_property_free_blob => drm_mode_object_unreference => drm_property_unreference_blob => __drm_atomic_helper_crtc_destroy_state => sde_crtc_destroy_state => drm_atomic_state_default_clear => drm_atomic_state_clear => drm_atomic_state_free => complete_commit => _msm_drm_commit_work_cb => kthread_worker_fn => kthread => ret_from_fork <snip> To address those two issues we can redesign a purging of the outstanding lazy areas. Instead of queuing vmap areas to the list, we replace it by the separate rb-tree. In hat case an area is located in the tree/list in ascending order. It will give us below advantages: a) Outstanding vmap areas are merged creating bigger coalesced blocks, thus it becomes less fragmented. b) It is possible to calculate a flush range [min:max] without scanning all elements. It is O(1) access time or complexity; c) The final merge of areas with the rb-tree that represents a free space is faster because of (a). As a result the lock contention is also reduced. Link: https://lkml.kernel.org/r/20201116220033.1837-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Minchan Kim <minchan@kernel.org> Cc: huang ying <huang.ying.caritas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/vmalloc: use free_vm_area() if an allocation failsUladzislau Rezki (Sony)
There is a dedicated and separate function that finds and removes a continuous kernel virtual area. As a final step it also releases the "area", a descriptor of corresponding vm_struct. Use free_vmap_area() in the __vmalloc_node_range() instead of open coded steps which are exactly the same, to perform a cleanup. Link: https://lkml.kernel.org/r/20201116220033.1837-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/vmalloc.c:__vmalloc_area_node(): avoid 32-bit overflowAndrew Morton
With a machine with 3 TB (more than 2 TB memory). If you use vmalloc to allocate > 2 TB memory, the array_size below will be overflowed. The array_size is an unsigned int and can only be used to allocate less than 2 TB memory. If you pass 2*1028*1028*1024*1024 = 2 * 2^40 in the argument of vmalloc. The array_size will become 2*2^31 = 2^32. The 2^32 cannot be store with a 32 bit integer. The fix is to change the type of array_size to unsigned long. [akpm@linux-foundation.org: rework for current mainline] Link: https://bugzilla.kernel.org/show_bug.cgi?id=210023 Reported-by: <hsinhuiwu@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: extract might_alloc() debug checkDaniel Vetter
Extracted from slab.h, which seems to have the most complete version including the correct might_sleep() check. Roll it out to slob.c. Motivated by a discussion with Paul about possibly changing call_rcu behaviour to allocate memory, but only roughly every 500th call. There are a lot fewer places in the kernel that care about whether allocating memory is allowed or not (due to deadlocks with reclaim code) than places that care whether sleeping is allowed. But debugging these also tends to be a lot harder, so nice descriptive checks could come in handy. I might have some use eventually for annotations in drivers/gpu. Note that unlike fs_reclaim_acquire/release gfpflags_allow_blocking does not consult the PF_MEMALLOC flags. But there is no flag equivalent for GFP_NOWAIT, hence this check can't go wrong due to memalloc_no*_save/restore contexts. Willy is working on a patch series which might change this: https://lore.kernel.org/linux-mm/20200625113122.7540-7-willy@infradead.org/ I think best would be if that updates gfpflags_allow_blocking(), since there's a ton of callers all over the place for that already. Link: https://lkml.kernel.org/r/20201125162532.1299794-3-daniel.vetter@ffwll.ch Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Michel Lespinasse <walken@google.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Waiman Long <longman@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Qian Cai <cai@lca.pw> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Christian König <christian.koenig@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: track mmu notifiers in fs_reclaim_acquire/releaseDaniel Vetter
fs_reclaim_acquire/release nicely catch recursion issues when allocating GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep the excessive caches in check). For mmu notifier recursions we do have lockdep annotations since 23b68395c7c7 ("mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end"). But these only fire if a path actually results in some pte invalidation - for most small allocations that's very rarely the case. The other trouble is that pte invalidation can happen any time when __GFP_RECLAIM is set. Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good enough to avoid potential mmu notifier recursion. I was pondering whether we should just do the general annotation, but there's always the risk for false positives. Plus I'm assuming that the core fs and io code is a lot better reviewed and tested than random mmu notifier code in drivers. Hence why I decide to only annotate for that specific case. Furthermore even if we'd create a lockdep map for direct reclaim, we'd still need to explicit pull in the mmu notifier map - there's a lot more places that do pte invalidation than just direct reclaim, these two contexts arent the same. Note that the mmu notifiers needing their own independent lockdep map is also the reason we can't hold them from fs_reclaim_acquire to fs_reclaim_release - it would nest with the acquistion in the pte invalidation code, causing a lockdep splat. And we can't remove the annotations from pte invalidation and all the other places since they're called from many other places than page reclaim. Hence we can only do the equivalent of might_lock, but on the raw lockdep map. With this we can also remove the lockdep priming added in 66204f1d2d1b ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly more powerful. Link: https://lkml.kernel.org/r/20201125162532.1299794-2-daniel.vetter@ffwll.ch Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michel Lespinasse <walken@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: forbid splitting special mappingsDmitry Safonov
Don't allow splitting of vm_special_mapping's. It affects vdso/vvar areas. Uprobes have only one page in xol_area so they aren't affected. Those restrictions were enforced by checks in .mremap() callbacks. Restrict resizing with generic .split() callback. Link: https://lkml.kernel.org/r/20201013013416.390574-7-dima@arista.com Signed-off-by: Dmitry Safonov <dima@arista.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mremap: check if it's possible to split original vmaDmitry Safonov
If original VMA can't be split at the desired address, do_munmap() will fail and leave both new-copied VMA and old VMA. De-facto it's MREMAP_DONTUNMAP behaviour, which is unexpected. Currently, it may fail such way for hugetlbfs and dax device mappings. Minimize such unpleasant situations to OOM by checking .may_split() before attempting to create a VMA copy. Link: https://lkml.kernel.org/r/20201013013416.390574-6-dima@arista.com Signed-off-by: Dmitry Safonov <dima@arista.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15vm_ops: rename .split() callback to .may_split()Dmitry Safonov
Rename the callback to reflect that it's not called *on* or *after* split, but rather some time before the splitting to check if it's possible. Link: https://lkml.kernel.org/r/20201013013416.390574-5-dima@arista.com Signed-off-by: Dmitry Safonov <dima@arista.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aioDmitry Safonov
As kernel expect to see only one of such mappings, any further operations on the VMA-copy may be unexpected by the kernel. Maybe it's being on the safe side, but there doesn't seem to be any expected use-case for this, so restrict it now. Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com Fixes: commit e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()") Signed-off-by: Dmitry Safonov <dima@arista.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/mremap: for MREMAP_DONTUNMAP check security_vm_enough_memory_mm()Dmitry Safonov
Currently memory is accounted post-mremap() with MREMAP_DONTUNMAP, which may break overcommit policy. So, check if there's enough memory before doing actual VMA copy. Don't unset VM_ACCOUNT on MREMAP_DONTUNMAP. By semantics, such mremap() is actually a memory allocation. That also simplifies the error-path a little. Also, as it's memory allocation on success don't reset hiwater_vm value. Link: https://lkml.kernel.org/r/20201013013416.390574-3-dima@arista.com Fixes: commit e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()") Signed-off-by: Dmitry Safonov <dima@arista.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/mremap: account memory on do_munmap() failureDmitry Safonov
Patch series "mremap: move_vma() fixes". This patch (of 6): move_vma() copies VMA without adding it to account, then unmaps old part of VMA. On failure it unmaps the new VMA. With hacks accounting in munmap is disabled as it's a copy of existing VMA. Account the memory on munmap() failure which was previously copied into a new VMA. Link: https://lkml.kernel.org/r/20201013013416.390574-1-dima@arista.com Link: https://lkml.kernel.org/r/20201013013416.390574-2-dima@arista.com Fixes: commit e2ea83742133 ("[PATCH] mremap: move_vma fixes and cleanup") Signed-off-by: Dmitry Safonov <dima@arista.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: move free_unref_page to mm/internal.hMatthew Wilcox (Oracle)
Code outside mm/ should not be calling free_unref_page(). Also move free_unref_page_list(). Link: https://lkml.kernel.org/r/20201125034655.27687-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: mmap_lock: add tracepoints around lock acquisitionAxel Rasmussen
The goal of these tracepoints is to be able to debug lock contention issues. This lock is acquired on most (all?) mmap / munmap / page fault operations, so a multi-threaded process which does a lot of these can experience significant contention. We trace just before we start acquisition, when the acquisition returns (whether it succeeded or not), and when the lock is released (or downgraded). The events are broken out by lock type (read / write). The events are also broken out by memcg path. For container-based workloads, users often think of several processes in a memcg as a single logical "task", so collecting statistics at this level is useful. The end goal is to get latency information. This isn't directly included in the trace events. Instead, users are expected to compute the time between "start locking" and "acquire returned", using e.g. synthetic events or BPF. The benefit we get from this is simpler code. Because we use tracepoint_enabled() to decide whether or not to trace, this patch has effectively no overhead unless tracepoints are enabled at runtime. If tracepoints are enabled, there is a performance impact, but how much depends on exactly what e.g. the BPF program does. [axelrasmussen@google.com: fix use-after-free race and css ref leak in tracepoints] Link: https://lkml.kernel.org/r/20201130233504.3725241-1-axelrasmussen@google.com [axelrasmussen@google.com: v3] Link: https://lkml.kernel.org/r/20201207213358.573750-1-axelrasmussen@google.com [rostedt@goodmis.org: in-depth examples of tracepoint_enabled() usage, and per-cpu-per-context buffer design] Link: https://lkml.kernel.org/r/20201105211739.568279-2-axelrasmussen@google.com Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/page_vma_mapped.c: add colon to fix kernel-doc markups error for check_pteAlex Shi
check_pte() needs a correct colon for kernel-doc markup, otherwise, gcc has the following warning for W=1, mm/page_vma_mapped.c:86: warning: Function parameter or member 'pvmw' not described in 'check_pte' Link: https://lkml.kernel.org/r/1605597167-25145-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm/mapping_dirty_helpers: enhance the kernel-doc markupsAlex Shi
Add and change parameter explanation for wp_pte and clean_record_pte, to avoid W1 warning: mm/mapping_dirty_helpers.c:34: warning: Function parameter or member 'end' not described in 'wp_pte' mm/mapping_dirty_helpers.c:88: warning: Function parameter or member 'end' not described in 'clean_record_pte' Link: https://lkml.kernel.org/r/1605605088-30668-2-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: cleanup: remove unused tsk arg from __access_remote_vmJohn Hubbard
Despite a comment that said that page fault accounting would be charged to whatever task_struct* was passed into __access_remote_vm(), the tsk argument was actually unused. Making page fault accounting actually use this task struct is quite a project, so there is no point in keeping the tsk argument. Delete both the comment, and the argument. [rppt@linux.ibm.com: changelog addition] Link: https://lkml.kernel.org/r/20201026074137.4147787-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15mm: speedup mremap on 1GB or larger regionsKalesh Singh
Android needs to move large memory regions for garbage collection. The GC requires moving physical pages of multi-gigabyte heap using mremap. During this move, the application threads have to be paused for correctness. It is critical to keep this pause as short as possible to avoid jitters during user interaction. Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD level if the source and destination addresses are PUD-aligned. For CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves PGD entries, since the PUD entry is “folded back” onto the PGD entry. Add HAVE_MOVE_PUD so that architectures where moving at the PUD level isn't supported/tested can turn this off by not selecting the config. Link: https://lkml.kernel.org/r/20201014005320.2233162-4-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Gavin Shan <gshan@redhat.com> Cc: Hassan Naveed <hnaveed@wavecomp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>