summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2022-07-17mm: kfence: pass a pointer to virt_to_page()Linus Walleij
Functions that work on a pointer to virtual memory such as virt_to_pfn() and users of that function such as virt_to_page() are supposed to pass a pointer to virtual memory, ideally a (void *) or other pointer. However since many architectures implement virt_to_pfn() as a macro, this function becomes polymorphic and accepts both a (unsigned long) and a (void *). If we instead implement a proper virt_to_pfn(void *addr) function the following happens (occurred on arch/arm): mm/kfence/core.c:558:30: warning: passing argument 1 of 'virt_to_pfn' makes pointer from integer without a cast [-Wint-conversion] In one case we can refer to __kfence_pool directly (and that is a proper (char *) pointer) and in the other call site we use an explicit cast. Link: https://lkml.kernel.org/r/20220630084124.691207-4-linus.walleij@linaro.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm/highmem: pass a pointer to virt_to_page()Linus Walleij
Functions that work on a pointer to virtual memory such as virt_to_pfn() and users of that function such as virt_to_page() are supposed to pass a pointer to virtual memory, ideally a (void *) or other pointer. However since many architectures implement virt_to_pfn() as a macro, this function becomes polymorphic and accepts both a (unsigned long) and a (void *). If we instead implement a proper virt_to_pfn(void *addr) function the following happens (occurred on arch/arm): mm/highmem.c:153:29: warning: passing argument 1 of 'virt_to_pfn' makes pointer from integer without a cast [-Wint-conversion] We already have a proper void * pointer in the scope of this function named "vaddr" so pass that instead. Link: https://lkml.kernel.org/r/20220630084124.691207-3-linus.walleij@linaro.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm/memcontrol.c: replace cgroup_memory_nokmem with mem_cgroup_kmem_disabled()Xiang Yang
mem_cgroup_kmem_disabled() checks whether the kmem accounting is off. Therefore, replace cgroup_memory_nokmem with mem_cgroup_kmem_disabled(), which is the same work in percpu.c and slab_common.c. Link: https://lkml.kernel.org/r/20220625061844.226764-1-xiangyang3@huawei.com Signed-off-by: Xiang Yang <xiangyang3@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Souptick Joarder (HPE) <jrdr.linux@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm/page_alloc: replace local_lock with normal spinlockMel Gorman
struct per_cpu_pages is no longer strictly local as PCP lists can be drained remotely using a lock for protection. While the use of local_lock works, it goes against the intent of local_lock which is for "pure CPU local concurrency control mechanisms and not suited for inter-CPU concurrency control" (Documentation/locking/locktypes.rst) local_lock protects against migration between when the percpu pointer is accessed and the pcp->lock acquired. The lock acquisition is a preemption point so in the worst case, a task could migrate to another NUMA node and accidentally allocate remote memory. The main requirement is to pin the task to a CPU that is suitable for PREEMPT_RT and !PREEMPT_RT. Replace local_lock with helpers that pin a task to a CPU, lookup the per-cpu structure and acquire the embedded lock. It's similar to local_lock without breaking the intent behind the API. It is not a complete API as only the parts needed for PCP-alloc are implemented but in theory, the generic helpers could be promoted to a general API if there was demand for an embedded lock within a per-cpu struct with a guarantee that the per-cpu structure locked matches the running CPU and cannot use get_cpu_var due to RT concerns. PCP requires these semantics to avoid accidentally allocating remote memory. [mgorman@techsingularity.net: use pcp_spin_trylock_irqsave instead of pcpu_spin_trylock_irqsave] Link: https://lkml.kernel.org/r/20220627084645.GA27531@techsingularity.net Link: https://lkml.kernel.org/r/20220624125423.6126-8-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm/page_alloc: remotely drain per-cpu listsNicolas Saenz Julienne
Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu drain work queued by __drain_all_pages(). So introduce a new mechanism to remotely drain the per-cpu lists. It is made possible by remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new scheme is that drain operations are now migration safe. There was no observed performance degradation vs. the previous scheme. Both netperf and hackbench were run in parallel to triggering the __drain_all_pages(NULL, true) code path around ~100 times per second. The new scheme performs a bit better (~5%), although the important point here is there are no performance regressions vs. the previous mechanism. Per-cpu lists draining happens only in slow paths. Minchan Kim tested an earlier version and reported; My workload is not NOHZ CPUs but run apps under heavy memory pressure so they goes to direct reclaim and be stuck on drain_all_pages until work on workqueue run. unit: nanosecond max(dur) avg(dur) count(dur) 166713013 487511.77786438033 1283 From traces, system encountered the drain_all_pages 1283 times and worst case was 166ms and avg was 487us. The other problem was alloc_contig_range in CMA. The PCP draining takes several hundred millisecond sometimes though there is no memory pressure or a few of pages to be migrated out but CPU were fully booked. Your patch perfectly removed those wasted time. Link: https://lkml.kernel.org/r/20220624125423.6126-7-mgorman@techsingularity.net Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Yu Zhao <yuzhao@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm/page_alloc: protect PCP lists with a spinlockMel Gorman
Currently the PCP lists are protected by using local_lock_irqsave to prevent migration and IRQ reentrancy but this is inconvenient. Remote draining of the lists is impossible and a workqueue is required and every task allocation/free must disable then enable interrupts which is expensive. As preparation for dealing with both of those problems, protect the lists with a spinlock. The IRQ-unsafe version of the lock is used because IRQs are already disabled by local_lock_irqsave. spin_trylock is used in combination with local_lock_irqsave() but later will be replaced with a spin_trylock_irqsave when the local_lock is removed. The per_cpu_pages still fits within the same number of cache lines after this patch relative to before the series. struct per_cpu_pages { spinlock_t lock; /* 0 4 */ int count; /* 4 4 */ int high; /* 8 4 */ int batch; /* 12 4 */ short int free_factor; /* 16 2 */ short int expire; /* 18 2 */ /* XXX 4 bytes hole, try to pack */ struct list_head lists[13]; /* 24 208 */ /* size: 256, cachelines: 4, members: 7 */ /* sum members: 228, holes: 1, sum holes: 4 */ /* padding: 24 */ } __attribute__((__aligned__(64))); There is overhead in the fast path due to acquiring the spinlock even though the spinlock is per-cpu and uncontended in the common case. Page Fault Test (PFT) running on a 1-socket reported the following results on a 1 socket machine. 5.19.0-rc3 5.19.0-rc3 vanilla mm-pcpspinirq-v5r16 Hmean faults/sec-1 869275.7381 ( 0.00%) 874597.5167 * 0.61%* Hmean faults/sec-3 2370266.6681 ( 0.00%) 2379802.0362 * 0.40%* Hmean faults/sec-5 2701099.7019 ( 0.00%) 2664889.7003 * -1.34%* Hmean faults/sec-7 3517170.9157 ( 0.00%) 3491122.8242 * -0.74%* Hmean faults/sec-8 3965729.6187 ( 0.00%) 3939727.0243 * -0.66%* There is a small hit in the number of faults per second but given that the results are more stable, it's borderline noise. [akpm@linux-foundation.org: add missing local_unlock_irqrestore() on contention path] Link: https://lkml.kernel.org/r/20220624125423.6126-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm/page_alloc: remove mistaken page == NULL check in rmqueueMel Gorman
If a page allocation fails, the ZONE_BOOSTER_WATERMARK should be tested, cleared and kswapd woken whether the allocation attempt was via the PCP or directly via the buddy list. Remove the page == NULL so the ZONE_BOOSTED_WATERMARK bit is checked unconditionally. As it is unlikely that ZONE_BOOSTED_WATERMARK is set, mark the branch accordingly. Link: https://lkml.kernel.org/r/20220624125423.6126-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm/page_alloc: split out buddy removal code from rmqueue into separate helperMel Gorman
This is a preparation page to allow the buddy removal code to be reused in a later patch. No functional change. Link: https://lkml.kernel.org/r/20220624125423.6126-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm/page_alloc: use only one PCP list for THP-sized allocationsMel Gorman
The per_cpu_pages is cache-aligned on a standard x86-64 distribution configuration but a later patch will add a new field which would push the structure into the next cache line. Use only one list to store THP-sized pages on the per-cpu list. This assumes that the vast majority of THP-sized allocations are GFP_MOVABLE but even if it was another type, it would not contribute to serious fragmentation that potentially causes a later THP allocation failure. Align per_cpu_pages on the cacheline boundary to ensure there is no false cache sharing. After this patch, the structure sizing is; struct per_cpu_pages { int count; /* 0 4 */ int high; /* 4 4 */ int batch; /* 8 4 */ short int free_factor; /* 12 2 */ short int expire; /* 14 2 */ struct list_head lists[13]; /* 16 208 */ /* size: 256, cachelines: 4, members: 6 */ /* padding: 32 */ } __attribute__((__aligned__(64))); Link: https://lkml.kernel.org/r/20220624125423.6126-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm/page_alloc: add page->buddy_list and page->pcp_listMel Gorman
Patch series "Drain remote per-cpu directly", v5. Some setups, notably NOHZ_FULL CPUs, may be running realtime or latency-sensitive applications that cannot tolerate interference due to per-cpu drain work queued by __drain_all_pages(). Introduce a new mechanism to remotely drain the per-cpu lists. It is made possible by remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. This has two advantages, the time to drain is more predictable and other unrelated tasks are not interrupted. This series has the same intent as Nicolas' series "mm/page_alloc: Remote per-cpu lists drain support" -- avoid interference of a high priority task due to a workqueue item draining per-cpu page lists. While many workloads can tolerate a brief interruption, it may cause a real-time task running on a NOHZ_FULL CPU to miss a deadline and at minimum, the draining is non-deterministic. Currently an IRQ-safe local_lock protects the page allocator per-cpu lists. The local_lock on its own prevents migration and the IRQ disabling protects from corruption due to an interrupt arriving while a page allocation is in progress. This series adjusts the locking. A spinlock is added to struct per_cpu_pages to protect the list contents while local_lock_irq is ultimately replaced by just the spinlock in the final patch. This allows a remote CPU to safely. Follow-on work should allow the spin_lock_irqsave to be converted to spin_lock to avoid IRQs being disabled/enabled in most cases. The follow-on patch will be one kernel release later as it is relatively high risk and it'll make bisections more clear if there are any problems. Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages and when it is storing per-cpu pages. Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking this is not necessary but it avoids per_cpu_pages consuming another cache line. Patch 3 is a preparation patch to avoid code duplication. Patch 4 is a minor correction. Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still relying on local_lock to prevent migration, stabilise the pcp lookup and prevent IRQ reentrancy. Patch 6 remote drains per-cpu pages directly instead of using a workqueue. Patch 7 uses a normal spinlock instead of local_lock for remote draining This patch (of 7): The page allocator uses page->lru for storing pages on either buddy or PCP lists. Create page->buddy_list and page->pcp_list as a union with page->lru. This is simply to clarify what type of list a page is on in the page allocator. No functional change intended. [minchan@kernel.org: fix page lru fields in macros] Link: https://lkml.kernel.org/r/20220624125423.6126-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17hugetlb: lazy page table copies in fork()Mike Kravetz
Lazy page table copying at fork time was introduced with commit d992895ba2b2 ("[PATCH] Lazy page table copies in fork()"). At the time, hugetlb was very new and did not support page faulting. As a result, it was excluded. When full page fault support was added for hugetlb, the exclusion was not removed. Simply remove the check that prevents lazy copying of hugetlb page tables at fork. Of course, like other mappings this only applies to shared mappings. Lazy page table copying at fork will be less advantageous for hugetlb mappings because: - There are fewer page table entries with hugetlb - hugetlb pmds can be shared instead of copied In any case, completely eliminating the copy at fork time should speed things up. Link: https://lkml.kernel.org/r/20220621235620.291305-5-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: James Houghton <jthoughton@google.com> Cc: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17hugetlb: do not update address in huge_pmd_unshareMike Kravetz
As an optimization for loops sequentially processing hugetlb address ranges, huge_pmd_unshare would update a passed address if it unshared a pmd. Updating a loop control variable outside the loop like this is generally a bad idea. These loops are now using hugetlb_mask_last_page to optimize scanning when non-present ptes are discovered. The same can be done when huge_pmd_unshare returns 1 indicating a pmd was unshared. Remove address update from huge_pmd_unshare. Change the passed argument type and update all callers. In loops sequentially processing addresses use hugetlb_mask_last_page to update address if pmd is unshared. [sfr@canb.auug.org.au: fix an unused variable warning/error] Link: https://lkml.kernel.org/r/20220622171117.70850960@canb.auug.org.au Link: https://lkml.kernel.org/r/20220621235620.291305-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Will Deacon <will@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17hugetlb: skip to end of PT page mapping when pte not presentMike Kravetz
Patch series "hugetlb: speed up linear address scanning", v2. At unmap, fork and remap time hugetlb address ranges are linearly scanned. We can optimize these scans if the ranges are sparsely populated. Also, enable page table "Lazy copy" for hugetlb at fork. NOTE: Architectures not defining CONFIG_ARCH_WANT_GENERAL_HUGETLB need to add an arch specific version hugetlb_mask_last_page() to take advantage of sparse address scanning improvements. Baolin Wang added the routine for arm64. Other architectures which could be optimized are: ia64, mips, parisc, powerpc, s390, sh and sparc. This patch (of 4): HugeTLB address ranges are linearly scanned during fork, unmap and remap operations. If a non-present entry is encountered, the code currently continues to the next huge page aligned address. However, a non-present entry implies that the page table page for that entry is not present. Therefore, the linear scan can skip to the end of range mapped by the page table page. This can speed operations on large sparsely populated hugetlb mappings. Create a new routine hugetlb_mask_last_page() that will return an address mask. When the mask is ORed with an address, the result will be the address of the last huge page mapped by the associated page table page. Use this mask to update addresses in routines which linearly scan hugetlb address ranges when a non-present pte is encountered. hugetlb_mask_last_page is related to the implementation of huge_pte_offset as hugetlb_mask_last_page is called when huge_pte_offset returns NULL. This patch only provides a complete hugetlb_mask_last_page implementation when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined. Architectures which provide their own versions of huge_pte_offset can also provide their own version of hugetlb_mask_last_page. Link: https://lkml.kernel.org/r/20220621235620.291305-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20220621235620.291305-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: James Houghton <jthoughton@google.com> Cc: Mina Almasry <almasrymina@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17kasan: separate double free case from invalid freeKuan-Ying Lee
Currently, KASAN describes all invalid-free/double-free bugs as "double-free or invalid-free". This is ambiguous. KASAN should report "double-free" when a double-free is a more likely cause (the address points to the start of an object) and report "invalid-free" otherwise [1]. [1] https://bugzilla.kernel.org/show_bug.cgi?id=212193 Link: https://lkml.kernel.org/r/20220615062219.22618-1-Kuan-Ying.Lee@mediatek.com Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Yee Lee <yee.lee@mediatek.com> Cc: Andrew Yang <andrew.yang@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: khugepaged: reorg some khugepaged helpersYang Shi
The khugepaged_{enabled|always|req_madv} are not khugepaged only anymore, move them to huge_mm.h and rename to hugepage_flags_xxx, and remove khugepaged_req_madv due to no users. Also move khugepaged_defrag to khugepaged.c since its only caller is in that file, it doesn't have to be in a header file. Link: https://lkml.kernel.org/r/20220616174840.1202070-7-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: thp: kill __transhuge_page_enabled()Yang Shi
The page fault path checks THP eligibility with __transhuge_page_enabled() which does the similar thing as hugepage_vma_check(), so use hugepage_vma_check() instead. However page fault allows DAX and !anon_vma cases, so added a new flag, in_pf, to hugepage_vma_check() to make page fault work correctly. The in_pf flag is also used to skip shmem and file THP for page fault since shmem handles THP in its own shmem_fault() and file THP allocation on fault is not supported yet. Also remove hugepage_vma_enabled() since hugepage_vma_check() is the only caller now, it is not necessary to have a helper function. Link: https://lkml.kernel.org/r/20220616174840.1202070-6-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: thp: kill transparent_hugepage_active()Yang Shi
The transparent_hugepage_active() was introduced to show THP eligibility bit in smaps in proc, smaps is the only user. But it actually does the similar check as hugepage_vma_check() which is used by khugepaged. We definitely don't have to maintain two similar checks, so kill transparent_hugepage_active(). This patch also fixed the wrong behavior for VM_NO_KHUGEPAGED vmas. Also move hugepage_vma_check() to huge_memory.c and huge_mm.h since it is not only for khugepaged anymore. [akpm@linux-foundation.org: check vma->vm_mm, per Zach] [akpm@linux-foundation.org: add comment to vdso check] Link: https://lkml.kernel.org/r/20220616174840.1202070-5-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: khugepaged: better comments for anon vma check in hugepage_vma_revalidateYang Shi
The hugepage_vma_revalidate() needs to check if the vma is still anonymous vma or not since the address may be unmapped then remapped to file before khugepaged reaquired the mmap_lock. The old comment is not quite helpful, elaborate this with better comment. Link: https://lkml.kernel.org/r/20220616174840.1202070-4-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: thp: consolidate vma size check to transhuge_vma_suitableYang Shi
There are couple of places that check whether the vma size is ok for THP or whether address fits, they are open coded and duplicate, use transhuge_vma_suitable() to do the job by passing in (vma->end - HPAGE_PMD_SIZE). Move vma size check into hugepage_vma_check(). This will make khugepaged_enter() is as same as khugepaged_enter_vma(). There is just one caller for khugepaged_enter(), replace it to khugepaged_enter_vma() and remove khugepaged_enter(). Link: https://lkml.kernel.org/r/20220616174840.1202070-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: khugepaged: check THP flag in hugepage_vma_check()Yang Shi
Patch series "Cleanup transhuge_xxx helpers", v5. This series is the follow-up of the discussion about cleaning up transhuge_xxx helpers at https://lore.kernel.org/linux-mm/627a71f8-e879-69a5-ceb3-fc8d29d2f7f1@suse.cz/. THP has a bunch of helpers that do VMA sanity check for different paths, they do the similar checks for the most callsites and have a lot duplicate codes. And it is confusing what helpers should be used at what conditions. This series reorganized and cleaned up the code so that we could consolidate all the checks into hugepage_vma_check(). The transhuge_vma_enabled(), transparent_hugepage_active() and __transparent_hugepage_enabled() are killed by this series. This patch (of 7): Currently the THP flag check in hugepage_vma_check() will fallthrough if the flag is NEVER and VM_HUGEPAGE is set. This is not a problem for now since all the callers have the flag checked before or can't be invoked if the flag is NEVER. However, the following patch will call hugepage_vma_check() in more places, for example, page fault, so this flag must be checked in hugepge_vma_check(). Link: https://lkml.kernel.org/r/20220616174840.1202070-1-shy828301@gmail.com Link: https://lkml.kernel.org/r/20220616174840.1202070-2-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: introduce mf_dax_kill_procs() for fsdax caseShiyang Ruan
This new function is a variant of mf_generic_kill_procs that accepts a file, offset pair instead of a struct to support multiple files sharing a DAX mapping. It is intended to be called by the file systems as part of the memory_failure handler after the file system performed a reverse mapping from the storage address to the file and file offset. Link: https://lkml.kernel.org/r/20220603053738.1218681-6-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17pagemap,pmem: introduce ->memory_failure()Shiyang Ruan
When memory-failure occurs, we call this function which is implemented by each kind of devices. For the fsdax case, pmem device driver implements it. Pmem device driver will find out the filesystem in which the corrupted page located in. With dax_holder notify support, we are able to notify the memory failure from pmem driver to upper layers. If there is something not support in the notify routine, memory_failure will fall back to the generic hanlder. Link: https://lkml.kernel.org/r/20220603053738.1218681-4-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: factor helpers for memory_failure_dev_pagemapShiyang Ruan
memory_failure_dev_pagemap code is a bit complex before introduce RMAP feature for fsdax. So it is needed to factor some helper functions to simplify these code. [akpm@linux-foundation.org: fix CONFIG_HUGETLB_PAGE=n build] [zhengbin13@huawei.com: fix redefinition of mf_generic_kill_procs] Link: https://lkml.kernel.org/r/20220628112143.1170473-1-zhengbin13@huawei.com Link: https://lkml.kernel.org/r/20220603053738.1218681-3-ruansy.fnst@fujitsu.com Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm/gup: migrate device coherent pages when pinning instead of failingAlistair Popple
Currently any attempts to pin a device coherent page will fail. This is because device coherent pages need to be managed by a device driver, and pinning them would prevent a driver from migrating them off the device. However this is no reason to fail pinning of these pages. These are coherent and accessible from the CPU so can be migrated just like pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin them first try migrating them out of ZONE_DEVICE. [hch@lst.de: rebased to the split device memory checks, moved migrate_device_page to migrate_device.c] Link: https://lkml.kernel.org/r/20220715150521.18165-7-alex.sierra@amd.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: add device coherent vma selection for memory migrationAlex Sierra
This case is used to migrate pages from device memory, back to system memory. Device coherent type memory is cache coherent from device and CPU point of view. Link: https://lkml.kernel.org/r/20220715150521.18165-6-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alistair Poppple <apopple@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: handling Non-LRU pages returned by vm_normal_pagesAlex Sierra
With DEVICE_COHERENT, we'll soon have vm_normal_pages() return device-managed anonymous pages that are not LRU pages. Although they behave like normal pages for purposes of mapping in CPU page, and for COW. They do not support LRU lists, NUMA migration or THP. Callers to follow_page() currently don't expect ZONE_DEVICE pages, however, with DEVICE_COHERENT we might now return ZONE_DEVICE. Check for ZONE_DEVICE pages in applicable users of follow_page() as well. Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> [v2] Reviewed-by: Alistair Popple <apopple@nvidia.com> [v6] Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: add zone device coherent type memory supportAlex Sierra
Device memory that is cache coherent from device and CPU point of view. This is used on platforms that have an advanced system bus (like CAPI or CXL). Any page of a process can be migrated to such memory. However, no one should be allowed to pin such memory so that it can always be evicted. [hch@lst.de: rebased ontop of the refcount changes, remove is_dev_private_or_coherent_page] Link: https://lkml.kernel.org/r/20220715150521.18165-4-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm: rename is_pinnable_page() to is_longterm_pinnable_page()Alex Sierra
Patch series "Add MEMORY_DEVICE_COHERENT for coherent device memory mapping", v9. This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory owned by a device that can be mapped into CPU page tables like MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE. This patch series is mostly self-contained except for a few places where it needs to update other subsystems to handle the new memory type. System stability and performance are not affected according to our ongoing testing, including xfstests. How it works: The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver registers the memory with devmap as MEMORY_DEVICE_COHERENT using devm_memremap_pages. The initial user for this hardware page migration capability is the Frontier supercomputer project. This functionality is not AMD-specific. We expect other GPU vendors to find this functionality useful, and possibly other hardware types in the future. Our test nodes in the lab are similar to the Frontier configuration, with .5 TB of system memory plus 256 GB of device memory split across 4 GPUs, all in a single coherent address space. Page migration is expected to improve application efficiency significantly. We will report empirical results as they become available. Coherent device type pages at gup are now migrated back to system memory if they are being pinned long-term (FOLL_LONGTERM). The reason is, that long-term pinning would interfere with the device memory manager owning the device-coherent pages (e.g. evictions in TTM). These series incorporate Alistair Popple patches to do this migration from pin_user_pages() calls. hmm_gup_test has been added to hmm-test to test different get user pages calls. This series includes handling of device-managed anonymous pages returned by vm_normal_pages. Although they behave like normal pages for purposes of mapping in CPU page tables and for COW, they do not support LRU lists, NUMA migration or THP. We also introduced a FOLL_LRU flag that adds the same behaviour to follow_page and related APIs, to allow callers to specify that they expect to put pages on an LRU list. This patch (of 14): is_pinnable_page() and folio_is_pinnable() are renamed to is_longterm_pinnable_page() and folio_is_longterm_pinnable() respectively. These functions are used in the FOLL_LONGTERM flag context. Link: https://lkml.kernel.org/r/20220715150521.18165-1-alex.sierra@amd.com Link: https://lkml.kernel.org/r/20220715150521.18165-2-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17mm/damon/lru_sort: fix potential memory leak in damon_lru_sort_init()SeongJae Park
damon_lru_sort_init() returns an error when damon_select_ops() fails without freeing 'ctx' which allocated before. This commit fixes the potential memory leak by freeing 'ctx' under the situation. Link: https://lkml.kernel.org/r/20220714170458.49727-1-sj@kernel.org Fixes: 40e983cca927 ("mm/damon: introduce DAMON-based LRU-lists Sorting") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17UML: add support for KASAN under x86_64Patricia Alfonso
Make KASAN run on User Mode Linux on x86_64. The UML-specific KASAN initializer uses mmap to map the ~16TB of shadow memory to the location defined by KASAN_SHADOW_OFFSET. kasan_init() utilizes constructors to initialize KASAN before main(). The location of the KASAN shadow memory, starting at KASAN_SHADOW_OFFSET, can be configured using the KASAN_SHADOW_OFFSET option. The default location of this offset is 0x100000000000, which keeps it out-of-the-way even on UML setups with more "physical" memory. For low-memory setups, 0x7fff8000 can be used instead, which fits in an immediate and is therefore faster, as suggested by Dmitry Vyukov. There is usually enough free space at this location; however, it is a config option so that it can be easily changed if needed. Note that, unlike KASAN on other architectures, vmalloc allocations still use the shadow memory allocated upfront, rather than allocating and free-ing it per-vmalloc allocation. If another architecture chooses to go down the same path, we should replace the checks for CONFIG_UML with something more generic, such as: - A CONFIG_KASAN_NO_SHADOW_ALLOC option, which architectures could set - or, a way of having architecture-specific versions of these vmalloc and module shadow memory allocation options. Also note that, while UML supports both KASAN in inline mode (CONFIG_KASAN_INLINE) and static linking (CONFIG_STATIC_LINK), it does not support both at the same time. Signed-off-by: Patricia Alfonso <trishalfonso@google.com> Co-developed-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: David Gow <davidgow@google.com> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2022-07-07mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISONCatalin Marinas
Currently post_alloc_hook() skips the kasan unpoisoning if the tags will be zeroed (__GFP_ZEROTAGS) or __GFP_SKIP_KASAN_UNPOISON is passed. Since __GFP_ZEROTAGS is now accompanied by __GFP_SKIP_KASAN_UNPOISON, remove the extra check. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20220610152141.2148929-4-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-07mm: kasan: Skip unpoisoning of user pagesCatalin Marinas
Commit c275c5c6d50a ("kasan: disable freed user page poisoning with HW tags") added __GFP_SKIP_KASAN_POISON to GFP_HIGHUSER_MOVABLE. A similar argument can be made about unpoisoning, so also add __GFP_SKIP_KASAN_UNPOISON to user pages. To ensure the user page is still accessible via page_address() without a kasan fault, reset the page->flags tag. With the above changes, there is no need for the arm64 tag_clear_highpage() to reset the page->flags tag. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20220610152141.2148929-3-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-07mm: kasan: Ensure the tags are visible before the tag in page->flagsCatalin Marinas
__kasan_unpoison_pages() colours the memory with a random tag and stores it in page->flags in order to re-create the tagged pointer via page_to_virt() later. When the tag from the page->flags is read, ensure that the in-memory tags are already visible by re-ordering the page_kasan_tag_set() after kasan_unpoison(). The former already has barriers in place through try_cmpxchg(). On the reader side, the order is ensured by the address dependency between page->flags and the memory access. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Link: https://lore.kernel.org/r/20220610152141.2148929-2-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-04mm: slab: optimize memcg_slab_free_hook()Muchun Song
Most callers of memcg_slab_free_hook() already know the slab, which could be passed to memcg_slab_free_hook() directly to reduce the overhead of an another call of virt_to_slab(). For bulk freeing of objects, the call of slab_objcgs() in the loop in memcg_slab_free_hook() is redundant as well. Rework memcg_slab_free_hook() and build_detached_freelist() to reduce those unnecessary overhead and make memcg_slab_free_hook() can handle bulk freeing in slab_free(). Move the calling site of memcg_slab_free_hook() from do_slab_free() to slab_free() for slub to make the code clearer since the logic is weird (e.g. the caller need to judge whether it needs to call memcg_slab_free_hook()). It is easy to make mistakes like missing calling of memcg_slab_free_hook() like fixes of: commit d1b2cf6cb84a ("mm: memcg/slab: uncharge during kmem_cache_free_bulk()") commit ae085d7f9365 ("mm: kfence: fix missing objcg housekeeping for SLAB") This optimization is mainly for bulk objects freeing. The following numbers is shown for 16-object freeing. before after kmem_cache_free_bulk: ~430 ns ~400 ns The overhead is reduced by about 7% for 16-object freeing. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Link: https://lore.kernel.org/r/20220429123044.37885-1-songmuchun@bytedance.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-07-04mm/tracing: add 'accounted' entry into output of allocation tracepointsVasily Averin
Slab caches marked with SLAB_ACCOUNT force accounting for every allocation from this cache even if __GFP_ACCOUNT flag is not passed. Unfortunately, at the moment this flag is not visible in ftrace output, and this makes it difficult to analyze the accounted allocations. This patch adds boolean "accounted" entry into trace output, and set it to 'true' for calls used __GFP_ACCOUNT flag and for allocations from caches marked with SLAB_ACCOUNT. Set it to 'false' if accounting is disabled in configs. Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Link: https://lore.kernel.org/r/c418ed25-65fe-f623-fbf8-1676528859ed@openvz.org Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-07-04mm/slub: Simplify __kmem_cache_alias()Xiongwei Song
There is no need to do anything if sysfs_slab_alias() return nonzero value after getting a mergeable cache. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Link: https://lore.kernel.org/all/e5ebc952-af17-321f-5343-bc914d47c931@suse.cz/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-07-04mm, slab: fix bad alignmentsJiapeng Chong
As reported by coccicheck: ./mm/slab.c:3253:2-59: code aligned with following code on line 3255. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-07-03mm/khugepaged: try to free transhuge swapcache when possibleMiaohe Lin
Transhuge swapcaches won't be freed in __collapse_huge_page_copy(). It's because release_pte_page() is not called for these pages and thus free_page_and_swap_cache can't grab the page lock. These pages won't be freed from swap cache even if we are the only user until next time reclaim. It shouldn't hurt indeed, but we could try to free these pages to save more memory for system. Link: https://lkml.kernel.org/r/20220625092816.4856-8-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm/khugepaged: remove unneeded return value of khugepaged_add_pte_mapped_thp()Miaohe Lin
The return value of khugepaged_add_pte_mapped_thp() is always 0 and also ignored. Remove it to clean up the code. Link: https://lkml.kernel.org/r/20220625092816.4856-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm/khugepaged: use helper macro __ATTR_RWMiaohe Lin
Use helper macro __ATTR_RW to define the khugepaged attributes. Minor readability improvement. Link: https://lkml.kernel.org/r/20220625092816.4856-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm/khugepaged: minor cleanup for collapse_fileMiaohe Lin
nr_none is always 0 for non-shmem case because the page can be read from the backend store. So when nr_none ! = 0, it must be in is_shmem case. Also only adjust the nrpages and uncharge shmem when nr_none != 0 to save cpu cycles. Link: https://lkml.kernel.org/r/20220625092816.4856-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm/khugepaged: trivial typo and codestyle cleanupMiaohe Lin
Fix some typos and tweak the code to meet codestyle. No functional change intended. Link: https://lkml.kernel.org/r/20220625092816.4856-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm/khugepaged: stop swapping in page when VM_FAULT_RETRY occursMiaohe Lin
When do_swap_page returns VM_FAULT_RETRY, we do not retry here and thus swap entry will remain in pagetable. This will result in later failure. So stop swapping in pages in this case to save cpu cycles. As A further optimization, mmap_lock is released when __collapse_huge_page_swapin() fails to avoid relocking mmap_lock. And "swapped_in++" is moved after error handling to make it more accurate. Link: https://lkml.kernel.org/r/20220625092816.4856-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm/khugepaged: remove unneeded shmem_huge_enabled() checkMiaohe Lin
Patch series "A few cleanup patches for khugepaged", v2. This series contains a few cleaup patches to remove unneeded return value, use helper macro, fix typos and so on. More details can be found in the respective changelogs. This patch (of 7): If we reach here, khugepaged_scan_mm_slot() has already made sure that hugepage is enabled for shmem, via its call to hugepage_vma_check(). Remove this duplicated check. Link: https://lkml.kernel.org/r/20220625092816.4856-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20220625092816.4856-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Howells <dhowells@redhat.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm: sparsemem: drop unexpected word 'a' in commentsXueBing Chen
there is an unexpected word 'a' in the comments that need to be dropped Link: https://lkml.kernel.org/r/24fbdae3.c86.1819a0f31b9.Coremail.chenxuebing@jari.cn Signed-off-by: XueBing Chen <chenxuebing@jari.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm: hugetlb: kill set_huge_swap_pte_at()Qi Zheng
Commit e5251fd43007 ("mm/hugetlb: introduce set_huge_swap_pte_at() helper") add set_huge_swap_pte_at() to handle swap entries on architectures that support hugepages consisting of contiguous ptes. And currently the set_huge_swap_pte_at() is only overridden by arm64. set_huge_swap_pte_at() provide a sz parameter to help determine the number of entries to be updated. But in fact, all hugetlb swap entries contain pfn information, so we can find the corresponding folio through the pfn recorded in the swap entry, then the folio_size() is the number of entries that need to be updated. And considering that users will easily cause bugs by ignoring the difference between set_huge_swap_pte_at() and set_huge_pte_at(). Let's handle swap entries in set_huge_pte_at() and remove the set_huge_swap_pte_at(), then we can call set_huge_pte_at() anywhere, which simplifies our coding. Link: https://lkml.kernel.org/r/20220626145717.53572-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm/page_alloc: make the annotations of available memory more accurateYang Yang
Not all systems use swap, so estimating available memory would help to prevent swapping or OOM of system that not use swap. And we need to reserve some page cache to prevent swapping or thrashing. If somebody is accessing the pages in pagecache, and if too much would be freed, most accesses might mean reading data from disk, i.e. thrashing. Link: https://lkml.kernel.org/r/20220623020833.972979-1-yang.yang29@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: CGEL ZTE <cgel.zte@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm, docs: fix comments that mention mem_hotplug_end()Yun-Ze Li
Comments that mention mem_hotplug_end() are confusing as there is no function called mem_hotplug_end(). Fix them by replacing all the occurences of mem_hotplug_end() in the comments with mem_hotplug_done(). [akpm@linux-foundation.org: grammatical fixes] Link: https://lkml.kernel.org/r/20220620071516.1286101-1-p76091292@gs.ncku.edu.tw Signed-off-by: Yun-Ze Li <p76091292@gs.ncku.edu.tw> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm: rmap: simplify the hugetlb handling when unmapping or migrationBaolin Wang
According to previous discussion [1], there are so many levels of indenting to handle the hugetlb case when unmapping or migration. We can combine folio_test_anon() and huge_pmd_unshare() to save one level of indenting, by adding a local variable and moving the VM_BUG_ON() a little forward. No intended functional changes in this patch. [1] https://lore.kernel.org/all/0b986dc4-5843-3e2d-c2df-5a2e9f13e6ab@oracle.com/ Link: https://lkml.kernel.org/r/28414b1b96f095e838c1e548074f8e0fc70d78cf.1655724713.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm/madvise: minor cleanup for swapin_walk_pmd_entry()Miaohe Lin
Passing index to pte_offset_map_lock() directly so the below calculation can be avoided. Rename orig_pte to ptep as it's not changed. Also use helper is_swap_pte() to improve the readability. No functional change intended. [akpm@linux-foundation.org: reduce scope of `ptep'] Link: https://lkml.kernel.org/r/20220618090527.37843-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>