summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2023-08-21mm/compaction: correct comment to complete migration failureKemeng Shi
Commit cfccd2e63e7e0 ("mm, compaction: finish pageblocks on complete migration failure") convert cc->order aligned check to page block order aligned check. Correct comment relevant with it. Link: https://lkml.kernel.org/r/20230804110454.2935878-7-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/compaction: correct comment of cached migrate pfn updateKemeng Shi
Commit e380bebe47715 ("mm, compaction: keep migration source private to a single compaction instance") moved update of async and sync compact_cached_migrate_pfn from update_pageblock_skip to update_cached_migrate but left the comment behind. Move the relevant comment to correct this. Link: https://lkml.kernel.org/r/20230804110454.2935878-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/compaction: correct comment of fast_find_migrateblock in isolate_migratepagesKemeng Shi
After 90ed667c03fe5 ("Revert "Revert "mm/compaction: fix set skip in fast_find_migrateblock"""), we remove skip set in fast_find_migrateblock. Correct comment that fast_find_block is used to avoid isolation_suitable check for pageblock returned from fast_find_migrateblock because fast_find_migrateblock will mark found pageblock skipped. Instead, comment that fast_find_block is used to avoid a redundant check of fast found pageblock which is already checked skip flag inside fast_find_migrateblock. Link: https://lkml.kernel.org/r/20230804110454.2935878-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/compaction: skip page block marked skip in isolate_migratepages_blockKemeng Shi
Move migrate_pfn to page block end when block is marked skip to avoid unnecessary scan retry of that block from upper caller. For example, compact_zone may wrongly rescan skip page block with finish_pageblock set as following: 1. cc->migrate point to the start of page block 2. compact_zone record last_migrated_pfn to cc->migrate 3. compact_zone->isolate_migratepages->isolate_migratepages_block tries to scan the block. The low_pfn maybe moved forward to middle of block because of free pages at beginning of block. 4. we find first lru page could be isolated but block was exclusive marked skip. 5. abort isolate_migratepages_block and make cc->migrate_pfn point to found lru page at middle of block. 6. compact_zone find cc->migrate_pfn and last_migrated_pfn are in the same block and wrongly rescan the block with finish_pageblock set. Link: https://lkml.kernel.org/r/20230804110454.2935878-4-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/compaction: correct last_migrated_pfn update in compact_zoneKemeng Shi
We record start pfn of last isolated page block with last_migrated_pfn. And then: 1. We check if we mark the page block skip for exclusive access in isolate_migratepages_block by test if next migrate pfn is still in last isolated page block. If so, we will set finish_pageblock to do the rescan. 2. We check if a full cc->order block is scanned by test if last scan range passes the cc->order block boundary. If so, we flush the pages were freed. We treat cc->migrate_pfn before isolate_migratepages as the start pfn of last isolated page range. However, we always align migrate_pfn to page block or move to another page block in fast_find_migrateblock or in linearly scan forward in isolate_migratepages before do page isolation in isolate_migratepages_block. Update last_migrated_pfn with pageblock_start_pfn(cc->migrate_pfn - 1) after scan to correctly set start pfn of last isolated page range. To avoid that: 1. Miss a rescan with finish_pageblock set as last_migrate_pfn does not point to right pageblock and the migrate will not be in pageblock of last_migrate_pfn as it should be. 2. Wrongly issue flush by test cc->order block boundary with wrong last_migrate_pfn. Link: https://lkml.kernel.org/r/20230804110454.2935878-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: no need to export mm_kobjGreg Kroah-Hartman
There are no modules using mm_kobj, so do not export it. Link: https://lkml.kernel.org/r/2023080436-algebra-cabana-417d@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: hugetlb: use flush_hugetlb_tlb_range() in move_hugetlb_page_tables()Kefeng Wang
Archs may need to do special things when flushing hugepage tlb, so use the more applicable flush_hugetlb_tlb_range() instead of flush_tlb_range(). Link: https://lkml.kernel.org/r/20230801023145.17026-2-wangkefeng.wang@huawei.com Fixes: 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mina Almasry <almasrymina@google.com> Cc: Will Deacon <will@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/compaction: remove unnecessary "else continue" at end of loop in ↵Kemeng Shi
isolate_freepages_block There is no behavior change to remove "else continue" code at end of scan loop. Just remove it to make code cleaner. Link: https://lkml.kernel.org/r/20230803094901.2915942-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kemeng Shi <shikemeng@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/compaction: remove unnecessary cursor page in isolate_freepages_blockKemeng Shi
The cursor is only used for page forward currently. We can simply move page forward directly to remove unnecessary cursor. Link: https://lkml.kernel.org/r/20230803094901.2915942-4-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kemeng Shi <shikemeng@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/compaction: merge end_pfn boundary check in isolate_freepages_rangeKemeng Shi
Merge the end_pfn boundary check for single page block forward and multiple page blocks forward to avoid do twice boundary check for multiple page blocks forward. Link: https://lkml.kernel.org/r/20230803094901.2915942-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/compaction: set compact_cached_free_pfn correctly in update_pageblock_skipKemeng Shi
Patch series "Fixes and cleanups to compaction", v2. This series contains random fixes and cleanups to free page isolation in compaction. This is based on another compact series[1]. More details can be found in respective patches. This patch (of 4): We will set skip to page block of block_start_pfn, it's more reasonable to set compact_cached_free_pfn to page block before the block_start_pfn. Link: https://lkml.kernel.org/r/20230803094901.2915942-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20230803094901.2915942-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Kemeng Shi <shikemeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/memcg: fix wrong function name above obj_cgroup_charge_zswap()Miaohe Lin
The correct function name is obj_cgroup_may_zswap(). Correct the comment. Link: https://lkml.kernel.org/r/20230803120021.762279-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/page_alloc: remove unneeded variable baseMiaohe Lin
Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations"), local variable base is just as same as order. So remove it. No functional change intended. Link: https://lkml.kernel.org/r/20230803114934.693989-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/z3fold: use helper function put_z3fold_locked() and put_z3fold_locked_list()Ruan Jinjie
This code is already duplicated six times, use helper function put_z3fold_locked() to release z3fold page instead of open code it to help improve code readability a bit. And add put_z3fold_locked_list() helper function to be consistent with it. No functional change involved. Link: https://lkml.kernel.org/r/20230803113824.886413-1-ruanjinjie@huawei.com Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/damon/sysfs-schemes: support target damos filterSeongJae Park
Extend DAMON sysfs interface to support the DAMON monitoring target based DAMOS filter. Users can use it via writing 'target' to the filter's 'type' file and specifying the index of the target from the corresponding DAMON context's monitoring targets list to 'target_idx' sysfs file. Link: https://lkml.kernel.org/r/20230802214312.110532-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/damon/core: implement target type damos filterSeongJae Park
One DAMON context can have multiple monitoring targets, and DAMOS schemes are applied to all targets. In some cases, users need to apply different scheme to different targets. Retrieving monitoring results via DAMON sysfs interface' 'tried_regions' directory could be one good example. Also, there could be cases that cgroup DAMOS filter is not enough. All such use cases can be worked around by having multiple DAMON contexts having only single target, but it is inefficient in terms of resource usage, thogh the overhead is not estimated to be huge. Implement DAMON monitoring target based DAMOS filter for the case. Like address range target DAMOS filter, handle these filters in the DAMON core layer, since it is more efficient than doing in operations set layer. This also means that regions that filtered out by monitoring target type DAMOS filters are counted as not tried by the scheme. Hence, target granularity monitoring results retrieval via DAMON sysfs interface becomes available. Link: https://lkml.kernel.org/r/20230802214312.110532-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/damon/core-test: add a unit test for __damos_filter_out()SeongJae Park
Implement a kunit test for the core of address range DAMOS filter handling, namely __damos_filter_out(). The test especially focus on regions that overlap with given filter's target address range. Link: https://lkml.kernel.org/r/20230802214312.110532-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/damon/sysfs-schemes: support address range type DAMOS filterSeongJae Park
Extend DAMON sysfs interface to support address range based DAMOS filters, by adding a special keyword for the filter/<N>/type file, namely 'addr', and two files under filter/<N>/ for specifying the start and the end addresses of the range, namely 'addr_start' and 'addr_end'. Link: https://lkml.kernel.org/r/20230802214312.110532-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/damon/core: introduce address range type damos filterSeongJae Park
Patch series "Extend DAMOS filters for address ranges and DAMON monitoring targets" There are use cases that need to apply DAMOS schemes to specific address ranges or DAMON monitoring targets. NUMA nodes in the physical address space, special memory objects in the virtual address space, and monitoring target specific efficient monitoring results snapshot retrieval could be examples of such use cases. This patchset extends DAMOS filters feature for such cases, by implementing two more filter types, namely address ranges and DAMON monitoring types. Patches sequence ---------------- The first seven patches are for the address ranges based DAMOS filter. The first patch implements the filter feature and expose it via DAMON kernel API. The second patch further expose the feature to users via DAMON sysfs interface. The third and fourth patches implement unit tests and selftests for the feature. Three patches (fifth to seventh) updating the documents follow. The following six patches are for the DAMON monitoring target based DAMOS filter. The eighth patch implements the feature in the core layer and expose it via DAMON's kernel API. The ninth patch further expose it to users via DAMON sysfs interface. Tenth patch add a selftest, and two patches (eleventh and twelfth) update documents. [1] https://lore.kernel.org/damon/20230728203444.70703-1-sj@kernel.org/ This patch (of 13): Users can know special characteristic of specific address ranges. NUMA nodes or special objects or buffers in virtual address space could be such examples. For such cases, DAMOS schemes could required to be applied to only specific address ranges. Implement yet another type of DAMOS filter for the purpose. Note that the existing filter types, namely anon pages and memcg DAMOS filters needed page level type check. Because such check can be done efficiently in the opertions set layer, those filters are handled in operations set layer. Specifically, only paddr operations set implementation supports these filters. Also, because statistics counting is done in the DAMON core layer, the regions that filtered out by these filters are counted as tried but failed to the statistics. Unlike those, address range based filters can efficiently handled in the core layer. Hence, do the handling in the layer, and count the regions that filtered out by those as the scheme has not tried for the region. This difference should clearly documented. Link: https://lkml.kernel.org/r/20230802214312.110532-1-sj@kernel.org Link: https://lkml.kernel.org/r/20230802214312.110532-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/damon/sysfs: implement a command for updating only schemes tried total bytesSeongJae Park
Using tried_regions/total_bytes file, users can efficiently retrieve the total size of memory regions having specific access pattern. However, DAMON sysfs interface in kernel still populates all the infomration on the tried_regions subdirectories. That means the kernel part overhead for the construction of tried regions directories still exists. To remove the overhead, implement yet another command input for 'state' DAMON sysfs file. Writing the input to the file makes DAMON sysfs interface to update only the total_bytes file. Link: https://lkml.kernel.org/r/20230802213222.109841-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/damon/sysfs-schemes: implement DAMOS tried total bytes fileSeongJae Park
Patch series "mm/damon/sysfs-schemes: implement DAMOS tried total bytes file". The tried_regions directory of DAMON sysfs interface is useful for retrieving monitoring results snapshot or DAMOS debugging. However, for common use case that need to monitor only the total size of the scheme tried regions (e.g., monitoring working set size), the kernel overhead for directory construction and user overhead for reading the content could be high if the number of monitoring region is not small. This patchset implements DAMON sysfs files for efficient support of the use case. The first patch implements the sysfs file to reduce the user space overhead, and the second patch implements a command for reducing the kernel space overhead. The third patch adds a selftest for the new file, and following two patches update documents. [1] https://lore.kernel.org/damon/20230728201817.70602-1-sj@kernel.org/ This patch (of 5): The tried_regions directory can be used for retrieving the monitoring results snapshot for regions of specific access pattern, by setting the scheme's action as 'stat' and the access pattern as required. While the interface provides every detail of the monitoring results, some use cases including working set size monitoring requires only the total size of the regions. For such cases, users should read all the information and calculate the total size of the regions. However, it could incur high overhead if the number of regions is high. Add a file for retrieving only the information, namely 'total_bytes' file. It allows users to get the total size by reading only the file. Link: https://lkml.kernel.org/r/20230802213222.109841-1-sj@kernel.org Link: https://lkml.kernel.org/r/20230802213222.109841-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21Multi-gen LRU: fix can_swap in lru_gen_look_around()Kalesh Singh
walk->can_swap might be invalid since it's not guaranteed to be initialized for the particular lruvec. Instead deduce it from the folio type (anon/file). Link: https://lkml.kernel.org/r/20230802025606.346758-3-kaleshsingh@google.com Fixes: 018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap") Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek] Tested-by: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Steven Barrett <steven@liquorix.net> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21Multi-gen LRU: avoid race in inc_min_seq()Kalesh Singh
inc_max_seq() will try to inc_min_seq() if nr_gens == MAX_NR_GENS. This is because the generations are reused (the last oldest now empty generation will become the next youngest generation). inc_min_seq() is retried until successful, dropping the lru_lock and yielding the CPU on each failure, and retaking the lock before trying again: while (!inc_min_seq(lruvec, type, can_swap)) { spin_unlock_irq(&lruvec->lru_lock); cond_resched(); spin_lock_irq(&lruvec->lru_lock); } However, the initial condition that required incrementing the min_seq (nr_gens == MAX_NR_GENS) is not retested. This can change by another call to inc_max_seq() from run_aging() with force_scan=true from the debugfs interface. Since the eviction stalls when the nr_gens == MIN_NR_GENS, avoid unnecessarily incrementing the min_seq by rechecking the number of generations before each attempt. This issue was uncovered in previous discussion on the list by Yu Zhao and Aneesh Kumar [1]. [1] https://lore.kernel.org/linux-mm/CAOUHufbO7CaVm=xjEb1avDhHVvnC8pJmGyKcFf2iY_dpf+zR3w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20230802025606.346758-2-kaleshsingh@google.com Fixes: d6c3af7d8a2b ("mm: multi-gen LRU: debugfs interface") Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek] Tested-by: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Steven Barrett <steven@liquorix.net> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21Multi-gen LRU: fix per-zone reclaimKalesh Singh
MGLRU has a LRU list for each zone for each type (anon/file) in each generation: long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; The min_seq (oldest generation) can progress independently for each type but the max_seq (youngest generation) is shared for both anon and file. This is to maintain a common frame of reference. In order for eviction to advance the min_seq of a type, all the per-zone lists in the oldest generation of that type must be empty. The eviction logic only considers pages from eligible zones for eviction or promotion. scan_folios() { ... for (zone = sc->reclaim_idx; zone >= 0; zone--) { ... sort_folio(); // Promote ... isolate_folio(); // Evict } ... } Consider the system has the movable zone configured and default 4 generations. The current state of the system is as shown below (only illustrating one type for simplicity): Type: ANON Zone DMA32 Normal Movable Device Gen 0 0 0 4GB 0 Gen 1 0 1GB 1MB 0 Gen 2 1MB 4GB 1MB 0 Gen 3 1MB 1MB 1MB 0 Now consider there is a GFP_KERNEL allocation request (eligible zone index <= Normal), evict_folios() will return without doing any work since there are no pages to scan in the eligible zones of the oldest generation. Reclaim won't make progress until triggered from a ZONE_MOVABLE allocation request; which may not happen soon if there is a lot of free memory in the movable zone. This can lead to OOM kills, although there is 1GB pages in the Normal zone of Gen 1 that we have not yet tried to reclaim. This issue is not seen in the conventional active/inactive LRU since there are no per-zone lists. If there are no (not enough) folios to scan in the eligible zones, move folios from ineligible zone (zone_index > reclaim_index) to the next generation. This allows for the progression of min_seq and reclaiming from the next generation (Gen 1). Qualcomm, Mediatek and raspberrypi [1] discovered this issue independently. [1] https://github.com/raspberrypi/linux/issues/5395 Link: https://lkml.kernel.org/r/20230802025606.346758-1-kaleshsingh@google.com Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Reported-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek] Tested-by: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Steven Barrett <steven@liquorix.net> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm:vmscan: fix inaccurate reclaim during proactive reclaimEfly Young
Before commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim and swap floods"), proactive reclaim will extreme overreclaim sometimes. But proactive reclaim still inaccurate and some extent overreclaim. Problematic case is easy to construct. Allocate lots of anonymous memory (e.g., 20G) in a memcg, then swapping by writing memory.recalim and there is a certain probability of overreclaim. For example, request 1G by writing memory.reclaim will eventually reclaim 1.7G or other values more than 1G. The reason is that reclaimer may have already reclaimed part of requested memory in one loop, but before adjust sc->nr_to_reclaim in outer loop, call shrink_lruvec() again will still follow the current sc->nr_to_reclaim to work. It will eventually lead to overreclaim. In theory, the amount of reclaimed would be in [request, 2 * request). Reclaimer usually tends to reclaim more than request. But either direct or kswapd reclaim have much smaller nr_to_reclaim targets, so it is less noticeable and not have much impact. Proactive reclaim can usually come in with a larger value, so the error is difficult to ignore. Considering proactive reclaim is usually low frequency, handle the batching into smaller chunks is a better approach. Link: https://lkml.kernel.org/r/20230721014116.3388-1-yangyifei03@kuaishou.com Signed-off-by: Efly Young <yangyifei03@kuaishou.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/damon/core-test: add a test for damos_new_filter()SeongJae Park
damos_new_filter() was having a bug that not initializing ->list field of the returning damos_filter struct, which results in access to uninitialized memory. Add a unit test for the function. Link: https://lkml.kernel.org/r/20230729203733.38949-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/page_alloc: avoid unneeded alike_pages calculationMiaohe Lin
When free_pages is 0, alike_pages is not used. So alike_pages calculation can be avoided by checking free_pages early to save cpu cycles. Also fix typo 'comparable'. It should be 'compatible' here. Link: https://lkml.kernel.org/r/20230801123723.2225543-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/vmstat: remove unused page_ext.h from vmstatKemeng Shi
No page_ext function or structure is used in vmstat. Just remove page_ext header from vmstat. Link: https://lkml.kernel.org/r/20230717113227.1897173-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/page_poison: remove unused page_ext.h from page_poisonKemeng Shi
Patch series "minor cleanups to page_ext header". No page_ext function or structure is used in page_poison. Just remove page_ext header from page_poison. Link: https://lkml.kernel.org/r/20230717113227.1897173-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20230717113227.1897173-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21damon: use pmdp_get instead of drectly dereferencing pmdLevi Yun
As ptep_get, Use the pmdp_get wrapper when we accessing pmdval instead of directly dereferencing pmd. Link: https://lkml.kernel.org/r/20230727212157.2985025-1-ppbuk5246@gmail.com Signed-off-by: Levi Yun <ppbuk5246@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: improve the comment in isolate_migratepages_block()Matthew Wilcox
A recent patch shows that not everybody understands that "stabilise the mapping" really means "prevent the mapping from being freed", so change the wording to hopefully make that more clear. Link: https://lkml.kernel.org/r/ZMLWEB4m3zvX6SBN@casper.infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: kmsan: use helper macros PAGE_ALIGN and PAGE_ALIGN_DOWNZhangPeng
Use helper macros PAGE_ALIGN and PAGE_ALIGN_DOWN to improve code readability. No functional modification involved. Link: https://lkml.kernel.org/r/20230727011612.2721843-4-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marco Elver <elver@google.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: kmsan: use helper macro offset_in_page()ZhangPeng
Use helper macro offset_in_page() to improve code readability. No functional modification involved. Link: https://lkml.kernel.org/r/20230727011612.2721843-3-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marco Elver <elver@google.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: kmsan: use helper function page_size()ZhangPeng
Patch series "minor cleanups for kmsan". Use helper function and macros to improve code readability. No functional modification involved. This patch (of 3): Use function page_size() to improve code readability. No functional modification involved. Link: https://lkml.kernel.org/r/20230727011612.2721843-1-zhangpeng362@huawei.com Link: https://lkml.kernel.org/r/20230727011612.2721843-2-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Marco Elver <elver@google.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/memory.c: fix some kernel-doc commentsYang Li
Add description of @mas and @tree_end, remove @mt in unmap_vmas(). to silence the warnings: mm/memory.c:1837: warning: Function parameter or member 'mas' not described in 'unmap_vmas' mm/memory.c:1837: warning: Function parameter or member 'tree_end' not described in 'unmap_vmas' mm/memory.c:1837: warning: Excess function parameter 'mt' description in 'unmap_vmas' Link: https://lkml.kernel.org/r/20230727015558.69554-1-yang.lee@linux.alibaba.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5996 Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: zswap: kill zswap_get_swap_cache_page()Johannes Weiner
The __read_swap_cache_async() interface isn't more difficult to understand than what the helper abstracts. Save the indirection and a level of indentation for the primary work of the writeback func. Link: https://lkml.kernel.org/r/20230727162343.1415598-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: zswap: tighten up entry invalidationJohannes Weiner
Removing a zswap entry from the tree is tied to an explicit operation that's supposed to drop the base reference: swap invalidation, exclusive load, duplicate store. Don't silently remove the entry on final put, but instead warn if an entry is in tree without reference. While in that diff context, convert a BUG_ON to a WARN_ON_ONCE. No need to crash on a refcount underflow. Link: https://lkml.kernel.org/r/20230727162343.1415598-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: zswap: use zswap_invalidate_entry() for duplicatesJohannes Weiner
Patch series "mm: zswap: three cleanups". Three small cleanups to zswap, the first one suggested by Yosry during the frontswap removal. This patch (of 3): Minor cleanup. Instead of open-coding the tree deletion and the put, use the zswap_invalidate_entry() convenience helper. Link: https://lkml.kernel.org/r/20230727162343.1415598-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20230727162343.1415598-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/page_ext: use page_ext_data helper in page_ownerKemeng Shi
Use page_ext_data helper in page_owner to avoid access offset directly. Link: https://lkml.kernel.org/r/20230718145812.1991717-4-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Andrew Morton <akpm@linux-foudation.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/page_ext: use page_ext_data helper in page_table_checkKemeng Shi
Use page_ext_data helper in page_table_check to avoid access offset directly. Link: https://lkml.kernel.org/r/20230718145812.1991717-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Andrew Morton <akpm@linux-foudation.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21zswap: make zswap_load() take a folioMatthew Wilcox (Oracle)
Only convert a few easy parts of this function to use the folio passed in; convert back to struct page for the majority of it. Removes three hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20230715042343.434588-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21swap: remove some calls to compound_head() in swap_readpage()Matthew Wilcox (Oracle)
Replace six implicit calls to compound_head() with one call to page_folio(). Link: https://lkml.kernel.org/r/20230715042343.434588-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21memcg: convert get_obj_cgroup_from_page to get_obj_cgroup_from_folioMatthew Wilcox (Oracle)
As the one caller now has a folio, pass it in and use it. Removes three calls to compound_head(). Link: https://lkml.kernel.org/r/20230715042343.434588-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21zswap: make zswap_store() take a folioMatthew Wilcox (Oracle)
Patch series "Followup folio conversions for zswap". With frontswap killed, it's worth converting the zswap_load() and zswap_store() functions to take a folio instead of a page pointer. They aren't converted to support large folios, but there are a lot of unnecessary calls to compound_head() that are removed by these patches. This patch (of 4): Only convert a few easy parts of this function to use the folio passed in; convert back to struct page for the majority of it. This does remove a few hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20230715042343.434588-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230715042343.434588-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: kill frontswapJohannes Weiner
The only user of frontswap is zswap, and has been for a long time. Have swap call into zswap directly and remove the indirection. [hannes@cmpxchg.org: remove obsolete comment, per Yosry] Link: https://lkml.kernel.org/r/20230719142832.GA932528@cmpxchg.org [fengwei.yin@intel.com: don't warn if none swapcache folio is passed to zswap_load] Link: https://lkml.kernel.org/r/20230810095652.3905184-1-fengwei.yin@intel.com Link: https://lkml.kernel.org/r/20230717160227.GA867137@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: zswap: multiple zpools supportYosry Ahmed
Support using multiple zpools of the same type in zswap, for concurrency purposes. A fixed number of 32 zpools is suggested by this commit, which was determined empirically. It can be later changed or made into a config option if needed. On a setup with zswap and zsmalloc, comparing a single zpool to 32 zpools shows improvements in the zsmalloc lock contention, especially on the swap out path. The following shows the perf analysis of the swapout path when 10 workloads are simultaneously reclaiming and refaulting tmpfs pages. There are some improvements on the swap in path as well, but less significant. 1 zpool: |--28.99%--zswap_frontswap_store | <snip> | |--8.98%--zpool_map_handle | | | --8.98%--zs_zpool_map | | | --8.95%--zs_map_object | | | --8.38%--_raw_spin_lock | | | --7.39%--queued_spin_lock_slowpath | |--8.82%--zpool_malloc | | | --8.82%--zs_zpool_malloc | | | --8.80%--zs_malloc | | | |--7.21%--_raw_spin_lock | | | | | --6.81%--queued_spin_lock_slowpath <snip> 32 zpools: |--16.73%--zswap_frontswap_store | <snip> | |--1.81%--zpool_malloc | | | --1.81%--zs_zpool_malloc | | | --1.79%--zs_malloc | | | --0.73%--obj_malloc | |--1.06%--zswap_update_total_size | |--0.59%--zpool_map_handle | | | --0.59%--zs_zpool_map | | | --0.57%--zs_map_object | | | --0.51%--_raw_spin_lock <snip> Link: https://lkml.kernel.org/r/20230620194644.3142384-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Suggested-by: Yu Zhao <yuzhao@google.com> Acked-by: Chris Li (Google) <chrisl@kernel.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Nhat Pham <nphamcs@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: multi-gen LRU: don't spin during memcg releaseT.J. Mercier
When a memcg is in the process of being released mem_cgroup_tryget will fail because its reference count has already reached 0. This can happen during reclaim if the memcg has already been offlined, and we reclaim all remaining pages attributed to the offlined memcg. shrink_many attempts to skip the empty memcg in this case, and continue reclaiming from the remaining memcgs in the old generation. If there is only one memcg remaining, or if all remaining memcgs are in the process of being released then shrink_many will spin until all memcgs have finished being released. The release occurs through a workqueue, so it can take a while before kswapd is able to make any further progress. This fix results in reductions in kswapd activity and direct reclaim in a test where 28 apps (working set size > total memory) are repeatedly launched in a random sequence: A B delta ratio(%) allocstall_movable 5962 3539 -2423 -40.64 allocstall_normal 2661 2417 -244 -9.17 kswapd_high_wmark_hit_quickly 53152 7594 -45558 -85.71 pageoutrun 57365 11750 -45615 -79.52 Link: https://lkml.kernel.org/r/20230814151636.1639123-1-tjmercier@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: memory-failure: fix unexpected return value in soft_offline_page()Miaohe Lin
When page_handle_poison() fails to handle the hugepage or free page in retry path, soft_offline_page() will return 0 while -EBUSY is expected in this case. Consequently the user will think soft_offline_page succeeds while it in fact failed. So the user will not try again later in this case. Link: https://lkml.kernel.org/r/20230627112808.1275241-1-linmiaohe@huawei.com Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: add a call to flush_cache_vmap() in vmap_pfn()Alexandre Ghiti
flush_cache_vmap() must be called after new vmalloc mappings are installed in the page table in order to allow architectures to make sure the new mapping is visible. It could lead to a panic since on some architectures (like powerpc), the page table walker could see the wrong pte value and trigger a spurious page fault that can not be resolved (see commit f1cb8f9beba8 ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")). But actually the patch is aiming at riscv: the riscv specification allows the caching of invalid entries in the TLB, and since we recently removed the vmalloc page fault handling, we now need to emit a tlb shootdown whenever a new vmalloc mapping is emitted (https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/). That's a temporary solution, there are ways to avoid that :) Link: https://lkml.kernel.org/r/20230809164633.1556126-1-alexghiti@rivosinc.com Fixes: 3e9a9e256b1e ("mm: add a vmap_pfn function") Reported-by: Dylan Jhong <dylan@andestech.com> Closes: https://lore.kernel.org/linux-riscv/ZMytNY2J8iyjbPPy@atctrx.andestech.com/ Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Reviewed-by: Dylan Jhong <dylan@andestech.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via ↵David Hildenbrand
GUP-fast In contrast to most other GUP code, GUP-fast common page table walking code like gup_pte_range() also handles hugetlb pages. But in contrast to other hugetlb page table walking code, it does not look at the hugetlb PTE abstraction whereby we have only a single logical hugetlb PTE per hugetlb page, even when using multiple cont-PTEs underneath -- which is for example what huge_ptep_get() abstracts. So when we have a hugetlb page that is mapped via cont-PTEs, GUP-fast might stumble over a PTE that does not map the head page of a hugetlb page -- not the first "head" PTE of such a cont mapping. Logically, the whole hugetlb page is mapped (entire_mapcount == 1), but we might end up calling gup_must_unshare() with a tail page of a hugetlb page. We only maintain a single PageAnonExclusive flag per hugetlb page (as hugetlb pages cannot get partially COW-shared), stored for the head page. That flag is clear for all tail pages. So when gup_must_unshare() ends up calling PageAnonExclusive() with a tail page of a hugetlb page: 1) With CONFIG_DEBUG_VM_PGFLAGS Stumbles over the: VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); For example, when executing the COW selftests with 64k hugetlb pages on arm64: [ 61.082187] page:00000000829819ff refcount:3 mapcount:1 mapping:0000000000000000 index:0x1 pfn:0x11ee11 [ 61.082842] head:0000000080f79bf7 order:4 entire_mapcount:1 nr_pages_mapped:0 pincount:2 [ 61.083384] anon flags: 0x17ffff80003000e(referenced|uptodate|dirty|head|mappedtodisk|node=0|zone=2|lastcpupid=0xfffff) [ 61.084101] page_type: 0xffffffff() [ 61.084332] raw: 017ffff800000000 fffffc00037b8401 0000000000000402 0000000200000000 [ 61.084840] raw: 0000000000000010 0000000000000000 00000000ffffffff 0000000000000000 [ 61.085359] head: 017ffff80003000e ffffd9e95b09b788 ffffd9e95b09b788 ffff0007ff63cf71 [ 61.085885] head: 0000000000000000 0000000000000002 00000003ffffffff 0000000000000000 [ 61.086415] page dumped because: VM_BUG_ON_PAGE(PageHuge(page) && !PageHead(page)) [ 61.086914] ------------[ cut here ]------------ [ 61.087220] kernel BUG at include/linux/page-flags.h:990! [ 61.087591] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 61.087999] Modules linked in: ... [ 61.089404] CPU: 0 PID: 4612 Comm: cow Kdump: loaded Not tainted 6.5.0-rc4+ #3 [ 61.089917] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 61.090409] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 61.090897] pc : gup_must_unshare.part.0+0x64/0x98 [ 61.091242] lr : gup_must_unshare.part.0+0x64/0x98 [ 61.091592] sp : ffff8000825eb940 [ 61.091826] x29: ffff8000825eb940 x28: 0000000000000000 x27: fffffc00037b8440 [ 61.092329] x26: 0400000000000001 x25: 0000000000080101 x24: 0000000000080000 [ 61.092835] x23: 0000000000080100 x22: ffff0000cffb9588 x21: ffff0000c8ec6b58 [ 61.093341] x20: 0000ffffad6b1000 x19: fffffc00037b8440 x18: ffffffffffffffff [ 61.093850] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548 [ 61.094358] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021 [ 61.094858] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffffd9e958b7a1c0 [ 61.095359] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 00000000002bffa8 [ 61.095873] x5 : ffff0008bb19e708 x4 : 0000000000000000 x3 : 0000000000000000 [ 61.096380] x2 : 0000000000000000 x1 : ffff0000cf6636c0 x0 : 0000000000000046 [ 61.096894] Call trace: [ 61.097080] gup_must_unshare.part.0+0x64/0x98 [ 61.097392] gup_pte_range+0x3a8/0x3f0 [ 61.097662] gup_pgd_range+0x1ec/0x280 [ 61.097942] lockless_pages_from_mm+0x64/0x1a0 [ 61.098258] internal_get_user_pages_fast+0xe4/0x1d0 [ 61.098612] pin_user_pages_fast+0x58/0x78 [ 61.098917] pin_longterm_test_start+0xf4/0x2b8 [ 61.099243] gup_test_ioctl+0x170/0x3b0 [ 61.099528] __arm64_sys_ioctl+0xa8/0xf0 [ 61.099822] invoke_syscall.constprop.0+0x7c/0xd0 [ 61.100160] el0_svc_common.constprop.0+0xe8/0x100 [ 61.100500] do_el0_svc+0x38/0xa0 [ 61.100736] el0_svc+0x3c/0x198 [ 61.100971] el0t_64_sync_handler+0x134/0x150 [ 61.101280] el0t_64_sync+0x17c/0x180 [ 61.101543] Code: aa1303e0 f00074c1 912b0021 97fffeb2 (d4210000) 2) Without CONFIG_DEBUG_VM_PGFLAGS Always detects "not exclusive" for passed tail pages and refuses to PIN the tail pages R/O, as gup_must_unshare() == true. GUP-fast will fallback to ordinary GUP. As ordinary GUP properly considers the logical hugetlb PTE abstraction in hugetlb_follow_page_mask(), pinning the page will succeed when looking at the PageAnonExclusive on the head page only. So the only real effect of this is that with cont-PTE hugetlb pages, we'll always fallback from GUP-fast to ordinary GUP when not working on the head page, which ends up checking the head page and do the right thing. Consequently, the cow selftests pass with cont-PTE hugetlb pages as well without CONFIG_DEBUG_VM_PGFLAGS. Note that this only applies to anon hugetlb pages that are mapped using cont-PTEs: for example 64k hugetlb pages on a 4k arm64 kernel. ... and only when R/O-pinning (FOLL_PIN) such pages that are mapped into the page table R/O using GUP-fast. On production kernels (and even most debug kernels, that don't set CONFIG_DEBUG_VM_PGFLAGS) this patch should theoretically not be required to be backported. But of course, it does not hurt. Link: https://lkml.kernel.org/r/20230805101256.87306-1-david@redhat.com Fixes: a7f226604170 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>