Age | Commit message (Collapse) | Author |
|
We already have the folio here, so just use it, removing three hidden
calls to compound_head().
Link: https://lkml.kernel.org/r/20241002151403.1345296-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All callers have been converted to use folio_test_ksm() or
PageAnonNotKsm(), so we can remove this wrapper.
Link: https://lkml.kernel.org/r/20241002152533.1350629-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alex Shi <alexs@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Check that this anonymous page is really anonymous, not anonymous-or-KSM.
This optimises the debug check, but its real purpose is to remove the last
two users of PageKsm().
[willy@infradead.org: fix assertions]
Link: https://lkml.kernel.org/r/ZwApWPER7caIA_N3@casper.infradead.org
Link: https://lkml.kernel.org/r/20241002152533.1350629-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alex Shi <alexs@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove a call to PageKSM() by passing the folio containing tmp_page to
should_skip_rmap_item. Removes a hidden call to compound_head().
Link: https://lkml.kernel.org/r/20241002152533.1350629-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alex Shi <alexs@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
By making try_to_merge_two_pages() and stable_tree_search() return a
folio, we can replace kpage with kfolio. This replaces 7 calls to
compound_head() with one.
[cuigaosheng1@huawei.com: add IS_ERR_OR_NULL check for stable_tree_search()]
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Link: https://lkml.kernel.org/r/20241002152533.1350629-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Remove PageKsm()".
The KSM flag is almost always tested on the folio rather than on the page.
This series removes the final users of PageKsm() and makes the flag only
This patch (of 5):
It is safe to use a folio here because all callers took a refcount on this
page. The one wrinkle is that we have to recalculate the value of folio
after splitting the page, since it has probably changed. Replaces nine
calls to compound_head() with one.
Link: https://lkml.kernel.org/r/20241002152533.1350629-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20241002152533.1350629-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alex Shi <alexs@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
By reading the code, I found these variables are never referenced in the
code. Just remove them.
Link: https://lkml.kernel.org/r/20240924021426.1980-1-bajing@cmss.chinamobile.com
Signed-off-by: Ba Jing <bajing@cmss.chinamobile.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There is a unnecessary return statement at the end of void function
cma_activate_area. This can be dropped.
While at it, also fix another warning related to unsigned.
These are reported by checkpatch as well.
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
+unsigned cma_area_count;
WARNING: void function return statements are not generally useful
+ return;
+}
Link: https://lkml.kernel.org/r/20240927181637.19941-1-quic_pintu@quicinc.com
Signed-off-by: Pintu Kumar <quic_pintu@quicinc.com>
Cc: Pintu Agarwal <pintu.ping@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The kernel invalidates the page cache in batches of PAGEVEC_SIZE. For
each batch, it traverses the page cache tree and collects the entries
(folio and shadow entries) in the struct folio_batch. For the shadow
entries present in the folio_batch, it has to traverse the page cache tree
for each individual entry to remove them. This patch optimize this by
removing them in a single tree traversal.
To evaluate the changes, we created 200GiB file on a fuse fs and in a
memcg. We created the shadow entries by triggering reclaim through
memory.reclaim in that specific memcg and measure the simple
fadvise(DONTNEED) operation.
# time xfs_io -c 'fadvise -d 0 ${file_size}' file
time (sec)
Without 5.12 +- 0.061
With-patch 4.19 +- 0.086 (18.16% decrease)
Link: https://lkml.kernel.org/r/20240925224716.2904498-3-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Chris Mason <clm@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: optimize shadow entries removal", v2.
Some of our production workloads which processes a large amount of data
spends considerable amount of CPUs on truncation and invalidation of large
sized files (100s of GiBs of size). Tracing the operations showed that
most of the time is in shadow entries removal. This patch series
optimizes the truncation and invalidation operations.
This patch (of 2):
The kernel truncates the page cache in batches of PAGEVEC_SIZE. For each
batch, it traverses the page cache tree and collects the entries (folio
and shadow entries) in the struct folio_batch. For the shadow entries
present in the folio_batch, it has to traverse the page cache tree for
each individual entry to remove them. This patch optimize this by
removing them in a single tree traversal.
On large machines in our production which run workloads manipulating large
amount of data, we have observed that a large amount of CPUs are spent on
truncation of very large files (100s of GiBs file sizes). More
specifically most of time was spent on shadow entries cleanup, so
optimizing the shadow entries cleanup, even a little bit, has good impact.
To evaluate the changes, we created 200GiB file on a fuse fs and in a
memcg. We created the shadow entries by triggering reclaim through
memory.reclaim in that specific memcg and measure the simple truncation
operation.
# time truncate -s 0 file
time (sec)
Without 5.164 +- 0.059
With-patch 4.21 +- 0.066 (18.47% decrease)
Link: https://lkml.kernel.org/r/20240925224716.2904498-1-shakeel.butt@linux.dev
Link: https://lkml.kernel.org/r/20240925224716.2904498-2-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chris Mason <clm@fb.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Bits of LRU_REFS_MASK are not inherited during migration which lead to new
folio start from tier0 when MGLRU enabled. Try to bring as much bits of
folio->flags as possible since compaction and alloc_contig_range which
introduce migration do happen at times.
Link: https://lkml.kernel.org/r/20240926050647.5653-1-zhaoyang.huang@unisoc.com
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Suggested-by: Yu Zhao <yuzhao@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now no users are using the pte_offset_map_nolock(), remove it.
Link: https://lkml.kernel.org/r/d04f9bbbcde048fb6ffa6f2bdbc6f9b22d5286f9.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In walk_pte_range(), we may modify the pte entry after holding the ptl, so
convert it to using pte_offset_map_rw_nolock(). At this time, the
pte_same() check is not performed after the ptl held, so we should get
pmdval and do pmd_same() check to ensure the stability of pmd entry.
Link: https://lkml.kernel.org/r/7e9c194a5efacc9609cfd31abb9c7df88b53b530.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring
the ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
will use pte_same() to detect the change of the pte entry, there is no
need to get pmdval, so just pass a dummy variable to it.
Link: https://lkml.kernel.org/r/1530e8fdbfc72eacf3b095babe139ce3d715600a.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In the caller of map_pte(), we may modify the pvmw->pte after acquiring
the pvmw->ptl, so convert it to using pte_offset_map_rw_nolock(). At this
time, the pte_same() check is not performed after the pvmw->ptl held, so
we should get pmdval and do pmd_same() check to ensure the stability of
pvmw->pmd.
Link: https://lkml.kernel.org/r/2620a48f34c9f19864ab0169cdbf253d31a8fcaa.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so
convert it to using pte_offset_map_rw_nolock(). Now new_pte is none, so
hpage_collapse_scan_file() path can not find this by traversing
file->f_mapping, so there is no concurrency with retract_page_tables().
In addition, we already hold the exclusive mmap_lock, so this new_pte page
is stable, so there is no need to get pmdval and do pmd_same() check.
Link: https://lkml.kernel.org/r/9d582a09dbcf12e562ac5fe0ba05e9248a58f5e0.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In copy_pte_range(), we may modify the src_pte entry after holding the
src_ptl, so convert it to using pte_offset_map_rw_nolock(). Since we
already hold the exclusive mmap_lock, and the copy_pte_range() and
retract_page_tables() are using vma->anon_vma to be exclusive, so the PTE
page is stable, there is no need to get pmdval and do pmd_same() check.
Link: https://lkml.kernel.org/r/9166f6fad806efbca72e318ab6f0f8af458056a9.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
acquiring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
this time, the pte_same() check is not performed after the PTL held. So
we should get pgt_pmd and do pmd_same() check after the ptl held.
Link: https://lkml.kernel.org/r/055e42db68da00ac8ecab94bd2633c7cd965eb1c.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In handle_pte_fault(), we may modify the vmf->pte after acquiring the
vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
will do the pte_same() check, so there is no need to get pmdval to do
pmd_same() check, just pass a dummy variable to it.
Link: https://lkml.kernel.org/r/af8d694853b44c5a6018403ae435440e275854c7.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In do_adjust_pte(), we may modify the pte entry. The corresponding pmd
entry may have been modified concurrently. Therefore, in order to ensure
the stability if pmd entry, use pte_offset_map_rw_nolock() to replace
pte_offset_map_nolock(), and do pmd_same() check after holding the PTL.
All callers of update_mmu_cache_range() hold the vmf->ptl, so we can
determined whether split PTE locks is being used by doing the following,
just as we do elsewhere in the kernel.
ptl != vmf->ptl
And then we can delete the do_pte_lock() and do_pte_unlock().
Link: https://lkml.kernel.org/r/0eaf6b69aeb2fe35092a633fed12537efe645303.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In __collapse_huge_page_swapin(), we just use the ptl for pte_same() check
in do_swap_page(). In other places, we directly use
pte_offset_map_lock(), so convert it to using pte_offset_map_ro_nolock().
Link: https://lkml.kernel.org/r/dc97a6c3cb9ea80cab30c5626eeea79959d93258.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In filemap_fault_recheck_pte_none(), we just do pte_none() check, so
convert it to using pte_offset_map_ro_nolock().
Link: https://lkml.kernel.org/r/9f7cbbaa772385ced1b8931b67a8b9d246c9b82d.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In assert_pte_locked(), we just get the ptl and assert if it was already
held, so convert it to using pte_offset_map_ro_nolock().
Link: https://lkml.kernel.org/r/42559e042eb6fc3129a40f710d671712030646b4.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "introduce pte_offset_map_{ro|rw}_nolock()", v5.
As proposed by David Hildenbrand [1], this series introduces the following
two new helper functions to replace pte_offset_map_nolock().
1. pte_offset_map_ro_nolock()
2. pte_offset_map_rw_nolock()
As the name suggests, pte_offset_map_ro_nolock() is used for read-only
case. In this case, only read-only operations will be performed on PTE
page after the PTL is held. The RCU lock in pte_offset_map_nolock() will
ensure that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified. Therefore
pte_offset_map_ro_nolock() is just a renamed version of
pte_offset_map_nolock().
pte_offset_map_rw_nolock() is used for may-write case. In this case, the
pte or pmd entry may be modified after the PTL is held, so we need to
ensure that the pmd entry has not been modified concurrently. So in
addition to the name change, it also outputs the pmdval when successful.
The users should make sure the page table is stable like checking
pte_same() or checking pmd_same() by using the output pmdval before
performing the write operations.
This series will convert all pte_offset_map_nolock() into the above two
helper functions one by one, and finally completely delete it.
This also a preparation for reclaiming the empty user PTE page table
pages.
This patch (of 13):
Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:
1) After acquiring PTL, only read-only operations are performed on the PTE
page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
that the PTE page will not be freed, and there is no need to worry
about whether the pmd entry is modified.
2) After acquiring PTL, the pte or pmd entries may be modified. At this
time, we need to ensure that the pmd entry has not been modified
concurrently.
To more clearing distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. It is applicable for may-write cases where any
modification operations to the page table may happen after the
corresponding spinlock is held afterwards. But the users should make sure
the page table is stable like checking pte_same() or checking pmd_same()
by using the output pmdval before performing the write operations.
Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.
Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.
Link: https://lkml.kernel.org/r/cover.1727332572.git.zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/5aeecfa131600a454b1f3a038a1a54282ca3b856.1727332572.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The types of mm flags are now far beyond the core dump related features.
This patch moves mm flags from linux/sched/coredump.h to linux/mm_types.h.
The linux/sched/coredump.h has include the mm_types.h, so the C files
related to coredump does not need to change head file inclusion. In
addition, the inclusion of sched/coredump.h now can be deleted from the C
files that irrelevant to core dump.
Link: https://lkml.kernel.org/r/20240926074922.2721274-1-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The process_madvise() call was introduced in commit ecb8ac8b1f14
("mm/madvise: introduce process_madvise() syscall: an external memory
hinting API") as a means of performing madvise() operations on another
process.
However, as it provides the means by which to perform multiple madvise()
operations in a batch via an iovec, it is useful to utilise the same
interface for performing operations on the current process rather than a
remote one.
Commit 22af8caff7d1 ("mm/madvise: process_madvise() drop capability check
if same mm") removed the need for a caller invoking process_madvise() on
its own pidfd to possess the CAP_SYS_NICE capability, however this leaves
the restrictions on operation in place.
Resolve this by only applying the restriction on operations when accessing
a remote process.
Moving forward we plan to implement a simpler means of specifying this
condition other than needing to establish a self pidfd, perhaps in the
form of a sentinel pidfd.
Also take the opportunity to refactor the system call implementation
abstracting the vectorised operation.
Link: https://lkml.kernel.org/r/20240926151019.82902-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Let's improve the test output. For example, print the proper test result.
Install a SIGBUS handler to catch any SIGBUS instead of crashing the test
on failure.
With unsuitable hugetlb page count:
$ ./hugetlb_fault_after_madv
TAP version 13
1..1
# [INFO] detected default hugetlb page size: 2048 KiB
ok 2 # SKIP This test needs one and only one page to execute. Got 0
# Totals: pass:0 fail:0 xfail:0 xpass:0 skip:1 error:0
On a failure:
$ ./hugetlb_fault_after_madv
TAP version 13
1..1
not ok 1 SIGBUS behavior
Bail out! 1 out of 1 tests failed
On success:
$ ./hugetlb_fault_after_madv
TAP version 13
1..1
# [INFO] detected default hugetlb page size: 2048 KiB
ok 1 SIGBUS behavior
# Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
Link: https://lkml.kernel.org/r/20240926152044.2205129-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "selftests/mm: hugetlb_fault_after_madv improvements".
Mario brought to my attention that the hugetlb_fault_after_madv test is
currently always skipped on s390x. Let's adjust the test to be
independent of the default hugetlb page size and while at it, also improve
the test output.
This patch (of 2):
We currently assume that the hugetlb page size is 2 MiB, which is why we
mmap() a 2 MiB range.
Is the default hugetlb size is larger, mmap() will fail because the range
is not suitable. If the default hugetlb size is smaller (e.g., s390x),
mmap() will fail because we would need more than one hugetlb page, but
just asserted that we have exactly one.
So let's simply use the default hugetlb page size instead of hard-coded 2
MiB, so the test isn't unconditionally skipped on architectures like
s390x.
Before this patch on s390x:
$ ./hugetlb_fault_after_madv
1..0 # SKIP Failed to allocated huge page
With this change on s390x:
$ ./hugetlb_fault_after_madv
While at it, make "huge_ptr" static.
Link: https://lkml.kernel.org/r/20240926152044.2205129-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240926152044.2205129-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Mario Casquero <mcasquer@redhat.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Fix typo in mempolicy.h and Correct the number of allowed memory policy
Link: https://lkml.kernel.org/r/20240926183516.4034-2-tanyaagarwal25699@gmail.com
Signed-off-by: Tanya Agarwal <tanyaagarwal25699@gmail.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Cc: Anup Sharma <anupnewsmail@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It is needed to ensure sc->nr.unqueued_dirty > 0, which can avoid setting
PGDAT_DIRTY flag when sc->nr.unqueued_dirty and sc->nr.file_taken are both
zero.
Link: https://lkml.kernel.org/r/20240112012353.1387-1-justinjiang@vivo.com
Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In mast_fill_bnode(), we first clear some fields of maple_big_node and set
the 'type' unconditionally before return. This means we won't leverage
any information in maple_big_node and it is safe to clear the whole
structure.
In maple_big_node, we define slot and padding/gap in a union. And based
on current definition of MAPLE_BIG_NODE_SLOTS/GAPS, padding is always less
than slot and part of the gap is overlapped by slot.
For example on 64bit system:
MAPLE_BIG_NODE_SLOT is 34
MAPLE_BIG_NODE_GAP is 21
With this knowledge, current code may clear some space by twice. And
this could be avoid by clearing the structure as a whole.
Link: https://lkml.kernel.org/r/20240908140554.20378-3-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Reduce the space to be cleared for maple_big_node", v2.
Found current code may clear maple_big_node redundantly.
First we define a field parent, which is never used. After removing this,
we reduce the size of memory to be cleared by memset.
Then mast_fill_bnode() clears part of the structure twice, since slot and
gap share some space. By clearing the whole structure, we can avoid this.
This patch (of 2):
The member parent of maple_big_node is never used.
Let's remove it which could reduce the number of space to be cleared on
memset.
Link: https://lkml.kernel.org/r/20240908140554.20378-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20240908140554.20378-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When we break the loop after assigning a pivot, the index i/j is not
changed. Then the following code assign pivot, which means we do the
assignment with same i/j by mas_safe_pivot.
Since the loop condition is (i < piv_end), from which we can get i is less
than mt_pivots[mt]. It implies mas_safe_pivot() return pivot[i] which is
the same value we get in loop.
Now we can conclude it does a redundant assignment on a pivot of 0. Let's
just go to complete to avoid it.
Link: https://lkml.kernel.org/r/20240911142759.20989-3-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "refine mas_mab_cp()".
By analysis of the code, one condition check can be removed and one case
would hit a redundant assignment.
This patch (of 2):
mas_mab_cp() copy range [mas_start, mas_end] inclusively from a
maple_node to maple_big_node. This implies mas_start <= mas_end.
Based on the relationship of mas_start and mas_end, we can have the
following four cases:
| mas_start == mas_end | mas_start < mas_end
---------------+----------------------+----------------------
mas_start == 0 | 1 | 2
---------------+----------------------+----------------------
mas_start != 0 | 3 | 4
We can see in all these four cases, i is always less than or equal to
mas_end after finish the loop:
Case 1: After assign pivot 0, i is set to 1, which is bigger than
mas_end 0. So it jumps to complete and skip the check.
Case 2: After assign pivot 0, i is set to 1.
∵ (mas_start < mas_end) && (mas_start == 0)
==> (1 <= mas_end)
∵ (i == 1) && (1 <= mas_end)
==> (i <= mas_end)
∴ Before loop, we have (i <= mas_end). And we still hold this
if it skips the loop. For example, (i == mas_end).
Now let's see what happens in the loop:
∵ piv_end = min(mas_end, mt_pivots[mt])
==> (piv_end <= mas_end)
∵ loop condition is (i < piv_end)
==> (i <= piv_end) on finish the loop both normally or break
∵ (i <= piv_end) && (piv_end <= mas_end)
==> (i <= mas_end)
∴ After loop, we still get (i <= mas_end) in this case
Case 3: This case would skip both if clause and loop. So when it comes
to the check, i is still mas_start which equals to mas_end.
Case 4: This case would skip the if clause.
∵ (mas_start < mas_end) && (i == mas_start)
==> (i < mas_end)
∴ Before loop, we have (i < mas_end).
The loop process is similar with Case 2, so we get the same
result.
Now we can conclude in all cases, we get (i <= mas_end) when doing
check. Then it is not necessary to do the check.
Link: https://lkml.kernel.org/r/20240911142759.20989-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20240911142759.20989-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
mm_access() can return NULL if the mm is not found, but this is handled
the same as an error in all callers, with some translating this into an
-ESRCH error.
Only proc_mem_open() returns NULL if no mm is found, however in this case
it is clearer and makes more sense to explicitly handle the error.
Additionally we take the opportunity to refactor the function to eliminate
unnecessary nesting.
Simplify things by simply returning -ESRCH if no mm is found - this both
eliminates confusing use of the IS_ERR_OR_NULL() macro, and simplifies
callers which would return -ESRCH by returning this error directly.
[lorenzo.stoakes@oracle.com: prefer neater pointer error comparison]
Link: https://lkml.kernel.org/r/2fae1834-749a-45e1-8594-5e5979cf7103@lucifer.local
Link: https://lkml.kernel.org/r/20240924201023.193135-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We now have only one active post-processing at any time, so we don't have
same race conditions that we had before. If slot selected for
post-processing gets freed or freed and reallocated it loses its PP_SLOT
flag and there is no way for such a slot to gain PP_SLOT flag again until
current post-processing terminates.
Link: https://lkml.kernel.org/r/20240917021020.883356-8-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Drop some redundant zram_test_flag() calls and re-order zram_clear_flag()
calls. Plus two small trivial coding style fixes. No functional changes.
Link: https://lkml.kernel.org/r/20240917021020.883356-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ZRAM_SAME slots cannot be post-processed (writeback or recompress) so do
not mark them ZRAM_IDLE. Same with ZRAM_WB slots, they cannot be
ZRAM_IDLE because they are not in zsmalloc pool anymore.
Link: https://lkml.kernel.org/r/20240917021020.883356-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Writeback suffers from the same problem as recompression did before -
target slot selection for writeback is just a simple iteration over
zram->table entries (stored pages) which selects suboptimal targets for
writeback. This is especially problematic for writeback, because we
uncompress objects before writeback so each of them takes 4K out of
limited writeback storage. For example, when we take a 48 bytes slot and
store it as a 4K object to writeback device we only save 48 bytes of
memory (release from zsmalloc pool). We naturally want to pick the
largest objects for writeback, because then each writeback will release
the largest amount of memory.
This patch applies the same solution and strategy as for recompression
target selection: pp control (post-process) with 16 buckets of candidate
pp slots. Slots are assigned to pp buckets based on sizes - the larger
the slot the higher the group index. This gives us sorted by size lists
of candidate slots (in linear time), so that among post-processing
candidate slots we always select the largest ones first and maximize the
memory saving.
TEST
====
A very simple demonstration: zram is configured with a writeback device.
A limited writeback (wb_limit 2500 pages) is performed then, with a log of
sizes of slots that were written back. You can see that patched zram
selects slots for recompression in significantly different manner, which
leads to higher memory savings (see column #2 of mm_stat output).
BASE
----
*** initial state of zram device
/sys/block/zram0/mm_stat
1750327296 619765836 631902208 0 631902208 1 0 34278 34278
*** writeback idle wb_limit 2500
/sys/block/zram0/mm_stat
1750327296 617622333 631578624 0 631902208 1 0 34278 34278
Sizes of selected objects for writeback:
... 193 349 46 46 46 46 852 1002 543 162 107 49 34 34 34 ...
PATCHED
-------
*** initial state of zram device
/sys/block/zram0/mm_stat
1750319104 619760957 631992320 0 631992320 1 0 34278 34278
*** writeback idle wb_limit 2500
/sys/block/zram0/mm_stat
1750319104 612672056 626135040 0 631992320 1 0 34278 34278
Sizes of selected objects for writeback:
... 3667 3580 3581 3580 3581 3581 3581 3231 3211 3203 3231 3246 ...
Note, pp-slots are not strictly sorted, there is a PP_BUCKET_SIZE_RANGE
variation of sizes within particular bucket.
Link: https://lkml.kernel.org/r/20240917021020.883356-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Target slot selection for recompression is just a simple iteration over
zram->table entries (stored pages) from slot 0 to max slot. Given that
zram->table slots are written in random order and are not sorted by size,
a simple iteration over slots selects suboptimal targets for
recompression. This is not a problem if we recompress every single
zram->table slot, but we never do that in reality. In reality we limit
the number of slots we can recompress (via max_pages parameter) and hence
proper slot selection becomes very important. The strategy is quite
simple, suppose we have two candidate slots for recompression, one of size
48 bytes and one of size 2800 bytes, and we can recompress only one, then
it certainly makes more sense to pick 2800 entry for recompression.
Because even if we manage to compress 48 bytes objects even further the
savings are going to be very small. Potential savings after good
re-compression of 2800 bytes objects are much higher.
This patch reworks slot selection and introduces the strategy described
above: among candidate slots always select the biggest ones first.
For that the patch introduces zram_pp_ctl (post-processing) structure
which holds NUM_PP_BUCKETS pp buckets of slots. Slots are assigned to a
particular group based on their sizes - the larger the size of the slot
the higher the group index. This, basically, sorts slots by size in liner
time (we still perform just one iteration over zram->table slots). When
we select slot for recompression we always first lookup in higher pp
buckets (those that hold the largest slots). Which achieves the desired
behavior.
TEST
====
A very simple demonstration: zram is configured with zstd, and zstd with
dict as a recompression stream. A limited (max 4096 pages) recompression
is performed then, with a log of sizes of slots that were recompressed.
You can see that patched zram selects slots for recompression in
significantly different manner, which leads to higher memory savings (see
column #2 of mm_stat output).
BASE
----
*** initial state of zram device
/sys/block/zram0/mm_stat
1750994944 504491413 514203648 0 514203648 1 0 34204 34204
*** recompress idle max_pages=4096
/sys/block/zram0/mm_stat
1750994944 504262229 514953216 0 514203648 1 0 34204 34204
Sizes of selected objects for recompression:
... 45 58 24 226 91 40 24 24 24 424 2104 93 2078 2078 2078 959 154 ...
PATCHED
-------
*** initial state of zram device
/sys/block/zram0/mm_stat
1750982656 504492801 514170880 0 514170880 1 0 34204 34204
*** recompress idle max_pages=4096
/sys/block/zram0/mm_stat
1750982656 503716710 517586944 0 514170880 1 0 34204 34204
Sizes of selected objects for recompression:
... 3680 3694 3667 3590 3614 3553 3537 3548 3550 3542 3543 3537 ...
Note, pp-slots are not strictly sorted, there is a PP_BUCKET_SIZE_RANGE
variation of sizes within particular bucket.
[senozhatsky@chromium.org: do not skip the first bucket]
Link: https://lkml.kernel.org/r/20241001085634.1948384-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20240917021020.883356-4-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Both recompress and writeback soon will unlock slots during processing,
which makes things too complex wrt possible race-conditions. We still
want to clear PP_SLOT in slot_free, because this is how we figure out that
slot that was selected for post-processing has been released under us and
when we start post-processing we check if slot still has PP_SLOT set. At
the same time, theoretically, we can have something like this:
CPU0 CPU1
recompress
scan slots
set PP_SLOT
unlock slot
slot_free
clear PP_SLOT
allocate PP_SLOT
writeback
scan slots
set PP_SLOT
unlock slot
select PP-slot
test PP_SLOT
So recompress will not detect that slot has been re-used and re-selected
for concurrent writeback post-processing.
Make sure that we only permit on post-processing operation at a time. So
now recompress and writeback post-processing don't race against each
other, we only need to handle slot re-use (slot_free and write), which is
handled individually by each pp operation.
Having recompress and writeback competing for the same slots is not
exactly good anyway (can't imagine anyone doing that).
Link: https://lkml.kernel.org/r/20240917021020.883356-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "zram: optimal post-processing target selection", v5.
Problem:
--------
Both recompression and writeback perform a very simple linear scan of all
zram slots in search for post-processing (writeback or recompress)
candidate slots. This often means that we pick the worst candidate for pp
(post-processing), e.g. a 48 bytes object for writeback, which is nearly
useless, because it only releases 48 bytes from zsmalloc pool, but
consumes an entire 4K slot in the backing device. Similarly,
recompression of an 48 bytes objects is unlikely to save more memory that
recompression of a 3000 bytes object. Both recompression and writeback
consume constrained resources (CPU time, batter, backing device storage
space) and quite often have a (daily) limit on the number of items they
post-process, so we should utilize those constrained resources in the most
optimal way.
Solution:
---------
This patch reworks the way we select pp targets. We, quite clearly, want
to sort all the candidates and always pick the largest, be it
recompression or writeback. Especially for writeback, because the larger
object we writeback the more memory we release. This series introduces
concept of pp buckets and pp scan/selection.
The scan step is a simple iteration over all zram->table entries, just
like what we currently do, but we don't post-process a candidate slot
immediately. Instead we assign it to a PP (post-processing) bucket. PP
bucket is, basically, a list which holds pp candidate slots that belong to
the same size class. PP buckets are 64 bytes apart, slots are not
strictly sorted within a bucket there is a 64 bytes variance.
The select step simply iterates over pp buckets from highest to lowest and
picks all candidate slots a particular buckets contains. So this gives us
sorted candidates (in linear time) and allows us to select most optimal
(largest) candidates for post-processing first.
This patch (of 7):
This flag indicates that the slot was selected as a candidate slot for
post-processing (pp) and was assigned to a pp bucket. It does not
necessarily mean that the slot is currently under post-processing, but may
mean so. The slot can loose its PP_SLOT flag, while still being in the
pp-bucket, if it's accessed or slot_free-ed.
Link: https://lkml.kernel.org/r/20240917021020.883356-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20240917021020.883356-2-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
into one operation
When compiling kernel source 'make -j $(nproc)' with the up-and-running
KASAN-enabled kernel on a 256-core machine, the following soft lockup is
shown:
watchdog: BUG: soft lockup - CPU#28 stuck for 22s! [kworker/28:1:1760]
CPU: 28 PID: 1760 Comm: kworker/28:1 Kdump: loaded Not tainted 6.10.0-rc5 #95
Workqueue: events drain_vmap_area_work
RIP: 0010:smp_call_function_many_cond+0x1d8/0xbb0
Code: 38 c8 7c 08 84 c9 0f 85 49 08 00 00 8b 45 08 a8 01 74 2e 48 89 f1 49 89 f7 48 c1 e9 03 41 83 e7 07 4c 01 e9 41 83 c7 03 f3 90 <0f> b6 01 41 38 c7 7c 08 84 c0 0f 85 d4 06 00 00 8b 45 08 a8 01 75
RSP: 0018:ffffc9000cb3fb60 EFLAGS: 00000202
RAX: 0000000000000011 RBX: ffff8883bc4469c0 RCX: ffffed10776e9949
RDX: 0000000000000002 RSI: ffff8883bb74ca48 RDI: ffffffff8434dc50
RBP: ffff8883bb74ca40 R08: ffff888103585dc0 R09: ffff8884533a1800
R10: 0000000000000004 R11: ffffffffffffffff R12: ffffed1077888d39
R13: dffffc0000000000 R14: ffffed1077888d38 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff8883bc400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005577b5c8d158 CR3: 0000000004850000 CR4: 0000000000350ef0
Call Trace:
<IRQ>
? watchdog_timer_fn+0x2cd/0x390
? __pfx_watchdog_timer_fn+0x10/0x10
? __hrtimer_run_queues+0x300/0x6d0
? sched_clock_cpu+0x69/0x4e0
? __pfx___hrtimer_run_queues+0x10/0x10
? srso_return_thunk+0x5/0x5f
? ktime_get_update_offsets_now+0x7f/0x2a0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? hrtimer_interrupt+0x2ca/0x760
? __sysvec_apic_timer_interrupt+0x8c/0x2b0
? sysvec_apic_timer_interrupt+0x6a/0x90
</IRQ>
<TASK>
? asm_sysvec_apic_timer_interrupt+0x16/0x20
? smp_call_function_many_cond+0x1d8/0xbb0
? __pfx_do_kernel_range_flush+0x10/0x10
on_each_cpu_cond_mask+0x20/0x40
flush_tlb_kernel_range+0x19b/0x250
? srso_return_thunk+0x5/0x5f
? kasan_release_vmalloc+0xa7/0xc0
purge_vmap_node+0x357/0x820
? __pfx_purge_vmap_node+0x10/0x10
__purge_vmap_area_lazy+0x5b8/0xa10
drain_vmap_area_work+0x21/0x30
process_one_work+0x661/0x10b0
worker_thread+0x844/0x10e0
? srso_return_thunk+0x5/0x5f
? __kthread_parkme+0x82/0x140
? __pfx_worker_thread+0x10/0x10
kthread+0x2a5/0x370
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Debugging Analysis:
1. The following ftrace log shows that the lockup CPU spends too much
time iterating vmap_nodes and flushing TLB when purging vm_area
structures. (Some info is trimmed).
kworker: funcgraph_entry: | drain_vmap_area_work() {
kworker: funcgraph_entry: | mutex_lock() {
kworker: funcgraph_entry: 1.092 us | __cond_resched();
kworker: funcgraph_exit: 3.306 us | }
... ...
kworker: funcgraph_entry: | flush_tlb_kernel_range() {
... ...
kworker: funcgraph_exit: # 7533.649 us | }
... ...
kworker: funcgraph_entry: 2.344 us | mutex_unlock();
kworker: funcgraph_exit: $ 23871554 us | }
The drain_vmap_area_work() spends over 23 seconds.
There are 2805 flush_tlb_kernel_range() calls in the ftrace log.
* One is called in __purge_vmap_area_lazy().
* Others are called by purge_vmap_node->kasan_release_vmalloc.
purge_vmap_node() iteratively releases kasan vmalloc
allocations and flushes TLB for each vmap_area.
- [Rough calculation] Each flush_tlb_kernel_range() runs
about 7.5ms.
-- 2804 * 7.5ms = 21.03 seconds.
-- That's why a soft lock is triggered.
2. Extending the soft lockup time can work around the issue (For example,
# echo 60 > /proc/sys/kernel/watchdog_thresh). This confirms the
above-mentioned speculation: drain_vmap_area_work() spends too much
time.
If we combine all TLB flush operations of the KASAN shadow virtual
address into one operation in the call path
'purge_vmap_node()->kasan_release_vmalloc()', the running time of
drain_vmap_area_work() can be saved greatly. The idea is from the
flush_tlb_kernel_range() call in __purge_vmap_area_lazy(). And, the
soft lockup won't be triggered.
Here is the test result based on 6.10:
[6.10 wo/ the patch]
1. ftrace latency profiling (record a trace if the latency > 20s).
echo 20000000 > /sys/kernel/debug/tracing/tracing_thresh
echo drain_vmap_area_work > /sys/kernel/debug/tracing/set_graph_function
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
2. Run `make -j $(nproc)` to compile the kernel source
3. Once the soft lockup is reproduced, check the ftrace log:
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
76) $ 50412985 us | } /* __purge_vmap_area_lazy */
76) $ 50412997 us | } /* drain_vmap_area_work */
76) $ 29165911 us | } /* __purge_vmap_area_lazy */
76) $ 29165926 us | } /* drain_vmap_area_work */
91) $ 53629423 us | } /* __purge_vmap_area_lazy */
91) $ 53629434 us | } /* drain_vmap_area_work */
91) $ 28121014 us | } /* __purge_vmap_area_lazy */
91) $ 28121026 us | } /* drain_vmap_area_work */
[6.10 w/ the patch]
1. Repeat step 1-2 in "[6.10 wo/ the patch]"
2. The soft lockup is not triggered and ftrace log is empty.
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
3. Setting 'tracing_thresh' to 10/5 seconds does not get any ftrace
log.
4. Setting 'tracing_thresh' to 1 second gets ftrace log.
cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
23) $ 1074942 us | } /* __purge_vmap_area_lazy */
23) $ 1074950 us | } /* drain_vmap_area_work */
The worst execution time of drain_vmap_area_work() is about 1 second.
Link: https://lore.kernel.org/lkml/ZqFlawuVnOMY2k3E@pc638.lan/
Link: https://lkml.kernel.org/r/20240726165246.31326-1-ahuang12@lenovo.com
Fixes: 282631cb2447 ("mm: vmalloc: remove global purge_vmap_area_root rb-tree")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Co-developed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Jiwei Sun <sunjw10@lenovo.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In proactive memory reclamation scenarios, it is necessary to estimate the
pswpin and pswpout metrics of the cgroup to determine whether to continue
reclaiming anonymous pages in the current batch. This patch will collect
these metrics and expose them.
[linuszeng@tencent.com: v2]
Link: https://lkml.kernel.org/r/20240830082244.156923-1-jingxiangzeng.cas@gmail.com
Li nk: https://lkml.kernel.org/r/20240913084453.3605621-1-jingxiangzeng.cas@gmail.com
Link: https://lkml.kernel.org/r/20240830082244.156923-1-jingxiangzeng.cas@gmail.com
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
sparse warns about zero initializing an array with {0,}, change it to
the equivalent {0}.
Fixes the sparse warning:
mm/damon/tests/vaddr-kunit.h:69:47: warning: missing braces around initializer
Link: https://lkml.kernel.org/r/xriwklcwjpwcz7eiavo6f7envdar4jychhsk6sfkj5klaznb6b@j6vrvr2sxjht
Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests")
Signed-off-by: Leo Stone <leocstone@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Jinjie Ruan <ruanjinjie@huawei.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Shmem has a separate interface (different from anonymous pages) to control
huge page allocation, that means shmem THP can be enabled while anonymous
THP is disabled. However, in this case, khugepaged will not start to
collapse shmem THP, which is unreasonable.
To fix this issue, we should call start_stop_khugepaged() to activate or
deactivate the khugepaged thread when setting shmem mTHP interfaces.
Moreover, add a new helper shmem_hpage_pmd_enabled() to help to check
whether shmem THP is enabled, which will determine if khugepaged should be
activated.
Link: https://lkml.kernel.org/r/9b9c6cbc4499bf44c6455367fd9e0f6036525680.1726978977.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 6998a73efbb8 ("selftests/mm: Add new testcases for pkeys") and
commit 3a103b5315b7 ("selftest: mm: Test if hugepage does not get leaked
during __bio_release_pages()") generate test binaries hugetlb_dio,
pkey_sighandler_tests_32 and pkey_sighandler_tests_64 but did not add
these to .gitignore. Correct this.
Link: https://lkml.kernel.org/r/20240924185911.117937-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Keith Lucas <keith.lucas@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pick these into mm-stable:
5de195060b2e mm: resolve faulty mmap_region() error path behaviour
5baf8b037deb mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
0fb4a7ad270b mm: refactor map_deny_write_exec()
4080ef1579b2 mm: unconditionally close VMAs on error
3dd6ed34ce1f mm: avoid unsafe VMA hook invocation when error arises on mmap hook
f8f931bba0f9 mm/thp: fix deferred split unqueue naming and locking
e66f3185fa04 mm/thp: fix deferred split queue not partially_mapped
to get a clean merge of these from mm-unstable into mm-stable:
Subject: memcg-v1: fully deprecate move_charge_at_immigrate
Subject: memcg-v1: remove charge move code
Subject: memcg-v1: no need for memcg locking for dirty tracking
Subject: memcg-v1: no need for memcg locking for writeback tracking
Subject: memcg-v1: no need for memcg locking for MGLRU
Subject: memcg-v1: remove memcg move locking code
Subject: tools: testing: add additional vma_internal.h stubs
Subject: mm: isolate mmap internal logic to mm/vma.c
Subject: mm: refactor __mmap_region()
Subject: mm: remove unnecessary reset state logic on merge new VMA
Subject: mm: defer second attempt at merge on mmap()
Subject: mm/vma: the pgoff is correct if can_merge_right
Subject: memcg: workingset: remove folio_memcg_rcu usage
|
|
The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise and incomplete
state, memory leaks and other unpleasantness can occur.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently
observed issues with resource leaks and observable inconsistent state.
Taking advantage of previous patches in this series we move a number of
checks earlier in the code, simplifying things by moving the core of the
logic into a static internal function __mmap_region().
Doing this allows us to perform a number of checks up front before we do
any real work, and allows us to unwind the writable unmap check
unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
validation unconditionally also.
We move a number of things here:
1. We preallocate memory for the iterator before we call the file-backed
memory hook, allowing us to exit early and avoid having to perform
complicated and error-prone close/free logic. We carefully free
iterator state on both success and error paths.
2. The enclosing mmap_region() function handles the mapping_map_writable()
logic early. Previously the logic had the mapping_map_writable() at the
point of mapping a newly allocated file-backed VMA, and a matching
mapping_unmap_writable() on success and error paths.
We now do this unconditionally if this is a file-backed, shared writable
mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however
doing so does not invalidate the seal check we just performed, and we in
any case always decrement the counter in the wrapper.
We perform a debug assert to ensure a driver does not attempt to do the
opposite.
3. We also move arch_validate_flags() up into the mmap_region()
function. This is only relevant on arm64 and sparc64, and the check is
only meaningful for SPARC with ADI enabled. We explicitly add a warning
for this arch if a driver invalidates this check, though the code ought
eventually to be fixed to eliminate the need for this.
With all of these measures in place, we no longer need to explicitly close
the VMA on error paths, as we place all checks which might fail prior to a
call to any driver mmap hook.
This eliminates an entire class of errors, makes the code easier to reason
about and more robust.
Link: https://lkml.kernel.org/r/6e0becb36d2f5472053ac5d544c0edfe9b899e25.1730224667.git.lorenzo.stoakes@oracle.com
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently MTE is permitted in two circumstances (desiring to use MTE
having been specified by the VM_MTE flag) - where MAP_ANONYMOUS is
specified, as checked by arch_calc_vm_flag_bits() and actualised by
setting the VM_MTE_ALLOWED flag, or if the file backing the mapping is
shmem, in which case we set VM_MTE_ALLOWED in shmem_mmap() when the mmap
hook is activated in mmap_region().
The function that checks that, if VM_MTE is set, VM_MTE_ALLOWED is also
set is the arm64 implementation of arch_validate_flags().
Unfortunately, we intend to refactor mmap_region() to perform this check
earlier, meaning that in the case of a shmem backing we will not have
invoked shmem_mmap() yet, causing the mapping to fail spuriously.
It is inappropriate to set this architecture-specific flag in general mm
code anyway, so a sensible resolution of this issue is to instead move the
check somewhere else.
We resolve this by setting VM_MTE_ALLOWED much earlier in do_mmap(), via
the arch_calc_vm_flag_bits() call.
This is an appropriate place to do this as we already check for the
MAP_ANONYMOUS case here, and the shmem file case is simply a variant of
the same idea - we permit RAM-backed memory.
This requires a modification to the arch_calc_vm_flag_bits() signature to
pass in a pointer to the struct file associated with the mapping, however
this is not too egregious as this is only used by two architectures anyway
- arm64 and parisc.
So this patch performs this adjustment and removes the unnecessary
assignment of VM_MTE_ALLOWED in shmem_mmap().
[akpm@linux-foundation.org: fix whitespace, per Catalin]
Link: https://lkml.kernel.org/r/ec251b20ba1964fb64cf1607d2ad80c47f3873df.1730224667.git.lorenzo.stoakes@oracle.com
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|