Patch series "Support large folios for tmpfs", v3.
Traditionally, tmpfs only supported PMD-sized large folios. However,
nowadays, with other file systems supporting any sized large folios and
anonymous memory being extended to support mTHP, we should not restrict
tmpfs to allocating only PMD-sized large folios, which makes it more
special. Instead, we should allow tmpfs to allocate any sized large folios.
Considering that tmpfs already has the 'huge=' option to control the
PMD-sized large folios allocation, we can extend the 'huge=' option to
allow any sized large folios. The semantics of the 'huge=' mount option
are:
huge=never: do not allocate any sized large folios
huge=always: allocate any sized large folios
huge=within_size: like 'always', but respect the i_size
huge=advise: like 'always' if requested with madvise()
Note: for tmpfs mmap() faults, due to the lack of a write size hint,
PMD-sized large folios are still allocated if huge=always/within_size/advise
is set.
Moreover, the 'deny' and 'force' testing options controlled by
'/sys/kernel/mm/transparent_hugepage/shmem_enabled' retain the same
semantics: 'deny' disables any sized large folios for tmpfs, while 'force'
enables PMD-sized large folios for tmpfs.
This patch (of 6):
Factor out the order calculation into a new helper, which can be reused by
shmem in the following patch.
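A hedged sketch of what such a helper can look like, mirroring the existing
write-size-to-order logic in the page cache (the helper name used here is
illustrative, not necessarily the one introduced by the patch):
static inline unsigned int write_size_to_order(size_t size)
{
	unsigned int shift = ilog2(size);

	if (shift <= PAGE_SHIFT)
		return 0;	/* the write fits in a single page */
	return shift - PAGE_SHIFT;
}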
Link: https://lkml.kernel.org/r/cover.1732779148.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/5505f9ea50942820c1924d1803bfdd3a524e54f6.1732779148.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
kasan_record_aux_stack_noalloc() was introduced to record a stack trace
without allocating memory in the process. It has been added to callers
which were invoked while a raw_spinlock_t was held. More and more callers
were identified and changed over time. Is it a good thing to have this
while functions try their best to do a lockless setup? The only
downside of having kasan_record_aux_stack() not allocate any memory is
that we end up without a stacktrace if stackdepot runs out of memory and
the same stacktrace was not recorded before. To quote Marco Elver from
https://lore.kernel.org/all/CANpmjNPmQYJ7pv1N3cuU8cP18u7PP_uoZD8YxwZd4jtbof9nVQ@mail.gmail.com/
| I'd be in favor, it simplifies things. And stack depot should be
| able to replenish its pool sufficiently in the "non-aux" cases
| i.e. regular allocations. Worst case we fail to record some
| aux stacks, but I think that's only really bad if there's a bug
| around one of these allocations. In general the probabilities
| of this being a regression are extremely small [...]
Make the kasan_record_aux_stack_noalloc() behaviour the default for
kasan_record_aux_stack().
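A hedged sketch of the resulting default (the internal helper and its flag
argument are simplified from the generic KASAN implementation and may differ
from the actual diff):
void kasan_record_aux_stack(void *addr)
{
	/* always record without allowing stack depot to allocate new pools */
	return __kasan_record_aux_stack(addr, 0);
}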
[bigeasy@linutronix.de: dressed the diff as patch]
Link: https://lkml.kernel.org/r/20241122155451.Mb2pmeyJ@linutronix.de
Fixes: 7cb3007ce2da ("kasan: generic: introduce kasan_record_aux_stack_noalloc()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reported-by: syzbot+39f85d612b7c20d8db48@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67275485.050a0220.3c8d68.0a37.GAE@google.com
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Marco Elver <elver@google.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: syzkaller-bugs@googlegroups.com
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
s/equivalend/equivalent/
Link: https://lkml.kernel.org/r/20241120105041.2394283-1-yikming2222@gmail.com
Signed-off-by: Chin Yik Ming <yikming2222@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Prefer 'unsigned int' over plain 'unsigned'. Also make it
consistent with mm/cma.c.
Link: https://lkml.kernel.org/r/tencent_1E5E3AA25C261196D8C1F7097F130E382008@qq.com
Signed-off-by: Jiale Yang <295107659@qq.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The generic ptep_get_and_clear() implementation is just a simple
combination of ptep_get() and pte_clear(). But for some architectures
(such as x86 and arm64), the hardware will modify the A/D bits of the
page table entry, so ptep_get_and_clear() needs to be overridden and
implemented as an atomic operation to avoid contention, which has a
performance cost.
Commit d283d422c6c4 ("x86: mm: add x86_64 support for page table
check") added ptep_clear() on x86 and made it call
ptep_get_and_clear() when CONFIG_PAGE_TABLE_CHECK is enabled. The page
table check feature does not actually care about the A/D bits, so only
ptep_get() + pte_clear() should be called. But considering that the page
table check is a debug option, this should not have much of an impact.
But then commit de8c8e52836d ("mm: page_table_check: add hooks to
public helpers") changed ptep_clear() to unconditionally call
ptep_get_and_clear(), so that the CONFIG_PAGE_TABLE_CHECK check could be
put into the page table check stubs (in include/linux/page_table_check.h).
This also causes a performance loss for kernels without
CONFIG_PAGE_TABLE_CHECK enabled, which doesn't make sense.
Currently ptep_clear() is only used in debug code and in the khugepaged
collapse paths, which are fairly expensive, so the cost of an extra atomic
RMW operation does not matter. But it may be used for other paths in the
future. After all, for a present pte entry, we need to call ptep_clear()
instead of pte_clear() to ensure that PAGE_TABLE_CHECK works properly.
So, to be more precise, just call ptep_get() and pte_clear() in
ptep_clear().
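A sketch of what the resulting generic helper can look like (simplified; the
page_table_check hook shown here is an assumption about how the check keeps
seeing the cleared entry):
static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
			      pte_t *ptep)
{
	pte_t pte = ptep_get(ptep);

	pte_clear(mm, addr, ptep);
	page_table_check_pte_clear(mm, pte);
}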
Link: https://lkml.kernel.org/r/20241122073652.54030-1-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Compiled binary files should be added to .gitignore
'git status' complains:
Untracked files:
(use "git add <file>..." to include in what will be committed)
mm/hugetlb_dio
mm/pkey_sighandler_tests_32
mm/pkey_sighandler_tests_64
Link: https://lkml.kernel.org/r/20241125064036.413536-1-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The return statement at the end of void function is unnecessary. Just
remove it as part of cleanup.
Link: https://lkml.kernel.org/r/20241122173558.20670-1-quic_pintu@quicinc.com
Signed-off-by: Pintu Kumar <quic_pintu@quicinc.com>
Cc: Pintu Agarwal <pintu.ping@gmail.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We are starting to deploy mmap_lock tracepoint monitoring across our
fleet and the early results showed that these tracepoints are consuming
a significant amount of CPU in kernfs_path_from_node() when enabled.
It seems like the kernel is trying to resolve the cgroup path in the
fast path of the locking code when the tracepoints are enabled. In
addition, for some applications, their metrics regress when
monitoring is enabled.
The cgroup path resolution can be slow and should not be done in the
fast path. Most userspace tools, like bpftrace, provide functionality
to get the cgroup path from the cgroup id, so let's just trace the cgroup
id and let users use better tools to get the path in the slow path.
Link: https://lkml.kernel.org/r/20241125171617.113892-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damos_set_effective_quota() checks quota conditions, but there are some
duplicate checks for quota->goals inside.
This patch removes one of the if statements to simplify the esz calculation
logic by setting esz to ULONG_MAX by default, as sketched below.
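A minimal sketch of the idea with hypothetical limit parameters (the real
calculation lives in mm/damon/core.c and derives the limits from the quota's
goals, time window and size):
static unsigned long effective_sz_limit(unsigned long goal_limit, bool has_goals,
					unsigned long time_limit, bool has_time,
					unsigned long sz_limit)
{
	unsigned long esz = ULONG_MAX;	/* default: unlimited */

	if (has_goals)
		esz = min(esz, goal_limit);
	if (has_time)
		esz = min(esz, time_limit);
	if (sz_limit)
		esz = min(esz, sz_limit);
	return esz;
}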
Link: https://lkml.kernel.org/r/20241125184307.41746-1-sj@kernel.org
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since slab does not use the page refcount, it can allocate and free frozen
pages, saving one atomic operation per free.
Link: https://lkml.kernel.org/r/20241125210149.2976098-16-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Provide an interface to allocate pages from the page allocator without
incrementing their refcount. This saves an atomic operation on free,
which may be beneficial to some users (eg slab).
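A hedged usage sketch of the new interface (names as used in this series;
these helpers are MM-internal, not a driver-facing API):
static bool frozen_page_roundtrip(unsigned int order)
{
	struct page *page = alloc_frozen_pages(GFP_KERNEL, order);

	if (!page)
		return false;
	/* the caller owns the page outright; nothing else takes a reference */
	free_frozen_pages(page, order);	/* frees without an atomic refcount drop */
	return true;
}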
Link: https://lkml.kernel.org/r/20241125210149.2976098-15-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Defer the initialisation of the page refcount to the new __alloc_pages()
wrapper and turn the old __alloc_pages() into __alloc_frozen_pages().
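A simplified sketch of the described split (alloc-tagging wrappers, profiling
suffixes and cold-path details omitted):
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
			   nodemask_t *nodemask)
{
	struct page *page;

	/* the frozen variant returns pages with their refcount still at zero */
	page = __alloc_frozen_pages(gfp, order, preferred_nid, nodemask);
	if (page)
		set_page_refcounted(page);
	return page;
}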
Link: https://lkml.kernel.org/r/20241125210149.2976098-14-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove some code duplication by calling set_page_refcounted() at the end
of __alloc_pages() instead of after each call that can allocate a page.
That means that we free a frozen page if we've exceeded the allowed memcg
memory.
Link: https://lkml.kernel.org/r/20241125210149.2976098-13-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In preparation for allocating frozen pages, stop initialising the page
refcount in __alloc_pages_slowpath().
Link: https://lkml.kernel.org/r/20241125210149.2976098-12-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
__alloc_pages_direct_reclaim()
In preparation for allocating frozen pages, stop initialising the page
refcount in __alloc_pages_direct_reclaim().
Link: https://lkml.kernel.org/r/20241125210149.2976098-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
__alloc_pages_direct_compact()
In preparation for allocating frozen pages, stop initialising the page
refcount in __alloc_pages_direct_compact().
Link: https://lkml.kernel.org/r/20241125210149.2976098-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In preparation for allocating frozen pages, stop initialising the page
refcount in __alloc_pages_may_oom().
Link: https://lkml.kernel.org/r/20241125210149.2976098-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
__alloc_pages_cpuset_fallback()
In preparation for allocating frozen pages, stop initialising the page
refcount in __alloc_pages_cpuset_fallback().
Link: https://lkml.kernel.org/r/20241125210149.2976098-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In preparation for allocating frozen pages, stop initialising the page
refcount in get_page_from_freelist().
Link: https://lkml.kernel.org/r/20241125210149.2976098-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In preparation for allocating frozen pages, stop initialising the page
refcount in prep_new_page().
Link: https://lkml.kernel.org/r/20241125210149.2976098-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In preparation for allocating frozen pages, stop initialising the page
refcount in post_alloc_hook().
Link: https://lkml.kernel.org/r/20241125210149.2976098-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We already have the concept of "frozen pages" (eg page_ref_freeze()), so
let's not complicate things by also having the concept of "unref pages".
Link: https://lkml.kernel.org/r/20241125210149.2976098-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All callers outside mempolicy.c now use folio_alloc_mpol() thanks to
Kefeng's cleanups, so we can remove this as a visible symbol.
And also remove the alloc_hooks for alloc_pages_mpol(), since all users
in mempolicy.c are using the nonprof version.
Link: https://lkml.kernel.org/r/20241125210149.2976098-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Allocate and free frozen pages", v3.
Slab does not need to use the page refcount at all, and it can avoid an
atomic operation on page free. Hugetlb wants to delay setting the
refcount until it has assembled a complete gigantic page. We already have
the ability to freeze a page (safely reduce its reference count to 0), so
this patchset adds APIs to allocate and free pages which are in a frozen
state.
This patchset is also a step towards the Glorious Future in which struct
page doesn't have a refcount; the users which need a refcount will have
one in their per-allocation memdesc.
This patch (of 15):
Save 17 bytes of text by calculating page_zone() once instead of twice.
Link: https://lkml.kernel.org/r/20241125210149.2976098-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20241125210149.2976098-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit ee86814b0562 ("mm/migrate: move NUMA hinting fault folio isolation
+ checks under PTL") removed the code that had used the vma argument in
migrate_misplaced_folio.
Since the vma argument is no longer used in migrate_misplaced_folio(), this
patch removes it.
Link: https://lkml.kernel.org/r/20241126155655.466186-1-donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This function has been able to return LRU_STOP since commit b49547ade38a
("mm/zswap: stop lru list shrinking when encounter warm region"). To
reduce confusion, update the comment to also list LRU_STOP as an option.
Link: https://lkml.kernel.org/r/20241127-lru-stop-comment-v1-1-f54a7cba9429@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The loop condition makes sure (mas.last < max), so we can directly use
mas_next_slot() here.
Since there are no other users of mas_next_entry(), it is removed.
Link: https://lkml.kernel.org/r/20241125024156.26093-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since commit 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore") the
number of softlockups in __read_vmcore at kdump time has gone down, but
they still happen sometimes.
In a memory constrained environment like the kdump image, a softlockup is
not just a harmless message, but it can interfere with things like RCU
freeing memory, causing the crashdump to get stuck.
The second loop in __read_vmcore has a lot more opportunities for natural
sleep points, like scheduling out while waiting for a data write to
happen, but apparently that is not always enough.
Add a cond_resched() to the second loop in __read_vmcore to (hopefully)
get rid of the softlockups.
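A minimal illustration of the pattern (not the actual fs/proc/vmcore.c code):
calling cond_resched() once per iteration of a long-running copy loop lets
other work, such as RCU callbacks freeing memory, make progress in the
memory-constrained kdump kernel.
static void copy_many_chunks(unsigned long nr_chunks)
{
	unsigned long i;

	for (i = 0; i < nr_chunks; i++) {
		/* ... copy one chunk to the user buffer ... */
		cond_resched();
	}
}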
Link: https://lkml.kernel.org/r/20250110102821.2a37581b@fangorn
Fixes: 5cbcb62dddf5 ("fs/proc: fix softlockup in __read_vmcore")
Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-by: Breno Leitao <leitao@debian.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We only need to assert that the uptodate flag is clear if we're going to
set it. This hasn't been a problem before now because we have only used
folio_end_read() when completing with an error, but it's convenient to use
it in squashfs if we discover the folio is already uptodate.
Link: https://lkml.kernel.org/r/20250110163300.3346321-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When MGLRU is enabled, the pgdemote_kswapd, pgdemote_direct, and
pgdemote_khugepaged stats in vmstat are not being updated.
Commit f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA
balancing operations") moved the pgdemote vmstat update from
demote_folio_list() to shrink_inactive_list(), which is in the normal LRU
path. As a result, the pgdemote stats are updated correctly for the
normal LRU but not for MGLRU.
To address this, we have added the pgdemote stat update in the
evict_folios() function, which is in the MGLRU path. With this patch, the
pgdemote stats will now be updated correctly when MGLRU is enabled.
Without this patch vmstat output when MGLRU is enabled
======================================================
pgdemote_kswapd 0
pgdemote_direct 0
pgdemote_khugepaged 0
With this patch vmstat output when MGLRU is enabled
===================================================
pgdemote_kswapd 43234
pgdemote_direct 4691
pgdemote_khugepaged 0
Link: https://lkml.kernel.org/r/20250109060540.451261-1-donettom@linux.ibm.com
Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Tested-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kaiyang Zhao <kaiyang2@cs.cmu.edu>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Wei Xu <weixugc@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The upstream commit adcfb264c3ed ("vmstat: disable vmstat_work on
vmstat_cpu_down_prep()") introduced another warning during the boot phase,
so it was soon reverted upstream by commit cd6313beaeae ("Revert "vmstat:
disable vmstat_work on vmstat_cpu_down_prep()""). This commit resolves that
and reattempts the original fix.
Even after the mm/vmstat:online teardown, shepherd may still queue work for
the dying cpu until the cpu is removed from the online mask. While it's quite
rare, this means that after unbind_workers() unbinds a per-cpu kworker, it
can potentially run vmstat_update for the dying CPU on an irrelevant cpu
before entering the atomic AP states. When CONFIG_DEBUG_PREEMPT=y, it results
in the following error with the backtrace.
BUG: using smp_processor_id() in preemptible [00000000] code: \
kworker/7:3/1702
caller is refresh_cpu_vm_stats+0x235/0x5f0
CPU: 0 UID: 0 PID: 1702 Comm: kworker/7:3 Tainted: G
Tainted: [N]=TEST
Workqueue: mm_percpu_wq vmstat_update
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xb0
check_preemption_disabled+0xce/0xe0
refresh_cpu_vm_stats+0x235/0x5f0
vmstat_update+0x17/0xa0
process_one_work+0x869/0x1aa0
worker_thread+0x5e5/0x1100
kthread+0x29e/0x380
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
So, for mm/vmstat:online, disable vmstat_work reliably on teardown and
symmetrically enable it on startup.
For secondary CPUs during CPU hotplug scenarios, ensure the delayed work
is disabled immediately after the initialization. These CPUs are not yet
online when start_shepherd_timer() runs on boot CPU. vmstat_cpu_online()
will enable the work for them.
Link: https://lkml.kernel.org/r/20250108042807.3429745-1-koichiro.den@canonical.com
Signed-off-by: Huacai Chen <chenhuacai@kernel.org>
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Suggested-by: Huacai Chen <chenhuacai@kernel.org>
Tested-by: Charalampos Mitrodimas <charmitro@posteo.net>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If zram_meta_alloc() failed early, it frees the allocated zram->table without
setting it to NULL. This can potentially cause zram_meta_free() to access
the table if a user resets a failed and uninitialized device.
Link: https://lkml.kernel.org/r/20250107065446.86928-1-ryncsn@gmail.com
Fixes: 74363ec674cb ("zram: fix uninitialized ZRAM not releasing backing device")
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
After commit b1f202060afe ("mm: remap unused subpages to shared zeropage
when splitting isolated thp"), cow test cases involving swapping out THPs
via madvise(MADV_PAGEOUT) started to be skipped due to the subsequent
check via pagemap determining that the memory was not actually swapped
out. Logs similar to this were emitted:
...
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (16 kB)
ok 2 # SKIP MADV_PAGEOUT did not work, is swap enabled?
# [RUN] Basic COW after fork() ... with single PTE of swapped-out THP (16 kB)
ok 3 # SKIP MADV_PAGEOUT did not work, is swap enabled?
# [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP (32 kB)
ok 4 # SKIP MADV_PAGEOUT did not work, is swap enabled?
...
The commit in question introduces the behaviour of scanning THPs and, if
their content is predominantly zero, splitting them and replacing the pages
which are wholly zero with the shared zeropage. These cow test cases were
getting caught up in this.
So let's avoid that by filling the contents of all allocated memory with
a non-zero value. With this in place, the tests are passing again.
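An illustrative standalone helper showing the approach (not the exact
selftest diff, which adjusts the existing allocation helpers in the cow
test):
#include <string.h>
#include <sys/mman.h>

static char *alloc_nonzero(size_t size)
{
	char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED)
		return NULL;
	/* any non-zero pattern prevents the pages from being treated as zero-filled */
	memset(mem, 0x5a, size);
	return mem;
}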
Link: https://lkml.kernel.org/r/20250107142555.1870101-1-ryan.roberts@arm.com
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When mremap()ing a memory region previously registered with userfaultfd as
write-protected but without UFFD_FEATURE_EVENT_REMAP, an inconsistency in
flag clearing leads to a mismatch between the vma flags (which have
uffd-wp cleared) and the pte/pmd flags (which do not have uffd-wp
cleared). This mismatch causes a subsequent mprotect(PROT_WRITE) to
trigger a warning in page_table_check_pte_flags() due to setting the pte
to writable while uffd-wp is still set.
Fix this by always explicitly clearing the uffd-wp pte/pmd flags on any
such mremap() so that the values are consistent with the existing clearing
of VM_UFFD_WP. Be careful to clear the logical flag regardless of its
physical form: a PTE bit, a swap PTE bit, or a PTE marker. Cover the PTE,
huge PMD and hugetlb paths.
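A hedged sketch of the core idea (hypothetical helper; the actual fix adjusts
move_ptes() and the huge PMD/hugetlb paths in mm/mremap.c):
static pte_t clear_uffd_wp_state(pte_t pte)
{
	if (pte_present(pte))
		return pte_clear_uffd_wp(pte);
	/* swap entries and PTE markers carry the bit in the swap PTE format */
	return pte_swp_clear_uffd_wp(pte);
}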
Link: https://lkml.kernel.org/r/20250107144755.1871363-2-ryan.roberts@arm.com
Co-developed-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Closes: https://lore.kernel.org/linux-mm/810b44a8-d2ae-4107-b665-5a42eae2d948@arm.com/
Fixes: 63b2d4174c4a ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl")
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A livepatch module can contain a special relocation section
.klp.rela.<objname>.<secname> to apply its relocations at the appropriate
time and to additionally access local and unexported symbols. When
<objname> points to another module, such relocations are processed
separately from the regular module relocation process. For instance, only
when the target <objname> actually becomes loaded.
With CONFIG_STRICT_MODULE_RWX, when the livepatch core decides to apply
these relocations, their processing results in the following bug:
[ 25.827238] BUG: unable to handle page fault for address: 00000000000012ba
[ 25.827819] #PF: supervisor read access in kernel mode
[ 25.828153] #PF: error_code(0x0000) - not-present page
[ 25.828588] PGD 0 P4D 0
[ 25.829063] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 25.829742] CPU: 2 UID: 0 PID: 452 Comm: insmod Tainted: G O K 6.13.0-rc4-00078-g059dd502b263 #7820
[ 25.830417] Tainted: [O]=OOT_MODULE, [K]=LIVEPATCH
[ 25.830768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014
[ 25.831651] RIP: 0010:memcmp+0x24/0x60
[ 25.832190] Code: [...]
[ 25.833378] RSP: 0018:ffffa40b403a3ae8 EFLAGS: 00000246
[ 25.833637] RAX: 0000000000000000 RBX: ffff93bc81d8e700 RCX: ffffffffc0202000
[ 25.834072] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 00000000000012ba
[ 25.834548] RBP: ffffa40b403a3b68 R08: ffffa40b403a3b30 R09: 0000004a00000002
[ 25.835088] R10: ffffffffffffd222 R11: f000000000000000 R12: 0000000000000000
[ 25.835666] R13: ffffffffc02032ba R14: ffffffffc007d1e0 R15: 0000000000000004
[ 25.836139] FS: 00007fecef8c3080(0000) GS:ffff93bc8f900000(0000) knlGS:0000000000000000
[ 25.836519] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 25.836977] CR2: 00000000000012ba CR3: 0000000002f24000 CR4: 00000000000006f0
[ 25.837442] Call Trace:
[ 25.838297] <TASK>
[ 25.841083] __write_relocate_add.constprop.0+0xc7/0x2b0
[ 25.841701] apply_relocate_add+0x75/0xa0
[ 25.841973] klp_write_section_relocs+0x10e/0x140
[ 25.842304] klp_write_object_relocs+0x70/0xa0
[ 25.842682] klp_init_object_loaded+0x21/0xf0
[ 25.842972] klp_enable_patch+0x43d/0x900
[ 25.843572] do_one_initcall+0x4c/0x220
[ 25.844186] do_init_module+0x6a/0x260
[ 25.844423] init_module_from_file+0x9c/0xe0
[ 25.844702] idempotent_init_module+0x172/0x270
[ 25.845008] __x64_sys_finit_module+0x69/0xc0
[ 25.845253] do_syscall_64+0x9e/0x1a0
[ 25.845498] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 25.846056] RIP: 0033:0x7fecef9eb25d
[ 25.846444] Code: [...]
[ 25.847563] RSP: 002b:00007ffd0c5d6de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 25.848082] RAX: ffffffffffffffda RBX: 000055b03f05e470 RCX: 00007fecef9eb25d
[ 25.848456] RDX: 0000000000000000 RSI: 000055b001e74e52 RDI: 0000000000000003
[ 25.848969] RBP: 00007ffd0c5d6ea0 R08: 0000000000000040 R09: 0000000000004100
[ 25.849411] R10: 00007fecefac7b20 R11: 0000000000000246 R12: 000055b001e74e52
[ 25.849905] R13: 0000000000000000 R14: 000055b03f05e440 R15: 0000000000000000
[ 25.850336] </TASK>
[ 25.850553] Modules linked in: deku(OK+) uinput
[ 25.851408] CR2: 00000000000012ba
[ 25.852085] ---[ end trace 0000000000000000 ]---
The problem is that the .klp.rela.<objname>.<secname> relocations are
processed after the module has already been formed and mod->rw_copy has been
reset. However, the code in __write_relocate_add() calls
module_writable_address(), which still translates the target address 'loc'
to 'loc + (mem->rw_copy - mem->base)', with mem->rw_copy now being 0.
Fix the problem by returning 'loc' directly from module_writable_address()
when the module is already formed. Function __write_relocate_add() knows
to use text_poke() in such a case.
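A hedged sketch of the described fix (simplified; the inner helper name is an
assumption and details may differ from the actual diff):
static inline void *module_writable_address(struct module *mod, void *loc)
{
	if (!IS_ENABLED(CONFIG_ARCH_HAS_EXECMEM_ROX) || !mod ||
	    mod->state != MODULE_STATE_UNFORMED)
		return loc;	/* text is live; callers fall back to text_poke() */
	return __module_writable_address(mod, loc);	/* translate into the RW copy */
}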
Link: https://lkml.kernel.org/r/20250107153507.14733-1-petr.pavlu@suse.com
Fixes: 0c133b1e78cd ("module: prepare to handle ROX allocations for text")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reported-by: Marek Maslanka <mmaslanka@google.com>
Closes: https://lore.kernel.org/linux-modules/CAGcaFA2hdThQV6mjD_1_U+GNHThv84+MQvMWLgEuX+LVbAyDxg@mail.gmail.com/
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In zswap_compress() and zswap_decompress(), the per-CPU acomp_ctx of the
current CPU at the beginning of the operation is retrieved and used
throughout. However, since neither preemption nor migration are disabled,
it is possible that the operation continues on a different CPU.
If the original CPU is hotunplugged while the acomp_ctx is still in use,
we run into a UAF bug as some of the resources attached to the acomp_ctx
are freed during hotunplug in zswap_cpu_comp_dead() (i.e.
acomp_ctx.buffer, acomp_ctx.req, or acomp_ctx.acomp).
The problem was introduced in commit 1ec3b5fe6eec ("mm/zswap: move to use
crypto_acomp API for hardware acceleration") when the switch to the
crypto_acomp API was made. Prior to that, the per-CPU crypto_comp was
retrieved using get_cpu_ptr() which disables preemption and makes sure the
CPU cannot go away from under us. Preemption cannot be disabled with the
crypto_acomp API as a sleepable context is needed.
Use the acomp_ctx.mutex to synchronize CPU hotplug callbacks allocating
and freeing resources with compression/decompression paths. Make sure
that acomp_ctx.req is NULL when the resources are freed. In the
compression/decompression paths, check if acomp_ctx.req is NULL after
acquiring the mutex (meaning the CPU was offlined) and retry on the new
CPU.
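A hedged sketch of the described synchronization (simplified relative to
mm/zswap.c; field names follow the description above):
static struct crypto_acomp_ctx *acomp_ctx_get_cpu_lock(struct zswap_pool *pool)
{
	struct crypto_acomp_ctx *acomp_ctx;

	for (;;) {
		acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
		mutex_lock(&acomp_ctx->mutex);
		if (likely(acomp_ctx->req))
			return acomp_ctx;	/* valid for as long as the mutex is held */
		/* the CPU went offline and its resources were freed; retry */
		mutex_unlock(&acomp_ctx->mutex);
	}
}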
The initialization of acomp_ctx.mutex is moved from the CPU hotplug
callback to the pool initialization where it belongs (where the mutex is
allocated). In addition to adding clarity, this makes sure that CPU
hotplug cannot reinitialize a mutex that is already locked by
compression/decompression.
Previously a fix was attempted by holding cpus_read_lock() [1]. This
would have caused a potential deadlock as it is possible for code already
holding the lock to fall into reclaim and enter zswap (causing a
deadlock). A fix was also attempted using SRCU for synchronization, but
Johannes pointed out that synchronize_srcu() cannot be used in CPU hotplug
notifiers [2].
Alternative fixes that were considered/attempted and could have worked:
- Refcounting the per-CPU acomp_ctx. This involves complexity in
handling the race between the refcount dropping to zero in
zswap_[de]compress() and the refcount being re-initialized when the
CPU is onlined.
- Disabling migration before getting the per-CPU acomp_ctx [3], but
that's discouraged and is a much bigger hammer than needed, and could
result in subtle performance issues.
[1]https://lkml.kernel.org/20241219212437.2714151-1-yosryahmed@google.com/
[2]https://lkml.kernel.org/20250107074724.1756696-2-yosryahmed@google.com/
[3]https://lkml.kernel.org/20250107222236.2715883-2-yosryahmed@google.com/
[yosryahmed@google.com: remove comment]
Link: https://lkml.kernel.org/r/CAJD7tkaxS1wjn+swugt8QCvQ-rVF5RZnjxwPGX17k8x9zSManA@mail.gmail.com
Link: https://lkml.kernel.org/r/20250108222441.3622031-1-yosryahmed@google.com
Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Closes: https://lore.kernel.org/lkml/20241113213007.GB1564047@cmpxchg.org/
Reported-by: Sam Sun <samsun1006219@gmail.com>
Closes: https://lore.kernel.org/lkml/CAEkJfYMtSdM5HceNsXUDf5haghD5+o2e7Qv4OcuruL4tPg6OaQ@mail.gmail.com/
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This reverts commit eaebeb93922ca6ab0dd92027b73d0112701706ef.
Commit eaebeb93922c ("mm: zswap: fix race between [de]compression and CPU
hotunplug") used the CPU hotplug lock in zswap compress/decompress
operations to protect against a race with CPU hotunplug making some
per-CPU resources go away.
However, zswap compress/decompress can be reached through reclaim while
the lock is held, resulting in a potential deadlock as reported by syzbot:
======================================================
WARNING: possible circular locking dependency detected
6.13.0-rc6-syzkaller-00006-g5428dc1906dd #0 Not tainted
------------------------------------------------------
kswapd0/89 is trying to acquire lock:
ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: acomp_ctx_get_cpu mm/zswap.c:886 [inline]
ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_compress mm/zswap.c:908 [inline]
ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_store_page mm/zswap.c:1439 [inline]
ffffffff8e7d2ed0 (cpu_hotplug_lock){++++}-{0:0}, at: zswap_store+0xa74/0x1ba0 mm/zswap.c:1546
but task is already holding lock:
ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6871 [inline]
ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xb58/0x2f30 mm/vmscan.c:7253
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
__fs_reclaim_acquire mm/page_alloc.c:3853 [inline]
fs_reclaim_acquire+0x88/0x130 mm/page_alloc.c:3867
might_alloc include/linux/sched/mm.h:318 [inline]
slab_pre_alloc_hook mm/slub.c:4070 [inline]
slab_alloc_node mm/slub.c:4148 [inline]
__kmalloc_cache_node_noprof+0x40/0x3a0 mm/slub.c:4337
kmalloc_node_noprof include/linux/slab.h:924 [inline]
alloc_worker kernel/workqueue.c:2638 [inline]
create_worker+0x11b/0x720 kernel/workqueue.c:2781
workqueue_prepare_cpu+0xe3/0x170 kernel/workqueue.c:6628
cpuhp_invoke_callback+0x48d/0x830 kernel/cpu.c:194
__cpuhp_invoke_callback_range kernel/cpu.c:965 [inline]
cpuhp_invoke_callback_range kernel/cpu.c:989 [inline]
cpuhp_up_callbacks kernel/cpu.c:1020 [inline]
_cpu_up+0x2b3/0x580 kernel/cpu.c:1690
cpu_up+0x184/0x230 kernel/cpu.c:1722
cpuhp_bringup_mask+0xdf/0x260 kernel/cpu.c:1788
cpuhp_bringup_cpus_parallel+0xf9/0x160 kernel/cpu.c:1878
bringup_nonboot_cpus+0x2b/0x50 kernel/cpu.c:1892
smp_init+0x34/0x150 kernel/smp.c:1009
kernel_init_freeable+0x417/0x5d0 init/main.c:1569
kernel_init+0x1d/0x2b0 init/main.c:1466
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3161 [inline]
check_prevs_add kernel/locking/lockdep.c:3280 [inline]
validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
__lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
cpus_read_lock+0x42/0x150 kernel/cpu.c:490
acomp_ctx_get_cpu mm/zswap.c:886 [inline]
zswap_compress mm/zswap.c:908 [inline]
zswap_store_page mm/zswap.c:1439 [inline]
zswap_store+0xa74/0x1ba0 mm/zswap.c:1546
swap_writepage+0x647/0xce0 mm/page_io.c:279
shmem_writepage+0x1248/0x1610 mm/shmem.c:1579
pageout mm/vmscan.c:696 [inline]
shrink_folio_list+0x35ee/0x57e0 mm/vmscan.c:1374
shrink_inactive_list mm/vmscan.c:1967 [inline]
shrink_list mm/vmscan.c:2205 [inline]
shrink_lruvec+0x16db/0x2f30 mm/vmscan.c:5734
mem_cgroup_shrink_node+0x385/0x8e0 mm/vmscan.c:6575
mem_cgroup_soft_reclaim mm/memcontrol-v1.c:312 [inline]
memcg1_soft_limit_reclaim+0x346/0x810 mm/memcontrol-v1.c:362
balance_pgdat mm/vmscan.c:6975 [inline]
kswapd+0x17b3/0x2f30 mm/vmscan.c:7253
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(cpu_hotplug_lock);
lock(fs_reclaim);
rlock(cpu_hotplug_lock);
*** DEADLOCK ***
1 lock held by kswapd0/89:
#0: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6871 [inline]
#0: ffffffff8ea355a0 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xb58/0x2f30 mm/vmscan.c:7253
stack backtrace:
CPU: 0 UID: 0 PID: 89 Comm: kswapd0 Not tainted 6.13.0-rc6-syzkaller-00006-g5428dc1906dd #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074
check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206
check_prev_add kernel/locking/lockdep.c:3161 [inline]
check_prevs_add kernel/locking/lockdep.c:3280 [inline]
validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
__lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
cpus_read_lock+0x42/0x150 kernel/cpu.c:490
acomp_ctx_get_cpu mm/zswap.c:886 [inline]
zswap_compress mm/zswap.c:908 [inline]
zswap_store_page mm/zswap.c:1439 [inline]
zswap_store+0xa74/0x1ba0 mm/zswap.c:1546
swap_writepage+0x647/0xce0 mm/page_io.c:279
shmem_writepage+0x1248/0x1610 mm/shmem.c:1579
pageout mm/vmscan.c:696 [inline]
shrink_folio_list+0x35ee/0x57e0 mm/vmscan.c:1374
shrink_inactive_list mm/vmscan.c:1967 [inline]
shrink_list mm/vmscan.c:2205 [inline]
shrink_lruvec+0x16db/0x2f30 mm/vmscan.c:5734
mem_cgroup_shrink_node+0x385/0x8e0 mm/vmscan.c:6575
mem_cgroup_soft_reclaim mm/memcontrol-v1.c:312 [inline]
memcg1_soft_limit_reclaim+0x346/0x810 mm/memcontrol-v1.c:362
balance_pgdat mm/vmscan.c:6975 [inline]
kswapd+0x17b3/0x2f30 mm/vmscan.c:7253
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
Revert the change. A different fix for the race with CPU hotunplug will
follow.
Link: https://lkml.kernel.org/r/20250107222236.2715883-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sam Sun <samsun1006219@gmail.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
hugetlb_file_setup() will pass a NULL @dir to hugetlbfs_get_inode(), so we
will dereference a NULL pointer through @dir. Fix it by setting __entry->dr
to 0 if @dir is NULL. Because ->i_ino cannot be 0 (see get_next_ino()), there
is no confusion if the user sees a 0 inode number.
Link: https://lkml.kernel.org/r/20250106033118.4640-1-songmuchun@bytedance.com
Fixes: 318580ad7f28 ("hugetlbfs: support tracepoint")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reported-by: Cheung Wall <zzqq0103.hey@gmail.com>
Closes: https://lore.kernel.org/linux-mm/02858D60-43C1-4863-A84F-3C76A8AF1F15@linux.dev/T/#
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Cc: cheung wall <zzqq0103.hey@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
During testing it has been detected that it is possible to get a division by
zero error in bdi_set_min_bytes(). The error is caused by the function
bdi_ratio_from_pages(). bdi_ratio_from_pages() calls global_dirty_limits().
If the dirty threshold is 0, the division by zero is raised. This can happen
if the root user is setting:
echo 0 > /proc/sys/vm/dirty_ratio
The following is a test case:
echo 0 > /proc/sys/vm/dirty_ratio
cd /sys/class/bdi/<device>
echo 1 > strict_limit
echo 8192 > min_bytes
==> error is raised.
The problem is addressed by returning -EINVAL if dirty_ratio or
dirty_bytes is set to 0.
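A hedged sketch of the guard (hypothetical helper name and simplified scaling;
the real check sits in the ratio conversion in mm/page-writeback.c):
static long ratio_from_pages_sketch(unsigned long pages)
{
	unsigned long background_thresh, dirty_thresh;

	global_dirty_limits(&background_thresh, &dirty_thresh);
	if (!dirty_thresh)
		return -EINVAL;	/* dirty_ratio or dirty_bytes was set to 0 */
	return div64_u64(pages * 100ULL, dirty_thresh);	/* scaling simplified */
}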
[shr@devkernel.io: check for -EINVAL in bdi_set_min_bytes() and bdi_set_max_bytes()]
Link: https://lkml.kernel.org/r/20250108014723.166637-1-shr@devkernel.io
[shr@devkernel.io: v3]
Link: https://lkml.kernel.org/r/20250109063411.6591-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20250104012037.159386-1-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Reported-by: cheung wall <zzqq0103.hey@gmail.com>
Closes: https://lore.kernel.org/linux-mm/87pll35yd0.fsf@devkernel.io/T/#t
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Qiang Zhang <zzqq0103.hey@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The recently introduced ROX cache for modules assumes large page
support in 64-bit mode without testing the related feature bit. This
results in breakage when running as a Xen PV guest, as in this mode large
pages are not supported.
Fix that by testing the X86_FEATURE_PSE capability when deciding whether
to enable the ROX cache.
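A hedged sketch of the described guard (hypothetical helper name; the real
check lives in the x86 execmem setup):
static bool execmem_rox_cache_supported(void)
{
	/* Xen PV guests do not support large pages */
	return cpu_feature_enabled(X86_FEATURE_PSE);
}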
Link: https://lkml.kernel.org/r/20250103065631.26459-1-jgross@suse.com
Fixes: 2e45474ab14f ("execmem: add support for cache of large ROX pages")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On 32-bit kernels, folio_seek_hole_data() was inadvertently truncating a
64-bit value to 32 bits, leading to a possible infinite loop when writing
to an xfs filesystem.
Link: https://lkml.kernel.org/r/20250102190540.1356838-1-marco.nelissen@gmail.com
Fixes: 54fa39ac2e00 ("iomap: use mapping_seek_hole_data")
Signed-off-by: Marco Nelissen <marco.nelissen@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently the vma test is failing because of the new vma_assert_attached()
assertion. The check fails because the previous refcount_set() inside
vma_mark_attached() is a no-op. Fix the definition of atomic_set() to
correctly set the value of the atomic.
Link: https://lkml.kernel.org/r/20241227222220.1726384-1-surenb@google.com
Fixes: 9325b8b5a1cb ("tools: add skeleton code for userland testing of VMA logic")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit fa3bea4e1f82 introduced MPOL_WEIGHTED_INTERLEAVE but it missed
adding its counter to "interleave_hit" of numastat, which is located at
/sys/devices/system/node/nodeN/ directory.
It'd be better to add the weighted interleaving counter info to the existing
"interleave_hit" instead of introducing a new counter
"weighted_interleave_hit".
Link: https://lkml.kernel.org/r/20241227095737.645-1-honggyu.kim@sk.com
Fixes: fa3bea4e1f82 ("mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving")
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Hyeonggon Yoo <hyeonggon.yoo@sk.com>
Tested-by: Yunjeong Mun <yunjeong.mun@sk.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since commit bdf8eafbf7f5 ("arm64: stacktrace: report source of unwind
data") a stack trace line can contain an additional info field that was not
present before, in the form of one or more letters in parentheses. E.g.:
[ 504.517915] led_sysfs_enable+0x54/0x80 (P)
^^^
When this is present, decode_stacktrace decodes the line incorrectly:
[ 504.517915] led_sysfs_enable+0x54/0x80 P
Extend parsing to decode it correctly:
[ 504.517915] led_sysfs_enable (drivers/leds/led-core.c:455 (discriminator 7)) (P)
The regex to match such lines assumes the info field can be extended in the
future to other uppercase characters, and will need to be extended in case
other characters are used. Using a much more generic regex might incur
false positives, so this looked like a good tradeoff.
Link: https://lkml.kernel.org/r/20241230-decode_stacktrace-fix-info-v1-1-984910659173@bootlin.com
Fixes: bdf8eafbf7f5 ("arm64: stacktrace: report source of unwind data")
Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Puranjay Mohan <puranjay@kernel.org>
Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
kmemleak_alloc_percpu() gives an incorrect min_count parameter, causing
percpu memory to be considered a gray object.
Link: https://lkml.kernel.org/r/20241227092311.3572500-1-guoweikang.kernel@gmail.com
Fixes: 8c8685928910 ("mm/kmemleak: use IS_ERR_PCPU() for pointer in the percpu address space")
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Guo Weikang <guoweikang.kernel@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This reverts commit adcfb264c3ed51fbbf5068ddf10d309a63683868.
It turns out this just causes a different warning splat instead that
seems to be much easier to trigger, so let's revert ASAP.
Reported-and-bisected-by: Borislav Petkov <bp@alien8.de>
Tested-by: Breno Leitao <leitao@debian.org>
Reported-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/all/20250106131817.GAZ3vYGVr3-hWFFPLj@fat_crate.local/
Cc: Koichiro Den <koichiro.den@canonical.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix escaping of '$' in scripts/mksysmap
- Fix a modpost crash observed with the latest binutils
- Fix 'provides' in the linux-api-headers pacman package
* tag 'kbuild-fixes-v6.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: pacman-pkg: provide versioned linux-api-headers package
modpost: work around unaligned data access error
modpost: refactor do_vmbus_entry()
modpost: fix the missed iteration for the max bit in do_input()
scripts/mksysmap: Fix escape chars '$'
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"25 hotfixes. 16 are cc:stable. 18 are MM and 7 are non-MM.
The usual bunch of singletons and two doubletons - please see the
relevant changelogs for details"
* tag 'mm-hotfixes-stable-2025-01-04-18-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (25 commits)
MAINTAINERS: change Arınç ÜNAL's name and email address
scripts/sorttable: fix orc_sort_cmp() to maintain symmetry and transitivity
mm/util: make memdup_user_nul() similar to memdup_user()
mm, madvise: fix potential workingset node list_lru leaks
mm/damon/core: fix ignored quota goals and filters of newly committed schemes
mm/damon/core: fix new damon_target objects leaks on damon_commit_targets()
mm/list_lru: fix false warning of negative counter
vmstat: disable vmstat_work on vmstat_cpu_down_prep()
mm: shmem: fix the update of 'shmem_falloc->nr_unswapped'
mm: shmem: fix incorrect index alignment for within_size policy
percpu: remove intermediate variable in PERCPU_PTR()
mm: zswap: fix race between [de]compression and CPU hotunplug
ocfs2: fix slab-use-after-free due to dangling pointer dqi_priv
fs/proc/task_mmu: fix pagemap flags with PMD THP entries on 32bit
kcov: mark in_softirq_really() as __always_inline
docs: mm: fix the incorrect 'FileHugeMapped' field
mailmap: modify the entry for Mathieu Othacehe
mm/kmemleak: fix sleeping function called from invalid context at print message
mm: hugetlb: independent PMD page table shared count
maple_tree: reload mas before the second call for mas_empty_area
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"A randconfig build fix and a performance fix:
- Fix the CONFIG_RESET_CONTROLLER=n path signature of
clk_imx8mp_audiomix_reset_controller_register() to appease
randconfig
- Speed up the sdhci clk on TH1520 by a factor of 4 by adding
a fixed factor clk"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: clk-imx8mp-audiomix: fix function signature
clk: thead: Fix TH1520 emmc and shdci clock rate
|