Age | Commit message (Collapse) | Author |
|
Define ptdesc_t type which describes the basic page table descriptor
layout on arm64 platform. Subsequently all level specific pxxval_t
descriptors are derived from ptdesc_t thus establishing a common original
format, which can also be appropriate for page table entries, masks and
protection values etc which are used at all page table levels.
Link: https://lkml.kernel.org/r/20250407053113.746295-4-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Last argument in effective_prot() is u64 assuming pxd_val() returned value
(all page table levels) is 64 bit. pxd_val() is very platform specific
and its type should not be assumed in generic MM.
Split effective_prot() into individual page table level specific callbacks
which accepts corresponding pxd_t argument instead and then the
subscribing platform (only x86) just derive pxd_val() from the entries as
required and proceed as earlier.
Link: https://lkml.kernel.org/r/20250407053113.746295-3-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/ptdump: Drop assumption that pxd_val() is u64", v2.
Last argument passed down in note_page() is u64 assuming pxd_val()
returned value (all page table levels) is 64 bit - which might not be the
case going ahead when D128 page tables is enabled on arm64 platform.
Besides pxd_val() is very platform specific and its type should not be
assumed in generic MM. A similar problem exists for effective_prot(),
although it is restricted to x86 platform.
This series splits note_page() and effective_prot() into individual page
table level specific callbacks which accepts corresponding pxd_t page
table entry as an argument instead and later on all subscribing platforms
could derive pxd_val() from the table entries as required and proceed as
before.
Define ptdesc_t type which describes the basic page table descriptor
layout on arm64 platform. Subsequently all level specific pxxval_t
descriptors are derived from ptdesc_t thus establishing a common original
format, which can also be appropriate for page table entries, masks and
protection values etc which are used at all page table levels.
This patch (of 3):
Last argument passed down in note_page() is u64 assuming pxd_val()
returned value (all page table levels) is 64 bit - which might not be the
case going ahead when D128 page tables is enabled on arm64 platform.
Besides pxd_val() is very platform specific and its type should not be
assumed in generic MM.
Split note_page() into individual page table level specific callbacks
which accepts corresponding pxd_t argument instead and then subscribing
platforms just derive pxd_val() from the entries as required and proceed
as earlier.
Also add a note_page_flush() callback for flushing the last page table
page that was being handled earlier via level = -1.
Link: https://lkml.kernel.org/r/20250407053113.746295-1-anshuman.khandual@arm.com
Link: https://lkml.kernel.org/r/20250407053113.746295-2-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In cgroup v2, memory and hugetlb usage reparenting is asynchronous. This
can cause test flakiness when immediately asserting usage after deleting a
child cgroup. To address this, add a helper function
`assert_with_retry()` that checks usage values with a timeout-based retry.
This improves test stability without relying on fixed sleep delays.
Also bump up the tolerance size to 7MB.
To avoid False Positives:
...
# Assert memory charged correctly for child only use.
# actual a = 11 MB
# expected a = 0 MB
# fail
# cleanup
# [FAIL]
not ok 11 hugetlb_reparenting_test.sh -cgroup-v2 # exit=1
# 0
# SUMMARY: PASS=10 SKIP=0 FAIL=1
Link: https://lkml.kernel.org/r/20250407084201.74492-1-liwang@redhat.com
Signed-off-by: Li Wang <liwang@redhat.com>
Tested-by: Donet Tom <donettom@linux.ibm.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
find_suitable_fallback() is not as efficient as it could be, and somewhat
difficult to follow.
1. should_try_claim_block() is a loop invariant. There is no point in
checking fallback areas if the caller is interested in claimable
blocks but the order and the migratetype don't allow for that.
2. __rmqueue_steal() doesn't care about claimability, so it shouldn't
have to run those tests.
Different callers want different things from this helper:
1. __compact_finished() scans orders up until it finds a claimable block
2. __rmqueue_claim() scans orders down as long as blocks are claimable
3. __rmqueue_steal() doesn't care about claimability at all
Move should_try_claim_block() out of the loop. Only test it for the
two callers who care in the first place. Distinguish "no blocks" from
"order + mt are not claimable" in the return value; __rmqueue_claim()
can stop once order becomes unclaimable, __compact_finished() can keep
advancing until order becomes claimable.
Before:
Performance counter stats for './run case-lru-file-mmap-read' (5 runs):
85,294.85 msec task-clock # 5.644 CPUs utilized ( +- 0.32% )
15,968 context-switches # 187.209 /sec ( +- 3.81% )
153 cpu-migrations # 1.794 /sec ( +- 3.29% )
801,808 page-faults # 9.400 K/sec ( +- 0.10% )
733,358,331,786 instructions # 1.87 insn per cycle ( +- 0.20% ) (64.94%)
392,622,904,199 cycles # 4.603 GHz ( +- 0.31% ) (64.84%)
148,563,488,531 branches # 1.742 G/sec ( +- 0.18% ) (63.86%)
152,143,228 branch-misses # 0.10% of all branches ( +- 1.19% ) (62.82%)
15.1128 +- 0.0637 seconds time elapsed ( +- 0.42% )
After:
Performance counter stats for './run case-lru-file-mmap-read' (5 runs):
84,380.21 msec task-clock # 5.664 CPUs utilized ( +- 0.21% )
16,656 context-switches # 197.392 /sec ( +- 3.27% )
151 cpu-migrations # 1.790 /sec ( +- 3.28% )
801,703 page-faults # 9.501 K/sec ( +- 0.09% )
731,914,183,060 instructions # 1.88 insn per cycle ( +- 0.38% ) (64.90%)
388,673,535,116 cycles # 4.606 GHz ( +- 0.24% ) (65.06%)
148,251,482,143 branches # 1.757 G/sec ( +- 0.37% ) (63.92%)
149,766,550 branch-misses # 0.10% of all branches ( +- 1.22% ) (62.88%)
14.8968 +- 0.0486 seconds time elapsed ( +- 0.33% )
Link: https://lkml.kernel.org/r/20250407180154.63348-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Carlos Song <carlos.song@nxp.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
uprobe_write_opcode() does some pretty low-level things that really, it
shouldn't be doing: for example, manually breaking COW by allocating
anonymous folios and replacing mapped pages.
Further, it does seem to do some shaky things: for example, writing to
possible COW-shared anonymous pages or zapping anonymous pages that might
be pinned. We're also not taking care of uffd, uffd-wp, softdirty ...
although rather corner cases here. Let's just get it right like ordinary
ptrace writes would.
Let's rewrite the code, leaving COW-breaking to core-MM, triggered by
FOLL_FORCE|FOLL_WRITE (note that the code was already using FOLL_FORCE).
We'll use GUP to lookup/faultin the page and break COW if required. Then,
we'll walk the page tables using a folio_walk to perform our page
modification atomically by temporarily unmap the PTE + flushing the TLB.
Likely, we could avoid the temporary unmap in case we can just atomically
write the instruction, but that will be a separate project.
Unfortunately, we still have to implement the zapping logic manually,
because we only want to zap in specific circumstances (e.g., page content
identical).
Note that we can now handle large folios (compound pages) and the shared
zeropage just fine, so drop these checks.
Link: https://lkml.kernel.org/r/20250321113713.204682-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namhyung kim <namhyung@kernel.org>
Cc: Russel King <linux@armlinux.org.uk>
Cc: tongtiangen <tongtiangen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
uprobe_write_opcode()
We already have the VMA, no need to look it up using
get_user_page_vma_remote(). We can now switch to get_user_pages_remote().
Link: https://lkml.kernel.org/r/20250321113713.204682-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namhyung kim <namhyung@kernel.org>
Cc: Russel King <linux@armlinux.org.uk>
Cc: tongtiangen <tongtiangen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "kernel/events/uprobes: uprobe_write_opcode() rewrite", v3.
Currently, uprobe_write_opcode() implements COW-breaking manually, which
is really far from ideal. Further, there is interest in supporting
uprobes on hugetlb pages [1], and leaving at least the COW-breaking to the
core will make this much easier.
Also, I think the current code doesn't really handle some things properly
(see patch #3) when replacing/zapping pages.
Let's rewrite it, to leave COW-breaking to the fault handler, and handle
registration/unregistration by temporarily unmapping the anonymous page,
modifying it, and mapping it again. We still have to implement zapping of
anonymous pages ourselves, unfortunately.
We could look into not performing the temporary unmapping if we can
perform the write atomically, which would likely also make adding hugetlb
support a lot easier. But, limited (e.g., only PMD/PUD) hugetlb support
could be added on top of this with some tweaking.
Note that we now won't have to allocate another anonymous folio when
unregistering (which will be beneficial for hugetlb as well), we can
simply modify the already-mapped one from the registration (if any). When
registering a uprobe, we'll first trigger a ptrace-like write fault to
break COW, to then modify the already-mapped page.
Briefly sanity tested with perf probes and with the bpf uprobes selftest.
This patch (of 3):
Pass VMA instead of MM to remove_breakpoint() and remove the "MM" argument
from install_breakpoint(), because it can easily be derived from the VMA.
Link: https://lkml.kernel.org/r/20250321113713.204682-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250321113713.204682-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namhyung kim <namhyung@kernel.org>
Cc: Russel King <linux@armlinux.org.uk>
Cc: tongtiangen <tongtiangen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As the comments of page_mapcount_is_type() indicate, the parameter passed
to the function should be one more than page->_mapcount. However,
page->_mapcount is passed to the function by commit 4ffca5a96678 ("mm:
support only one page_type per page") where page_type_has_type() is
replaced by page_mapcount_is_type(), but the parameter isn't adjusted.
Fix the parameter for page_mapcount_is_type() to be (page->__mapcount +
1). Note that the issue doesn't cause any visible impacts due to the
safety gap introduced by PGTY_mapcount_underflow limit.
[akpm@linux-foundation.org: simplify __dump_folio(), per David]
Link: https://lkml.kernel.org/r/20250321120222.1456770-3-gshan@redhat.com
Fixes: 4ffca5a96678 ("mm: support only one page_type per page")
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: gehao <gehao@kylinos.cn>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Fix parameter passed to page_mapcount_is_type()", v2.
Found by code inspection. There are two places where the parameter passed
to page_mapcount_is_type() is (page->_mapcount), which is incorrect since
it should be one more than the value, as explained in the comments to
page_mapcount_is_type(): (a) page_has_type() in page-flags.h (b)
__dump_folio() in mm/debug.c
PATCH[1] fixes the parameter for (a)
PATCH[2] fixes the parameter for (b)
Note that the issue doesn't cause any visible impacts due to the safety
gap introduced by PGTY_mapcount_underflow limit. So the tag 'Cc:
stable@vger.kernel.org' isn't needed.
This patch (of 2):
As the comments of page_mapcount_is_type() indicate, the parameter passed
to the function should be one more than page->_mapcount. However,
page->_mapcount (equivalent to page->page_type) is passed to the function
by commit 4ffca5a96678 ("mm: support only one page_type per page")
page_type_has_type() is replaced by page_mapcount_is_type(), but the
parameter isn't adjusted.
Fix it by replacing page_mapcount_is_type() with page_type_has_type() in
page_has_type(). Note that the issue doesn't cause any visible impacts
due to the safety gap introduced by PGTY_mapcount_underflow limit.
Link: https://lkml.kernel.org/r/20250321120222.1456770-1-gshan@redhat.com
Link: https://lkml.kernel.org/r/20250321120222.1456770-2-gshan@redhat.com
Fixes: 4ffca5a96678 ("mm: support only one page_type per page")
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: gehao <gehao@kylinos.cn>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a selftest to verify the PAGEMAP_SCAN ioctl correctly reports guard
regions using the newly introduced PAGE_IS_GUARD flag.
Link: https://lkml.kernel.org/r/20250324065328.107678-4-avagin@google.com
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Required for a new PAGEMAP_SCAN test to verify guard region reporting.
Link: https://lkml.kernel.org/r/20250324065328.107678-3-avagin@google.com
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard
regions", v2.
Introduce the PAGE_IS_GUARD flag in the PAGEMAP_SCAN ioctl to expose
information about guard regions. This allows userspace tools, such as
CRIU, to detect and handle guard regions.
Currently, CRIU utilizes PAGEMAP_SCAN as a more efficient alternative to
parsing /proc/pid/pagemap. Without this change, guard regions are
incorrectly reported as swap-anon regions, leading CRIU to attempt dumping
them and subsequently failing.
The series includes updates to the documentation and selftests to reflect
the new functionality.
This patch (of 3):
Introduce the PAGE_IS_GUARD flag in the PAGEMAP_SCAN ioctl to expose
information about guard regions. This allows userspace tools, such as
CRIU, to detect and handle guard regions.
Link: https://lkml.kernel.org/r/20250324065328.107678-1-avagin@google.com
Link: https://lkml.kernel.org/r/20250324065328.107678-2-avagin@google.com
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove unused headers includes from zsmalloc and move pagemap.h and
migrate.h includes into zpdesc header.
Link: https://lkml.kernel.org/r/20250325080427.3449359-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Check whether PTRACE_SET_SYSCALL_INFO semantics implemented in the kernel
matches userspace expectations.
Link: https://lkml.kernel.org/r/20250303112052.GG24170@strace.io
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexey Gladkov (Intel) <legion@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: anton ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Charlie Jenkins <charlie@rivosinc.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Zankel <chris@zankel.net>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davide Berardi <berardi.dav@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Eugene Syromyatnikov <evgsyr@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Renzo Davoi <renzo@cs.unibo.it>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements
PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of system
calls the tracee is blocked in.
This API allows ptracers to obtain and modify system call details in a
straightforward and architecture-agnostic way, providing a consistent way
of manipulating the system call number and arguments across architectures.
As in case of PTRACE_GET_SYSCALL_INFO, PTRACE_SET_SYSCALL_INFO also does
not aim to address numerous architecture-specific system call ABI
peculiarities, like differences in the number of system call arguments for
such system calls as pread64 and preadv.
The current implementation supports changing only those bits of system
call information that are used by strace system call tampering, namely,
syscall number, syscall arguments, and syscall return value.
Support of changing additional details returned by
PTRACE_GET_SYSCALL_INFO, such as instruction pointer and stack pointer,
could be added later if needed, by using struct ptrace_syscall_info.flags
to specify the additional details that should be set. Currently, "flags"
and "reserved" fields of struct ptrace_syscall_info must be initialized
with zeroes; "arch", "instruction_pointer", and "stack_pointer" fields are
currently ignored.
PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY,
PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations.
Other operations could be added later if needed.
Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with
PTRACE_GET_SYSCALL_INFO, but it didn't happen. The last straw that
convinced me to implement PTRACE_SET_SYSCALL_INFO was apparent failure to
provide an API of changing the first system call argument on riscv
architecture.
ptrace(2) man page:
long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
...
PTRACE_SET_SYSCALL_INFO
Modify information about the system call that caused the stop.
The "data" argument is a pointer to struct ptrace_syscall_info
that specifies the system call information to be set.
The "addr" argument should be set to sizeof(struct ptrace_syscall_info)).
Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
Link: https://lkml.kernel.org/r/20250303112044.GF24170@strace.io
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Eugene Syromiatnikov <esyr@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: anton ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Zankel <chris@zankel.net>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davide Berardi <berardi.dav@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Eugene Syromyatnikov <evgsyr@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Renzo Davoi <renzo@cs.unibo.it>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Move the code that calculates the type of the system call stop out of
ptrace_get_syscall_info() into a separate function
ptrace_get_syscall_info_op() which is going to be used later to implement
PTRACE_SET_SYSCALL_INFO API.
Link: https://lkml.kernel.org/r/20250303112038.GE24170@strace.io
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexey Gladkov (Intel) <legion@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: anton ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Charlie Jenkins <charlie@rivosinc.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Zankel <chris@zankel.net>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davide Berardi <berardi.dav@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Eugene Syromyatnikov <evgsyr@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Renzo Davoi <renzo@cs.unibo.it>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Similar to syscall_set_arguments() that complements
syscall_get_arguments(), introduce syscall_set_nr() that complements
syscall_get_nr().
syscall_set_nr() is going to be needed along with syscall_set_arguments()
on all HAVE_ARCH_TRACEHOOK architectures to implement
PTRACE_SET_SYSCALL_INFO API.
Link: https://lkml.kernel.org/r/20250303112020.GD24170@strace.io
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Reviewed-by: Maciej W. Rozycki <macro@orcam.me.uk> # mips
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexey Gladkov (Intel) <legion@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: anton ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Zankel <chris@zankel.net>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davide Berardi <berardi.dav@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Eugene Syromyatnikov <evgsyr@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Renzo Davoi <renzo@cs.unibo.it>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This function is going to be needed on all HAVE_ARCH_TRACEHOOK
architectures to implement PTRACE_SET_SYSCALL_INFO API.
This partially reverts commit 7962c2eddbfe ("arch: remove unused function
syscall_set_arguments()") by reusing some of old syscall_set_arguments()
implementations.
[nathan@kernel.org: fix compile time fortify checks]
Link: https://lkml.kernel.org/r/20250408213131.GA2872426@ax162
Link: https://lkml.kernel.org/r/20250303112009.GC24170@strace.io
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Reviewed-by: Maciej W. Rozycki <macro@orcam.me.uk> [mips]
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexey Gladkov (Intel) <legion@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: anton ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Zankel <chris@zankel.net>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davide Berardi <berardi.dav@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Eugene Syromyatnikov <evgsyr@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Renzo Davoi <renzo@cs.unibo.it>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "ptrace: introduce PTRACE_SET_SYSCALL_INFO API", v7.
PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements
PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of system
calls the tracee is blocked in.
This API allows ptracers to obtain and modify system call details in a
straightforward and architecture-agnostic way, providing a consistent way
of manipulating the system call number and arguments across architectures.
As in case of PTRACE_GET_SYSCALL_INFO, PTRACE_SET_SYSCALL_INFO also does
not aim to address numerous architecture-specific system call ABI
peculiarities, like differences in the number of system call arguments for
such system calls as pread64 and preadv.
The current implementation supports changing only those bits of system
call information that are used by strace system call tampering, namely,
syscall number, syscall arguments, and syscall return value.
Support of changing additional details returned by
PTRACE_GET_SYSCALL_INFO, such as instruction pointer and stack pointer,
could be added later if needed, by using struct ptrace_syscall_info.flags
to specify the additional details that should be set. Currently, "flags"
and "reserved" fields of struct ptrace_syscall_info must be initialized
with zeroes; "arch", "instruction_pointer", and "stack_pointer" fields are
currently ignored.
PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY,
PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations.
Other operations could be added later if needed.
Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with
PTRACE_GET_SYSCALL_INFO, but it didn't happen. The last straw that
convinced me to implement PTRACE_SET_SYSCALL_INFO was apparent failure to
provide an API of changing the first system call argument on riscv
architecture [1].
ptrace(2) man page:
long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
...
PTRACE_SET_SYSCALL_INFO
Modify information about the system call that caused the stop.
The "data" argument is a pointer to struct ptrace_syscall_info
that specifies the system call information to be set.
The "addr" argument should be set to sizeof(struct ptrace_syscall_info)).
This patch (of 6):
hexagon is the only architecture that provides HAVE_ARCH_TRACEHOOK but
doesn't define syscall_set_return_value(). Since this function is going
to be needed on all HAVE_ARCH_TRACEHOOK architectures to implement
PTRACE_SET_SYSCALL_INFO API, add it on hexagon, too.
Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/ [1]
Link: https://lkml.kernel.org/r/20250303111953.GB24170@strace.io
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexey Gladkov (Intel) <legion@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: anton ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Charlie Jenkins <charlie@rivosinc.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Zankel <chris@zankel.net>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davide Berardi <berardi.dav@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Eugene Syromyatnikov <evgsyr@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Renzo Davoi <renzo@cs.unibo.it>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Provide kernel-doc for free_pgd_range() so it's easier to understand what
the function does and how it is used.
Link: https://lkml.kernel.org/r/20250325181325.5774-1-soumish.das@gmail.com
Signed-off-by: SoumishDas <soumish.das@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace cluster_swap_free_nr() with swap_entries_put_[map/cache]() to
remove repeat code and leverage batch-remove for entries with last flag.
After removing cluster_swap_free_nr, only functions with "_nr" suffix
could free entries spanning cross clusters. Add corresponding description
in comment of swap_entries_put_map_nr() as is first function with "_nr"
suffix and have a non-suffix variant function swap_entries_put_map().
Link: https://lkml.kernel.org/r/20250325162528.68385-9-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Factor out helper swap_entries_put_cache() from put_swap_folio() to serve
as a general-purpose routine for dropping cache flag of entries within a
single cluster.
Link: https://lkml.kernel.org/r/20250325162528.68385-8-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
1. Factor out general swap_entries_put_map() helper to drop entries
belonging to one cluster. If entries are last map, free entries in
batch, otherwise put entries with cluster lock acquired and released
only once.
2. Iterate and call swap_entries_put_map() for each cluster in
swap_entries_put_nr() to leverage batch-remove for last map belonging
to one cluster and reduce lock acquire/release in fallback case.
3. As swap_entries_put_nr() won't handle SWAP_HSA_CACHE drop, rename
it to swap_entries_put_map_nr().
4. As we won't drop each entry invidually with swap_entry_put() now,
do reclaim in free_swap_and_cache_nr() because
swap_entries_put_map_nr() is general routine to drop reference and the
relcaim work should only be done in free_swap_and_cache_nr(). Remove
stale comment accordingly.
Link: https://lkml.kernel.org/r/20250325162528.68385-7-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The SWAP_MAP_SHMEM indicates last map from shmem. Therefore we can drop
SWAP_MAP_SHMEM in batch in similar way to drop last ref count in batch.
Link: https://lkml.kernel.org/r/20250325162528.68385-6-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use swap_entries_free() to directly free swap entries when the swap
entries are not cached and referenced, without needing to set swap entries
to set intermediate SWAP_HAS_CACHE state.
Link: https://lkml.kernel.org/r/20250325162528.68385-5-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In swap_entry_put_locked(), we will set slot to SWAP_HAS_CACHE before
using swap_entries_free() to do actual swap entry freeing. This introduce
an unnecessary intermediate state. By using swap_entries_free() in
swap_entry_put_locked(), we can eliminate the need to set slot to
SWAP_HAS_CACHE. This change would make the behavior of
swap_entry_put_locked() more consistent with other put() operations which
will do actual free work after put last reference.
Link: https://lkml.kernel.org/r/20250325162528.68385-4-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The original VM_BUG_ON only allows swap_entry_range_free() to drop last
SWAP_HAS_CACHE ref. By allowing other kind of last ref in VM_BUG_ON,
swap_entry_range_free() could be a more general-purpose function able to
handle all kind of last ref. Following thi change, also rename
swap_entry_range_free() to swap_entries_free() and update it's comment
accordingly.
This is a preparation to use swap_entries_free() to drop more kind of last
ref other than SWAP_HAS_CACHE.
[shikemeng@huaweicloud.com: add __maybe_unused attribute for swap_is_last_ref() and update comment]
Link: https://lkml.kernel.org/r/20250410153908.612984-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20250325162528.68385-3-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Tested-by: SeongJae Park <sj@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
swap_[entry/entries]_put[_locked]
Patch series "Minor cleanups and improvements to swap freeing code", v4.
This series contains some cleanups and improvements which are made
during learning swapfile. Here is a summary of the changes:
1. Function naming improvments.
- Use "put" instead of "free" to name functions which only do actual
free when count drops to zero.
- Use "entry" to name function only frees one swap slot. Use
"entries" to name function could may free multi swap slots within one
cluster. Use "_nr" suffix to name function which could free multi
swap slots spanning cross multi clusters.
2. Eliminate the need to set swap slot to intermediate SWAP_HAS_CACHE
value before do actual free by using swap_entry_range_free()
3. Add helpers swap_entries_put_map() and swap_entries_put_cache() as
a general-purpose routine to free swap entries within a single cluster
which will try batch-remove first and fallback to put eatch entry
indvidually with cluster lock acquired/released only once. By using
these helpers, we could remove repeated code, levarage batch-remove in
more cases and aoivd to acquire/release cluster lock for each single
swap entry.
This patch (of 8):
In __swap_entry_free[_locked] and __swap_entries_free, we decrease count
first and only free swap entry if count drops to zero. This behavior is
more akin to a put() operation rather than a free() operation. Therefore,
rename these functions with "put" instead of "free". Additionally, add
"_nr" suffix to swap_entries_put to indicate the input range may span swap
clusters.
Link: https://lkml.kernel.org/r/20250325162528.68385-1-shikemeng@huaweicloud.com
Link: https://lkml.kernel.org/r/20250325162528.68385-2-shikemeng@huaweicloud.com
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The replace_stock_objcg() is being called by only refill_obj_stock, so
manually inline it.
Link: https://lkml.kernel.org/r/20250404013913.1663035-10-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When handing slab objects, we use obj_cgroup_[un]charge() for (un)charging
and mod_objcg_state() to account NR_SLAB_[UN]RECLAIMABLE_B. All these
operations use the percpu stock for performance. However with the calls
being separate, the stock_lock is taken twice in each case.
By refactoring the code, we can turn mod_objcg_state() into
__account_obj_stock() which is called on a stock that's already locked and
validated. On the charging side we can call this function from
consume_obj_stock() when it succeeds, and refill_obj_stock() in the
fallback. We just expand parameters of these functions as necessary. The
uncharge side from __memcg_slab_free_hook() is just the call to
refill_obj_stock().
Other callers of obj_cgroup_[un]charge() (i.e. not slab) simply pass the
extra parameters as NULL/zeroes to skip the __account_obj_stock()
operation.
In __memcg_slab_post_alloc_hook() we now charge each object separately,
but that's not a problem as we did call mod_objcg_state() for each object
separately, and most allocations are non-bulk anyway. This could be
improved by batching all operations until slab_pgdat(slab) changes.
Some preliminary benchmarking with a kfree(kmalloc()) loop of 10M
iterations with/without __GFP_ACCOUNT:
Before the patch:
kmalloc/kfree !memcg: 581390144 cycles
kmalloc/kfree memcg: 783689984 cycles
After the patch:
kmalloc/kfree memcg: 658723808 cycles
More than half of the overhead of __GFP_ACCOUNT relative to
non-accounted case seems eliminated.
Link: https://lkml.kernel.org/r/20250404013913.1663035-9-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
For non-PREEMPT_RT kernels, drain_obj_stock() is always called with irq
disabled, so we can use __mod_memcg_state() instead of mod_memcg_state().
For PREEMPT_RT, we need to add memcg_stats_[un]lock in
__mod_memcg_state().
Link: https://lkml.kernel.org/r/20250404013913.1663035-8-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Previously we could not call obj_cgroup_put() inside the local lock
because on the put on the last reference, the release function
obj_cgroup_release() may try to re-acquire the local lock. However that
chain has been broken. Now simply do obj_cgroup_put() inside
drain_obj_stock() instead of returning the old objcg.
Link: https://lkml.kernel.org/r/20250404013913.1663035-7-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
obj_cgroup_release is called when all the references to the objcg have
been released i.e. no more memory objects are pointing to it. Most
probably objcg->memcg will be pointing to some ancestor memcg. In
obj_cgroup_release(), the kernel calls obj_cgroup_uncharge_pages() which
refills the local stock.
There is no need to refill the local stock with some ancestor memcg and
flush the local stock. Let's decouple obj_cgroup_release() from the local
stock by uncharging instead of refilling. One additional benefit of this
change is that it removes the requirement to only call obj_cgroup_put()
outside of local_lock.
Link: https://lkml.kernel.org/r/20250404013913.1663035-6-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There are no more multiple callers of __refill_stock(), so simply inline
it to refill_stock().
Link: https://lkml.kernel.org/r/20250404013913.1663035-5-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
At multiple places in memcontrol.c, the memory and memsw page counters are
being uncharged. This is error-prone. Let's move the functionality to a
newly introduced memcg_uncharge and call it from all those places.
Link: https://lkml.kernel.org/r/20250404013913.1663035-4-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently drain_obj_stock() can potentially call __refill_stock which
accesses local cpu stock and thus requires memcg stock's local_lock.
However if we look at the code paths leading to drain_obj_stock(), there
is never a good reason to refill the memcg stock at all from it.
At the moment, drain_obj_stock can be called from reclaim, hotplug cpu
teardown, mod_objcg_state() and refill_obj_stock(). For reclaim and
hotplug there is no need to refill. For the other two paths, most
probably the newly switched objcg would be used in near future and thus no
need to refill stock with the older objcg.
In addition, __refill_stock() from drain_obj_stock() happens on rare
cases, so performance is not really an issue. Let's just uncharge
directly instead of refill which will also decouple drain_obj_stock from
local cpu stock and local_lock requirements.
Link: https://lkml.kernel.org/r/20250404013913.1663035-3-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
refill_stock can not be called with root memcg, so there is no need to
check it. Instead add a warning if root is ever passed to it.
Link: https://lkml.kernel.org/r/20250404013913.1663035-1-shakeel.butt@linux.dev
Link: https://lkml.kernel.org/r/20250404013913.1663035-2-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The vmalloc region can either be charged to a single memcg or none. At
the moment kernel traverses all the pages backing the vmalloc region to
update the MEMCG_VMALLOC stat. However there is no need to look at all
the pages as all those pages will be charged to a single memcg or none.
Simplify the MEMCG_VMALLOC update by just looking at the first page of the
vmalloc region.
[shakeel.butt@linux.dev: add comment]
Link: https://lkml.kernel.org/r/bmlkdbqgwboyqrnxyom7n52fjmo76ux77jhqw5odc6c6dfon3h@zdylwtmlywbt
Link: https://lkml.kernel.org/r/20250403053326.26860-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Reduce the diff between low and high watermarks when compaction
proactiveness is set to high. This allows users who set the proactiveness
really high to have more stable fragmentation score over time.
Link: https://lkml.kernel.org/r/20250404111103.1994507-3-mclapinski@google.com
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/compaction: allow more aggressive proactive compaction",
v4.
Our goal is to keep memory usage of a VM low on the host. For that
reason, we use free page reporting which by default reports free pages of
order 9 and larger to the host to be freed. The feature works well only
if the memory in the guest is not fragmented below pages of order 9.
Proactive compaction can be reused to achieve defragmentation after some
parameter tweaking.
When the fragmentation score (lower is better) gets larger than the high
watermark, proactive compaction kicks in. Compaction stops when the score
goes below the low watermark (or no progress is made and backoff kicks
in). Let's define the difference between high and low watermarks as
leeway. Before these changes, the minimum possible value for low
watermark was 5 and the leeway was hardcoded to 10 (so minimum possible
value for high watermark was 15).
To test this, I created a VM with 19GB of memory and free page reporting
enabled. The VM was ~idle. I meassured the memory usage from inside the
guest (/proc/meminfo) and from the host (provided by the hypervisor).
Before:
https://drive.google.com/file/d/1Xw23lRry_PgEH3f6QRnSGvoHh2u9UHyI/view?usp=sharing
After:
https://drive.google.com/file/d/1wMhpIzepx6t44F70yCPA50n1S5V2AT-a/view?usp=sharing
This patch (of 2):
Previously a min cap of 5 has been set in the commit introducing proactive
compaction. This was to make sure users don't hurt themselves by setting
the proactiveness to 100 and making their system unresponsive. But the
compaction mechanism has a backoff mechanism that will sleep for 30s if no
progress is made, so I don't see a significant risk here. My system (19GB
of memory) has been perfectly fine with both watermarks hardcoded to 0.
Link: https://lkml.kernel.org/r/20250404111103.1994507-1-mclapinski@google.com
Link: https://lkml.kernel.org/r/20250404111103.1994507-2-mclapinski@google.com
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Refactor free_page_is_bad() to call bad_page() directly, removing the
intermediate free_page_is_bad_report(). This reduces unnecessary
indirection, improving code clarity and maintainability without changing
functionality.
Link: https://lkml.kernel.org/r/20250328012031.1204993-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The writeback interface supports a page_index=N parameter which performs
writeback of the given page. Since we rarely need to writeback just one
single page, the typical use case involves a number of writeback calls,
each performing writeback of one page:
echo page_index=100 > zram0/writeback
...
echo page_index=200 > zram0/writeback
echo page_index=500 > zram0/writeback
...
echo page_index=700 > zram0/writeback
One obvious downside of this is that it increases the number of syscalls.
Less obvious, but a significantly more important downside, is that when
given only one page to post-process zram cannot perform an optimal target
selection. This becomes a critical limitation when writeback_limit is
enabled, because under writeback_limit we want to guarantee the highest
memory savings hence we first need to writeback pages that release the
highest amount of zsmalloc pool memory.
This patch adds page_indexes=LOW-HIGH parameter to the writeback
interface:
echo page_indexes=100-200 page_indexes=500-700 > zram0/writeback
This gives zram a chance to apply an optimal target selection strategy on
each iteration of the writeback loop.
We also now permit multiple page_index parameters per call (previously
zram would recognize only one page_index) and a mix or single pages and
page ranges:
echo page_index=42 page_index=99 page_indexes=100-200 \
page_indexes=500-700 > zram0/writeback
Apart from that the patch also unifies parameters passing and resembles
other "modern" zram device attributes (e.g. recompression), while the old
interface used a mixed scheme: values-less parameters for mode and a
key=value format for page_index. We still support the "old" value-less
format for compatibility reasons.
[senozhatsky@chromium.org: simplify parse_page_index() range checks, per Brian]
nk: https://lkml.kernel.org/r/20250404015327.2427684-1-senozhatsky@chromium.org
[sozhatsky@chromium.org: fix uninitialized variable in zram_writeback_slots(), per Dan]
nk: https://lkml.kernel.org/r/20250409112611.1154282-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20250327015818.4148660-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Richard Chang <richardycc@google.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Cppcheck warning:
int result is assigned to long long variable. If the variable is long long
to avoid loss of information, then you have loss of information.
This patch changes the type of page_size from 'unsigned int' to
'unsigned long' instead of using ULL suffixes. Changing hpage_size to
'unsigned long' was considered, but since gethugepage() expects an int,
this change was avoided.
Link: https://lkml.kernel.org/r/20250403101345.29226-1-siddarthsgml@gmail.com
Signed-off-by: Siddarth G <siddarthsgml@gmail.com>
Reported-by: David Binderman <dcb314@hotmail.com>
Closes: https://lore.kernel.org/all/AS8PR02MB10217315060BBFDB21F19643E9CA62@AS8PR02MB10217.eurprd02.prod.outlook.com/
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Refactors the si_meminfo_node() function by reducing redundant code and
improving readability.
Moved the calculation of managed_pages inside the existing loop that
processes pgdat->node_zones, eliminating the need for a separate loop.
Simplified the logic by removing unnecessary preprocessor conditionals.
Ensured that both totalram, totalhigh, and other memory statistics are
consistently set without duplication.
This change results in cleaner and more efficient code without altering
functionality.
Link: https://lkml.kernel.org/r/20250325073803.852594-1-ye.liu@linux.dev
Signed-off-by: Ye Liu <liuye@kylinos.cn>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
mm_struct.hiwater_rss can be accessed concurrently without proper
synchronization as reported by KCSAN.
This data race is benign as it only affects accounting information.
Annotate it with data_race() to make KCSAN happy.
Link: https://lkml.kernel.org/r/20250331-mm-maxrss-data-race-v2-1-cf958e6205bf@iencinas.com
Signed-off-by: Ignacio Encinas <ignacio@iencinas.com>
Reported-by: syzbot+419c4b42acc36c420ad3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67e3390c.050a0220.1ec46.0001.GAE@google.com/
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Pedro Falcato <pfalcato@suse.de>
Cc: Liam Howlett <liam.howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use a folio in the hugetlb pathway during the compaction migrate-able
pageblock scan.
This removes a call to compound_head().
Link: https://lkml.kernel.org/r/20250401021025.637333-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Capacity is stranded when CFMWS regions are not aligned to block size. On
x86, block size increases with capacity (2G blocks @ 64G capacity).
Use CFMWS base/size to report memory block size alignment advice.
Link: https://lkml.kernel.org/r/20250127153405.3379117-4-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Bruno Faccini <bfaccini@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haibo Xu <haibo1.xu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <rrichter@amd.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Systems with hotplug may provide an advisement value on what the memblock
size should be. Probe this value when the rest of the configuration
values are considered.
The new heuristic is as follows
1) set_memory_block_size_order value if already set (cmdline param)
2) minimum block size if memory is less than large block limit
3) if no hotplug advice: Max block size if system is bare-metal,
otherwise use end of memory alignment.
4) if hotplug advice: lesser of advice and end of memory alignment.
Convert to cpu_feature_enabled() while at it.[1]
[1] https://lore.kernel.org/all/20241031103401.GBZyNdGQ-ZyXKyzC_z@fat_crate.local/
Link: https://lkml.kernel.org/r/20250127153405.3379117-3-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Suggested-by: Borislav Petkov <bp@alien8.de>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Bruno Faccini <bfaccini@nvidia.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haibo Xu <haibo1.xu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Robert Richter <rrichter@amd.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "memory,x86,acpi: hotplug memory alignment advisement", v8.
When physical address regions are not aligned to memory block size, the
misaligned portion is lost (stranded capacity).
Block size (min/max/selected) is architecture defined. Most architectures
tend to use the minimum block size or some simplistic heurist. On x86,
memory block size increases up to 2GB, and is otherwise fitted to the
alignment of non-hotplug (i.e. not special purpose memory).
CXL exposes its memory for management through the ACPI CEDT (CXL Early
Detection Table) in a field called the CXL Fixed Memory Window. Per the
CXL specification, this memory must be aligned to at least 256MB.
When a CFMW aligns on a size less than the block size, this causes a loss
of up to 2GB per CFMW on x86. It is not uncommon for CFMW to be allocated
per-device - though this behavior is BIOS defined.
This patch set provides 3 things:
1) implement advise/query functions in driverse/base/memory.c to
report/query architecture agnostic hotplug block alignment advice.
2) update x86 memblock size logic to consider the hotplug advice
3) add code in acpi/numa/srat.c to report CFMW alignment advice
The advisement interfaces are design to be called during arch_init code
prior to allocator and smp_init. start_kernel will call these through
setup_arch() (via acpi and mm/init_64.c on x86), which occurs prior to
mm_core_init and smp_init - so no need for atomics.
There's an attempt to signal callers to advise() that query has already
occurred, but this is predicated on the notion that query actually occurs
(which presently only happens on the x86 arch). This is to assist
debugging future users. Otherwise, the advise() call has been marked
__init to help static discovery of bad call times.
Once query is called the first time, it will always return the same value.
Interfaces return -EBUSY and 0 respectively on systems without hotplug.
This patch (of 3):
Hotplug memory sources may have opinions on what the memblock size should
be - usually for alignment purposes. For example, CXL memory extents can
be 256MB with a matching alignment. If this size/alignment is smaller
than the block size, it can result in stranded capacity.
Implement memory_block_advise_max_size for use prior to allocator init,
for software to advise the system on the max block size.
Implement memory_block_probe_max_size for use by arch init code to
calculate the best block size. Use of advice is architecture defined.
The probe value can never change after first probe. Calls to advise after
probe will return -EBUSY to aid debugging.
On systems without hotplug, always return -ENODEV and 0 respectively.
Link: https://lkml.kernel.org/r/20250127153405.3379117-1-gourry@gourry.net
Link: https://lkml.kernel.org/r/20250127153405.3379117-2-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Bruno Faccini <bfaccini@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haibo Xu <haibo1.xu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Robert Richter <rrichter@amd.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|