The test has an initialization step during which threads are created. To
prevent the workers from starting prematurely, a write lock was previously
used by the main setup thread, while each worker would block on a read
lock.
Replace this RWSEM based synchronization with a simpler SRCU based
approach, which consists of two basic steps:
- The main thread wraps the setup phase in an SRCU read-side critical
section, i.e. an srcu_read_lock()/srcu_read_unlock() pair.
- Each worker calls synchronize_srcu() on entry, ensuring it waits for
the initialization phase to be completed.
This patch eliminates the need for the down_read()/up_read() and
down_write()/up_write() pairs, thus simplifying the logic and improving
clarity.
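Conceptually, the new synchronization looks roughly like the sketch below
(illustrative only; prepare_test_env(), run_worker() and the test_srcu
domain are made-up names, not the actual test code):

#include <linux/srcu.h>

DEFINE_STATIC_SRCU(test_srcu);

static void setup_phase(void)
{
        int idx;

        idx = srcu_read_lock(&test_srcu);       /* enter read-side section */
        prepare_test_env();                     /* create worker threads, etc. */
        srcu_read_unlock(&test_srcu, idx);      /* setup complete */
}

static int worker_fn(void *arg)
{
        /* Blocks until the setup read-side critical section has ended. */
        synchronize_srcu(&test_srcu);
        return run_worker(arg);
}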
[urezki@gmail.com: fix compile error with CONFIG_TINY_RCU]
Link: https://lkml.kernel.org/r/20250420142029.103169-1-urezki@gmail.com
Link: https://lkml.kernel.org/r/20250417161216.88318-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Adrian Huang <ahuang12@lenovo.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Move IDLE pages tracking into a separate chapter because there are
multiple features that use (or depend on) it, either in the built-in
variant ("mark all") or in the extended variant (ac-time tracking).
In addition, recompression doesn't require memory tracking to be enabled
in order to be able to perform idle recompression.
Link: https://lkml.kernel.org/r/20250416042833.3858827-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Shin Kawamura <kawasin@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
After the check for queue_folio_required(), the code only cares about the
folio in the for loop, i.e. the PTEs are redundant. Therefore, optimize
this loop by skipping over a PTE batch mapping the same folio.
With a test program migrating pages of the calling process, which includes
a mapped VMA of size 4GB with pte-mapped large folios of order-9, and
migrating once back and forth between node-0 and node-1, the average
execution time reduces from 7.5 to 4 seconds, giving an approx 47% speedup.
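For reference, the shape of the optimization is roughly the following
sketch (count_ptes_mapping_folio() is an illustrative stand-in for the
real batching helper, not an existing function):

/* Inside the PTE range scan, once the folio has been handled: */
nr = count_ptes_mapping_folio(pte, folio, max_nr);      /* illustrative */
pte += nr - 1;                  /* skip the rest of the PTE batch */
addr += (nr - 1) * PAGE_SIZE;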
Link: https://lkml.kernel.org/r/20250416053048.96479-1-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently the VMA and mmap locking logic is entangled in two of the most
overwrought files in mm - include/linux/mm.h and mm/memory.c. Separate
this logic out so we can more easily make changes and create an
appropriate MAINTAINERS entry that spans only the logic relating to
locking.
This should have no functional change. Care is taken to avoid dependency
loops; as a result we must regrettably keep release_fault_lock() and
assert_fault_locked() in mm.h due to their dependence on the vm_fault type.
Additionally we must declare rcuwait_wake_up() manually to avoid a
dependency cycle on linux/rcuwait.h.
Also move the nommu implementation of lock_mm_and_find_vma() to
mmap_lock.c so everything lock-related is in one place.
Link: https://lkml.kernel.org/r/bec6c8e29fa8de9267a811a10b1bdae355d67ed4.1744799282.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Memory cgroup accounting is expensive and, to reduce the cost, the kernel
maintains a per-cpu charge cache for a single memcg. So, if a charge
request comes for a different memcg, the kernel will flush the old memcg's
charge cache, charge the new memcg a fixed amount (64 pages), subtract the
requested amount, and store the remainder in the per-cpu charge cache for
the new memcg.
This mechanism is based on the assumption that the kernel, for locality,
keeps a process on a CPU for a long period of time and that most of the
charge requests from that process will be served by that CPU's local
charge cache.
However this assumption breaks down for incoming network traffic on a
multi-tenant machine. We are in the process of running multiple workloads
on a single machine, and if such workloads are network heavy, we are
seeing very high network memory accounting cost. We have observed
multiple CPUs spending almost 100% of their time in net_rx_action, and
almost all of that time is spent in memcg accounting of the network
traffic.
More precisely, net_rx_action is serving packets from multiple workloads
and is observing/serving a mix of packets of these workloads. The memcg
switch of the per-cpu cache is very expensive and we are observing a lot
of memcg switches on the machine. Almost all the time is being spent on
charging the new memcg and flushing the older memcg's cache. So we
definitely need a per-cpu cache that supports multiple memcgs for this
scenario.
This patch implements a simple (and dumb) multiple-memcg per-cpu charge
cache. We actually started with a more sophisticated LRU-based approach,
but the dumb one was consistently better than the sophisticated one by 1%
to 3%, so we are going with the simple approach.
Some of the design choices are (a rough sketch follows the list):
1. Fit all cached memcgs in a single cacheline.
2. The cache array can be a mix of empty slots and memcg-charged slots, so
the kernel has to traverse the full array.
3. The cache drain from reclaim will drain all cached memcgs, to keep
things simple.
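To illustrate, a rough sketch of the resulting per-cpu structure and
lookup follows (names and the slot count are illustrative, not necessarily
the actual implementation):

#define NR_MEMCG_CACHE_SLOTS   7               /* fits in one cacheline */

struct memcg_charge_slot {                     /* illustrative names */
        struct mem_cgroup *memcg;
        unsigned int nr_pages;
};

struct memcg_charge_cache {
        struct memcg_charge_slot slots[NR_MEMCG_CACHE_SLOTS];
};

/* Try to serve a charge locally; all slots are scanned (design choice 2). */
static bool charge_from_cache(struct memcg_charge_cache *cache,
                              struct mem_cgroup *memcg, unsigned int nr_pages)
{
        int i;

        for (i = 0; i < NR_MEMCG_CACHE_SLOTS; i++) {
                if (cache->slots[i].memcg == memcg &&
                    cache->slots[i].nr_pages >= nr_pages) {
                        cache->slots[i].nr_pages -= nr_pages;
                        return true;
                }
        }
        return false;           /* fall back to the slow-path charge */
}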
To evaluate the impact of this optimization, we ran the following workload
on a 72-CPU machine, where each netperf client runs in a different cgroup.
The next-20250415 kernel is used as the base.
$ netserver -6
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
number of clients | Without patch | With patch
6 | 42584.1 Mbps | 48603.4 Mbps (14.13% improvement)
12 | 30617.1 Mbps | 47919.7 Mbps (56.51% improvement)
18 | 25305.2 Mbps | 45497.3 Mbps (79.79% improvement)
24 | 20104.1 Mbps | 37907.7 Mbps (88.55% improvement)
30 | 14702.4 Mbps | 30746.5 Mbps (109.12% improvement)
36 | 10801.5 Mbps | 26476.3 Mbps (145.11% improvement)
The results show drastic improvement for network intensive workloads.
[shakeel.butt@linux.dev: add BUILD_BUG_ON() for MEMCG_CHARGE_BATCH]
Link: https://lkml.kernel.org/r/rlsgeosg3j7v5nihhbxxxbv3xfy4ejvigihj7lkkbt3n6imyne@2apxx2jm2e57
[shakeel.butt@linux.dev: simplify refill_stock]
Link: https://lkml.kernel.org/r/as5cdsm4lraxupg3t6onep2ixql72za25hvd4x334dsoyo4apr@zyzl4vkuevuv
[hughd@google.com: it's better to stock nr_pages than the uninitialized stock_pages]
Link: https://lkml.kernel.org/r/d542d18f-1caa-6fea-e2c3-3555c87bcf64@google.com
[shakeel.butt@linux.dev: add comment per Michal and use DEFINE_PER_CPU_ALIGNED instead of DEFINE_PER_CPU per Vlastimil]
Link: https://lkml.kernel.org/r/dieeei3squ2gcnqxdjayvxbvzldr266rhnvtl3vjzsqevxkevf@ckui5vjzl2qg
Link: https://lkml.kernel.org/r/20250416180229.2902751-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
free_page_and_swap_cache() takes a struct page pointer as its input
parameter, but it immediately converts it to a folio, and all following
operations use the folio instead of the page. It makes more sense to pass
in the folio directly.
Convert free_page_and_swap_cache() to free_folio_and_swap_cache() to
consume folio directly.
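A rough sketch of the conversion (signatures simplified):

/* Before: callers pass a page that is immediately turned into a folio. */
void free_page_and_swap_cache(struct page *page);

/* After: callers hand over the folio directly; a caller that already
 * holds a folio no longer needs to go through &folio->page. */
void free_folio_and_swap_cache(struct folio *folio);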
Link: https://lkml.kernel.org/r/20250416201720.41678-1-nifan.cxl@gmail.com
Signed-off-by: Fan Ni <fan.ni@samsung.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Adam Manzanares <a.manzanares@samsung.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit c928807f6f6b6 ("mm/page_alloc: keep track of free highatomic")
adds a new variable, nr_free_highatomic, which is useful for analyzing
low memory issues. Add nr_free_highatomic to show_free_areas.
Signed-off-by: gao xu <gaoxu2@honor.com>
Link: https://lkml.kernel.org/r/d92eeff74f7a4578a14ac777cfe3603a@honor.com
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The scan and total_scan variables can be initialized to 0 when they are
defined, replacing the separate assignment statements.
Link: https://lkml.kernel.org/r/20250417092422.1333620-1-hao.ge@linux.dev
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Acked-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patch just fixes a typo in the comment.
Link: https://lkml.kernel.org/r/20250411073800.1444481-1-lienze@kylinos.cn
Signed-off-by: Enze Li <lienze@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The code style in fault_in_readable() and fault_in_writeable() is a little
inconsistent with fault_in_safe_writeable(). In fault_in_readable() and
fault_in_writeable(), the 'uaddr' passed in is used as the loop cursor,
while in fault_in_safe_writeable() the local variable 'start' is used as
the loop cursor. This may mislead people when reading or changing the
code.
Here, define an explicit loop cursor and use a for loop to simplify the
code in these three functions. This cleanup makes them consistent in code
style and improves readability.
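The gist of the cleanup, as a simplified sketch (not the actual kernel
code; the user-access setup and error handling are omitted, and
touch_user_page() is an illustrative stand-in for the per-page fault-in):

/* Before: the 'uaddr' parameter itself is advanced as the cursor. */
while (uaddr < end) {
        touch_user_page(uaddr);
        uaddr += PAGE_SIZE;
}

/* After: a dedicated cursor and a for loop, the same shape in all three
 * fault_in_*() functions. */
const char __user *cur;

for (cur = uaddr; cur < end; cur += PAGE_SIZE)
        touch_user_page(cur);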
[bhe@redhat.com: address minor concerns from David]
Link: https://lkml.kernel.org/r/Z/sbv3EmLXWgEE7+@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20250410035717.473207-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yanjun.Zhu <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In the current kernel, only pud huge pages are supported on some
architectures; p4d and pgd huge pages haven't been supported yet. And in
mm/gup.c, there's no pgd huge page handling in the follow_page_mask() code
path. Hence it doesn't make sense to have gup_fast_pgd_leaf() only in the
gup_fast code path.
Remove gup_fast_pgd_leaf() and clean up the relevant code.
Link: https://lkml.kernel.org/r/20250410035717.473207-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Yanjun.Zhu <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/gup: Minor fix, cleanup and improvements", v4.
These were made from code inspection in mm/gup.c.
This patch (of 3):
In __get_user_pages(), the page table is traversed and a reference is
taken to the page the given user address corresponds to if FOLL_GET or
FOLL_PIN is set. However, setting both FOLL_GET and FOLL_PIN is not
supported. Even though this check needs to be done, it should be done
earlier, rather than waiting until follow_page_pte() is entered and fails.
Furthermore, this check has already been done in is_valid_gup_args(), and
all external users of __get_user_pages() call is_valid_gup_args() to catch
the illegal setting. We don't need to worry about internal users of
__get_user_pages() because the gup_flags are set correctly by MM code.
Hence remove the check in follow_page_pte(), and add a VM_WARN_ON_ONCE()
to catch the possible exceptional setting just in case.
Also change the VM_BUG_ON to VM_WARN_ON_ONCE() for the check (!!pages
!= !!(gup_flags & (FOLL_GET | FOLL_PIN))), because the check has already
been done in is_valid_gup_args() for external users of __get_user_pages().
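Roughly, the resulting checks look like this (a sketch, not the exact
diff):

/* follow_page_pte(): the error return is gone; only a sanity warning
 * remains, since is_valid_gup_args() already rejects this combination
 * for external callers. */
VM_WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == (FOLL_GET | FOLL_PIN));

/* __get_user_pages(): the old VM_BUG_ON becomes a warning as well. */
VM_WARN_ON_ONCE(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));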
Link: https://lkml.kernel.org/r/20250410035717.473207-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20250410035717.473207-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Yanjun.Zhu <yanjun.zhu@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
alloc_buddy_hugetlb_folio() allocates an rmappable folio, then strips the
rmappable part and freezes it. We can simplify all that by allocating
frozen pages directly.
Link: https://lkml.kernel.org/r/20250411132359.312708-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Switch from atomic_long_add_return() to its relaxed version.
We do not need a full memory barrier or any memory ordering when
increasing the "vmap_lazy_nr" variable; all we need is for the update to
be atomic. This is what atomic_long_add_return_relaxed() guarantees.
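In C terms the change is simply the following (sketch):

/* Before: implies full ordering, hence ldaddal on arm64. */
nr_lazy = atomic_long_add_return(nr, &vmap_lazy_nr);

/* After: plain atomic add with no ordering, hence ldadd on arm64. */
nr_lazy = atomic_long_add_return_relaxed(nr, &vmap_lazy_nr);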
AARCH64:
<snip>
Default:
40ec: d34cfe94 lsr x20, x20, #12
40f0: 14000044 b 4200 <free_vmap_area_noflush+0x19c>
40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc>
40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area>
40fc: 91000000 add x0, x0, #0x0
4100: f8f40016 ldaddal x20, x22, [x0]
4104: 8b160296 add x22, x20, x22
Relaxed:
40ec: d34cfe94 lsr x20, x20, #12
40f0: 14000044 b 4200 <free_vmap_area_noflush+0x19c>
40f4: 94000000 bl 0 <__sanitizer_cov_trace_pc>
40f8: 90000000 adrp x0, 0 <__traceiter_alloc_vmap_area>
40fc: 91000000 add x0, x0, #0x0
4100: f8340016 ldadd x20, x22, [x0]
4104: 8b160296 add x22, x20, x22
<snip>
Link: https://lkml.kernel.org/r/20250415112646.113091-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Before trying to allocate a page, gather_surplus_pages() sets up a
nodemask for the nodes we can allocate from, but instead of passing the
nodemask down the road to the page allocator, it iterates over the nodes
within that nodemask right there, meaning that the page allocator will
receive a preferred_nid and a null nodemask.
This is a problem when using a memory policy, because the page allocator
might end up falling back to a node that is not represented in the
policy.
Avoid that by passing the nodemask directly to the page allocator, so it
can filter out fallback nodes that are not part of the nodemask.
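Schematically, the change looks like this (a simplified sketch;
alloc_surplus_folio() and alloc_nodemask are illustrative names, not the
exact hugetlb code):

/* Before: iterate the nodemask by hand; the allocator sees a NULL mask
 * and may fall back to any node. */
for_each_node_mask(node, alloc_nodemask) {
        folio = alloc_surplus_folio(h, gfp_mask, node, NULL);
        if (folio)
                break;
}

/* After: hand the nodemask to the page allocator so that fallback nodes
 * are filtered against it. */
folio = alloc_surplus_folio(h, gfp_mask, numa_node_id(), &alloc_nodemask);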
Link: https://lkml.kernel.org/r/20250415121503.376811-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON has dropped debugfs support; therefore, remove these unused scripts.
Link: https://lkml.kernel.org/r/20250411024332.1373861-1-enze.li@linux.dev
Fixes: 5ec4333b1967 ("mm/damon: remove DAMON debugfs interface")
Signed-off-by: Enze Li <lienze@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently the kernel maintains the stats updates per-memcg, which is
needed to implement the stats flushing threshold. On the update side, the
update is added to the per-cpu per-memcg update counter of the given memcg
and all of its ancestors. However, when the given memcg has passed the
flushing threshold, all of its ancestors should have passed the threshold
as well. There is no need to traverse up the memcg tree to maintain the
stats updates.
A perf profile collected from our fleet shows that memcg_rstat_updated is
one of the most expensive memcg functions, i.e. a lot of cumulative CPU
time is being spent in it. So even small micro-optimizations matter a
lot. This patch was microbenchmarked with multiple instances of netperf
on a single machine with a locally running netserver, and we see a couple
of percent of improvement.
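Conceptually (a much simplified sketch; note_pending_update() is an
illustrative stand-in for the real per-cpu bookkeeping):

/* Before: every stat update walks up the memcg tree. */
for (; memcg; memcg = parent_mem_cgroup(memcg))
        note_pending_update(memcg, abs(val));

/* After: only the given memcg is updated. If it crosses the flushing
 * threshold, its ancestors must have crossed it too, so the upward
 * traversal is unnecessary. */
note_pending_update(memcg, abs(val));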
Link: https://lkml.kernel.org/r/20250410025752.92159-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
hugetlb_reparenting_test.sh
During cleanup, the value of /proc/sys/vm/nr_hugepages is currently being
set to 0. At the end of the test, if all tests pass, the original
nr_hugepages value is restored. However, if any test fails, it remains
set to 0.
With this patch, we ensure that the original nr_hugepages value is
restored during cleanup, regardless of whether the test passes or fails.
Link: https://lkml.kernel.org/r/20250410100748.2310-1-donettom@linux.ibm.com
Fixes: 29750f71a9b4 ("hugetlb_cgroup: add hugetlb_cgroup reservation tests")
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Cc: Li Wang <liwang@redhat.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Move the unlikely case where mas->store_type is invalid to be the last
evaluated case and put likelier cases higher up.
Link: https://lkml.kernel.org/r/20250410191446.2474640-7-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Suggested-by: Liam R. Howlett <liam.howlett@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In order to support rebalancing and spanning stores using less than the
worst case number of nodes, we need to track more than just the vacant
height. Using only vacant height to reduce the worst case maple node
allocation count can lead to a shortcoming of nodes in the following
scenarios.
For rebalancing writes, when a leaf node becomes insufficient, it may be
combined with a sibling into a single node. This means that the parent
node which has entries for these children will lose one entry. If this
parent node was just meeting the minimum number of entries, losing one
entry will now cause it to become insufficient. This leads to a cascading
rebalancing operation at different levels and can lead to more node
allocations than simply using the vacant height would suggest.
For spanning writes, a similar situation occurs. At the location at which
a spanning write is detected, the number of ancestor nodes may similarly
need to be rebalanced into a smaller number of nodes, and the same
cascading situation can occur.
To use less than the full height of the tree for the number of
allocations, we also need to track the height at which a non-leaf node
cannot become insufficient. This means even if a rebalance occurs to a
child of this node, it currently has enough entries that it can lose one
without any further action. This field is stored in the maple write state
as sufficient height. In mas_prealloc_calc() when figuring out how many
nodes to allocate, we check if the vacant node is lower in the tree than a
sufficient node (has a larger value). If it is, we cannot use the vacant
height and must use the difference in the height and sufficient height as
the basis for the number of nodes needed.
An off-by-one bug was also discovered in mast_overflow(), where it was
using >= rather than >. This caused extra iterations of the
mas_spanning_rebalance() loop and led to unneeded allocations. A test is
also added to check the number of allocations is correct.
Link: https://lkml.kernel.org/r/20250410191446.2474640-6-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This allows support for using the vacant height to calculate the worst
case number of nodes needed for a wr_rebalance operation.
mas_spanning_rebalance() was seen to perform unnecessary node allocations.
We can reduce allocations by breaking early during the rebalancing loop
once we realize that we have ascended to a common ancestor.
Link: https://lkml.kernel.org/r/20250410191446.2474640-5-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Suggested-by: Liam Howlett <liam.howlett@oracle.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In order to determine the store type for a maple tree operation, a walk of
the tree is done through mas_wr_walk(). This function descends the tree
until a spanning write is detected or we reach a leaf node. While
descending, keep track of the height at which we encounter a node with
available space. This is done by checking if mas->end is less than the
number of slots a given node type can fit.
Now that the height of the vacant node is tracked, we can use the
difference between the height of the tree and the height of the vacant
node to know how many levels we will have to propagate creating new nodes.
Update mas_prealloc_calc() to consider the vacant height and reduce the
number of worst-case allocations.
Rebalancing and spanning stores are not supported and fall back to using
the full height of the tree for allocations.
Update preallocation testing assertions to take into account vacant
height.
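A very rough sketch of the idea (field and variable names are
illustrative, not necessarily the ones used in the series):

/* While descending in mas_wr_walk(): remember the lowest level that
 * still has a free slot. */
if (mas->end < node_slot_count)                 /* node has a vacancy */
        wr_mas->vacant_height = mas->depth + 1;

/* In mas_prealloc_calc(): allocate only for the levels between the leaf
 * and the vacant node, instead of the full tree height. */
levels = height - wr_mas->vacant_height;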
Link: https://lkml.kernel.org/r/20250410191446.2474640-4-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
For the maple tree, the root node is defined to have a depth of 0 and a
height of 1. Each level down from the root, these values are incremented
by 1. Various code paths define a root with depth 1, which is
inconsistent with the definition. Modify the code to be consistent with
this definition.
In mas_spanning_rebalance(), l_mas.depth was being used to track the
height based on the number of iterations done in the main loop. This
information was then used in mas_put_in_tree() to set the height. Rather
than overload the l_mas.depth field to track height, simply keep track of
height in the local variable new_height and directly pass this to
mas_wmb_replace() which will be passed into mas_put_in_tree(). This
allows us to remove writes to l_mas.depth.
Link: https://lkml.kernel.org/r/20250410191446.2474640-3-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Track node vacancy to reduce worst case allocation counts", v5.
================ overview ========================
Currently, the maple tree preallocates the worst case number of nodes for
given store type by taking into account the whole height of the tree.
This comes from a worst case scenario of every node in the tree being full
and having to propagate node allocation upwards until we reach the root of
the tree. This can be optimized if there are vacancies in nodes that are
at a lower depth than the root node. This series implements tracking the
level at which there is a vacant node so we only need to allocate until
this level is reached, rather than always using the full height of the
tree. The ma_wr_state struct is modified to add a field which keeps track
of the vacant height and is updated during walks of the tree. This value
is then read in mas_prealloc_calc() when we decide how many nodes to
allocate.
For rebalancing and spanning stores, we also need to track the lowest
height at which a node has 1 more entry than the minimum sufficient number
of entries. This is because rebalancing can cause a parent node to become
insufficient which results in further node allocations. In this case, we
need to use the sufficient height as the worst case rather than the vacant
height.
patch 1-2: preparatory patches
patch 3: implement vacant height tracking + update the tests
patch 4: support vacant height tracking for rebalancing writes
patch 5: implement sufficient height tracking
patch 6: reorder switch case statements
================ results =========================
Bpftrace was used to profile the allocation path for requesting new maple
nodes while running stress-ng mmap for 120s. The histograms below
represent requests to kmem_cache_alloc_bulk() and show the count argument,
i.e. how many maple nodes the caller is requesting from
kmem_cache_alloc_bulk().
command: stress-ng --mmap 4 --timeout 120
mm-unstable
@bulk_alloc_req:
[3, 4) 4 | |
[4, 5) 54170 |@ |
[5, 6) 0 | |
[6, 7) 893057 |@@@@@@@@@@@@@@@@@@@@ |
[7, 8) 4 | |
[8, 9) 2230287 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[9, 10) 55811 |@ |
[10, 11) 77834 |@ |
[11, 12) 0 | |
[12, 13) 1368684 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[13, 14) 0 | |
[14, 15) 0 | |
[15, 16) 367197 |@@@@@@@@ |
@maple_node_total: 46,630,160
@total_vmas: 46184591
mm-unstable + this series
@bulk_alloc_req:
[2, 3) 198 | |
[3, 4) 4 | |
[4, 5) 43 | |
[5, 6) 0 | |
[6, 7) 1069503 |@@@@@@@@@@@@@@@@@@@@@ |
[7, 8) 4 | |
[8, 9) 2597268 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[9, 10) 472191 |@@@@@@@@@ |
[10, 11) 191904 |@@@ |
[11, 12) 0 | |
[12, 13) 247316 |@@@@ |
[13, 14) 0 | |
[14, 15) 0 | |
[15, 16) 98769 |@ |
@maple_node_total: 37,813,856
@total_vmas: 43493287
This represents a ~19% reduction in the number of bulk maple nodes allocated.
For more reproducible results, a histogram of the return value of
mas_prealloc_calc() is displayed while running the maple tree tests, which
have a deterministic store pattern:
mas_prealloc_calc() return value mm-unstable
1 : (12068)
3 : (11836)
5 : ***** (271192)
7 : ************************************************** (2329329)
9 : *********** (534186)
10 : (435)
11 : *************** (704306)
13 : ******** (409781)
mas_prealloc_calc() return value mm-unstable + this series
1 : (12070)
3 : ************************************************** (3548777)
5 : ******** (633458)
7 : (65081)
9 : (11224)
10 : (341)
11 : (2973)
13 : (68)
do_mmap latency was also measured for regressions:
command: stress-ng --mmap 4 --timeout 120
mm-unstable:
avg = 7162 nsecs, total: 16101821292 nsecs, count: 2248034
mm-unstable + this series:
avg = 6689 nsecs, total: 15135391764 nsecs, count: 2262726
stress-ng --mmap 4 --timeout 120
with vacant_height:
stress-ng: info: [257] 21526312 Maple Tree Read 0.176 M/sec
stress-ng: info: [257] 339979348 Maple Tree Write 2.774 M/sec
without vacant_height:
stress-ng: info: [8228] 20968900 Maple Tree Read 0.171 M/sec
stress-ng: info: [8228] 312214648 Maple Tree Write 2.547 M/sec
This represents an increase of ~3% read throughput and ~9% increase in
write throughput.
This patch (of 6):
In a subsequent patch, mas_prealloc_calc() will need to access fields only
in the ma_wr_state. Convert the function to take in a ma_wr_state and
modify all callers. There is no functional change.
Link: https://lkml.kernel.org/r/20250410191446.2474640-1-sidhartha.kumar@oracle.com
Link: https://lkml.kernel.org/r/20250410191446.2474640-2-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
MADV_DONTNEED[_LOCKED] handling for [process_]madvise() flushes the tlb
for each vma of each address range. Update the logic to do the tlb
flushes in a batched way. Initialize an mmu_gather object from
do_madvise() and vector_madvise(), which are the entry level functions for
madvise() and process_madvise(), respectively, and pass those objects to
the function for the per-vma work via the madvise_behavior struct. Make
the per-vma logic not flush the tlb on its own but just save the tlb
entries to the received mmu_gather object. For this internal logic
change, make zap_page_range_single_batched() non-static and use it
directly from madvise_dontneed_single_vma(). Finally, the entry level
functions flush the tlb entries that were gathered for the entire user
request, at once.
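The overall shape at the entry level is roughly (a sketch; locking and
error handling are omitted, and the 'tlb' field name follows this series'
description rather than a verified definition):

struct mmu_gather tlb;

tlb_gather_mmu(&tlb, mm);               /* one gather object per request */
madv_behavior.tlb = &tlb;               /* handed down to the per-vma work */
/* ... per-vma handlers only record entries into the gather object ... */
tlb_finish_mmu(&tlb);                   /* single flush for the whole request */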
Link: https://lkml.kernel.org/r/20250410000022.1901-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Some zap_page_range_single() callers, such as [process_]madvise() with
MADV_DONTNEED[_LOCKED], cannot batch tlb flushes because
zap_page_range_single() flushes the tlb for each invocation. Split out
the body of zap_page_range_single(), except for the mmu_gather object
initialization and the flushing of the gathered tlb entries, for such
batched tlb flushing usage.
To avoid hugetlb page allocation failures from concurrent page faults,
though, the tlb flush should be done before the hugetlb faults unlocking.
Do the flush and the unlock inside the split out function, in that order,
for the hugetlb vma case. Refer to commit 2820b0f09be9 ("hugetlbfs: close
race between MADV_DONTNEED and page fault") for more details about the
concurrent faults' page allocation failure problem.
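A sketch of the resulting split (simplified; the exact parameter list may
differ):

void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr,
                           unsigned long size, struct zap_details *details)
{
        struct mmu_gather tlb;

        tlb_gather_mmu(&tlb, vma->vm_mm);
        zap_page_range_single_batched(&tlb, vma, addr, size, details);
        tlb_finish_mmu(&tlb);           /* flush for non-batched callers */
}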
Link: https://lkml.kernel.org/r/20250410000022.1901-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
MADV_FREE handling for [process_]madvise() flushes the tlb for each vma of
each address range. Update the logic to do the tlb flushes in a batched
way. Initialize an mmu_gather object from do_madvise() and
vector_madvise(), which are the entry level functions for madvise() and
process_madvise(), respectively, and pass those objects to the function
for the per-vma work via the madvise_behavior struct. Make the per-vma
logic not flush the tlb on its own but just save the tlb entries to the
received mmu_gather object. Finally, the entry level functions flush the
tlb entries that were gathered for the entire user request, at once.
Link: https://lkml.kernel.org/r/20250410000022.1901-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED and
MADV_FREE", v3.
When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or MADV_FREE
with multiple address ranges, tlb flushes happen for each of the given
address ranges. Because such tlb flushes are for the same process, doing
those in a batch is more efficient while still being safe. Modify
process_madvise() entry level code path to do such batched tlb flushes,
while the internal unmap logic do only gathering of the tlb entries to
flush.
In more detail, modify the entry functions to initialize an mmu_gather
object and pass it to the internal logic. And make the internal logic do
only gathering of the tlb entries to flush into the received mmu_gather
object. After all internal function calls are done, the entry functions
flush the gathered tlb entries at once.
Because process_madvise() and madvise() share the internal unmap logic,
make the same change to the madvise() entry code as well, to keep the code
consistent and cleaner. This is only for keeping the code clean and
shouldn't degrade madvise(). It could even provide a potential tlb flush
reduction benefit for a case where there are multiple vmas for the given
address range. Since that is only a side effect of the effort to keep the
code clean, we don't measure it separately.
Similar optimizations might be applicable to other madvise behavior such
as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope of this
patch series, though.
Patches Sequence
================
The first patch defines a new data structure for managing the information
that is required for batched tlb flushes (mmu_gather and behavior), and
updates the code paths for the MADV_DONTNEED[_LOCKED] and MADV_FREE
handling internal logic to receive it.
The second patch batches tlb flushes for MADV_FREE handling for both
madvise() and process_madvise().
Remaining two patches are for MADV_DONTNEED[_LOCKED] tlb flushes batching.
The third patch splits zap_page_range_single() for batching of
MADV_DONTNEED[_LOCKED] handling. The fourth patch batches tlb flushes for
the hint using the sub-logic that the third patch split out, and the
helpers for batched tlb flushes that introduced for the MADV_FREE case, by
the second patch.
Test Results
============
I measured the latency to apply MADV_DONTNEED advice to 256 MiB of memory
using multiple process_madvise() calls. I apply the advice at 4 KiB
region granularity, but with a varying batch size per process_madvise()
call (vlen) from 1 to 1024. The source code for the measurement is
available at GitHub[1]. To reduce measurement errors, I did the
measurement five times.
The measurement results are as below. 'sz_batch' column shows the batch
size of process_madvise() calls. 'Before' and 'After' columns show the
average of latencies in nanoseconds that measured five times on kernels
that built without and with the tlb flushes batching of this series
(patches 3 and 4), respectively. For the baseline, mm-new tree of
2025-04-09[2] has been used, after reverting the second version of this
patch series and adding a temporary fix for !CONFIG_DEBUG_VM build
failure[3]. 'B-stdev' and 'A-stdev' columns show ratios of latency
measurements standard deviation to average in percent for 'Before' and
'After', respectively. 'Latency_reduction' shows the reduction of the
latency that the 'After' has achieved compared to 'Before', in percent.
Higher 'Latency_reduction' values mean more efficiency improvements.
sz_batch Before B-stdev After A-stdev Latency_reduction
1 146386348 2.78 111327360.6 3.13 23.95
2 108222130 1.54 72131173.6 2.39 33.35
4 93617846.8 2.76 51859294.4 2.50 44.61
8 80555150.4 2.38 44328790 1.58 44.97
16 77272777 1.62 37489433.2 1.16 51.48
32 76478465.2 2.75 33570506 3.48 56.10
64 75810266.6 1.15 27037652.6 1.61 64.34
128 73222748 3.86 25517629.4 3.30 65.15
256 72534970.8 2.31 25002180.4 0.94 65.53
512 71809392 5.12 24152285.4 2.41 66.37
1024 73281170.2 4.53 24183615 2.09 67.00
Unexpectedly the latency has reduced (improved) even with batch size one.
I think some compiler optimizations have affected that, as was also
observed with the first version of this patch series.
So, please focus on the proportion between the improvement and the batch
size. As expected, tlb flush batching provides a latency reduction that
is proportional to the batch size. The efficiency gain ranges from about
33 percent with batch size 2 up to 67 percent with batch size 1,024.
Please note that this is a very simple microbenchmark, so the real
efficiency gain on real workloads could be very different.
This patch (of 4):
To implement batched tlb flushes for MADV_DONTNEED[_LOCKED] and MADV_FREE,
an mmu_gather object needs to be passed to the internal logic, in addition
to the behavior integer. Using a struct makes this easy without
increasing the number of parameters of all code paths towards the internal
logic. Define a struct for the purpose and use it on the code path that
starts from madvise_do_behavior() and ends at madvise_dontneed_free().
Note that this changes madvise_walk_vmas() visitor type signature, too.
Specifically, it changes its 'arg' type from 'unsigned long' to the new
struct pointer.
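For illustration, the struct is conceptually along these lines (a sketch;
the actual field names in the series may differ):

struct madvise_behavior {
        int behavior;                   /* MADV_DONTNEED, MADV_FREE, ... */
        struct mmu_gather *tlb;         /* gather object owned by the
                                         * entry level caller */
};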
Link: https://lkml.kernel.org/r/20250410000022.1901-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250410000022.1901-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <howlett@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When investigating performance issues during file folio unmap, I noticed
some behavioral differences in handling non-PMD-sized folios and PMD-sized
folios. For non-PMD-sized file folios, it will call folio_mark_accessed()
to mark the folio as having seen activity, but this is not done for
PMD-sized folios.
This might not cause obvious issues, but a potential problem is that it
might lead to reclaim of hot file folios under memory pressure, as quoted
from Johannes:
: Sometimes file contents are only accessed through relatively short-lived
: mappings. But they can nevertheless be accessed a lot and be hot. It's
: important to not lose that information on unmap, and end up kicking out a
: frequently used cache page.
Therefore, we should also add folio_mark_accessed() for PMD-sized file
folios when unmapping.
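The idea, as a sketch (not the literal diff):

/* When tearing down a PMD mapping of a file-backed folio, preserve the
 * access information just like the PTE path already does. */
if (pmd_young(orig_pmd))
        folio_mark_accessed(folio);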
[baolin.wang@linux.alibaba.com: add comment]
Link: https://lkml.kernel.org/r/23fdc11d-e983-4627-89a8-79e9ecf9a45a@linux.alibaba.com
Link: https://lkml.kernel.org/r/fc117f60d7b686f87067f36a0ef7cdbc3a78109c.1744190345.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Prior to the recently applied commit that permits this merge,
mprotect()'ing a faulted VMA, adjacent to an unfaulted VMA, such that the
two share characteristics would fail to merge due to what appear to be
unintended consequences of commit 965f55dea0e3 ("mmap: avoid merging
cloned VMAs").
Now that we have fixed this bug, assert that we can indeed merge anonymous
VMAs this way.
Also assert that forked source/target VMAs are equally rejected.
Previously, all empty-target anon merges with one VMA faulted and the
other unfaulted would be incorrectly rejected; now we ensure that unforked
VMAs merge, but forked ones do not.
Additionally, add the new test file to the MEMORY MAPPING section in
MAINTAINERS, as these tests are explicitly memory mapping related.
Link: https://lkml.kernel.org/r/2b69330274a3b71721f7042c5eabe91143934415.1744104124.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The PROCMAP_QUERY ioctl() is very useful - it allows for binary access to
/proc/$pid/[s]maps data and thus convenient lookup of data contained
there.
This patch exposes this for convenient use by mm self tests so the state
of VMAs can easily be queried.
Link: https://lkml.kernel.org/r/ce83d877093d1fc594762cf4b82f0c27963030ee.1744104124.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "fix incorrectly disallowed anonymous VMA merges", v2.
It appears that we have been incorrectly rejecting merge cases for 15
years, apparently by mistake.
Imagine a range of anonymous mapped memory divided into two VMAs like
this, with incompatible protection bits:
     RW          RWX
  unfaulted    faulted
|-----------|-----------|
|   prev    |    vma    |
|-----------|-----------|
        mprotect(RW)
Now imagine mprotect()'ing vma so it is RW. Though it appears as if it
should merge, it does not.
Neither does this case, again mprotect()'ing vma RW:
    RWX          RW
  faulted     unfaulted
|-----------|-----------|
|    vma    |   next    |
|-----------|-----------|
        mprotect(RW)
Nor:
     RW          RWX          RW
  unfaulted    faulted     unfaulted
|-----------|-----------|-----------|
|   prev    |    vma    |   next    |
|-----------|-----------|-----------|
              mprotect(RW)
What's going on here?
In commit 5beb49305251 ("mm: change anon_vma linking to fix multi-process
server scalability issue"), from 2010, Rik van Riel took careful care to
account for these cases - commenting that '[this is] easily overlooked:
when mprotect shifts the boundary, make sure the expanding vma has
anon_vma set if the shrinking vma had, to cover any anon pages imported.'
However, commit 965f55dea0e3 ("mmap: avoid merging cloned VMAs")
introduced a little over a year later, appears to have accidentally
disallowed this.
By adjusting the is_mergeable_anon_vma() function to avoid lock contention
across large trees of forked anon_vma's, this commit wrongly assumed the
VMA being checked (the ostensible merge 'target') should be faulted, that
is, have an anon_vma, and thus an anon_vma_chain list established, but
only of length 1.
This appears to have been unintentional, as disallowing empty target VMAs
like this across the board makes no sense.
We already have logic that accounts for this case, the same logic Rik
introduced in 2010, now via dup_anon_vma() (and ultimately
anon_vma_clone()), so there is no problem permitting this.
This series fixes this mistake and also ensures that scalability concerns
remain addressed by explicitly checking that whatever VMA is being merged
has not been forked.
A full set of self tests which reproduce the issue are provided, as well
as updating userland VMA tests to assert this behaviour.
The self tests additionally assert scalability concerns are addressed.
This patch (of 3):
anon_vma_chains were introduced by Rik van Riel in commit 5beb49305251
("mm: change anon_vma linking to fix multi-process server scalability
issue").
This patch was introduced in March 2010. As part of this change, careful
attention was paid to the instance of mprotect() causing a VMA merge, with
one faulted (i.e. having anon_vma set) and another not:
/*
 * Easily overlooked: when mprotect shifts the boundary,
 * make sure the expanding vma has anon_vma set if the
 * shrinking vma had, to cover any anon pages imported.
 */
In the modern VMA code, this is handled in dup_anon_vma() (and ultimately
anon_vma_clone()).
This case is one of the three configurations of adjacent VMA anon_vma
state that we might encounter on merge (where dst is the VMA which will be
merged into and src the one being merged into dst):
1. dst->anon_vma, src->anon_vma - These must be equal, no-op.
2. dst->anon_vma, !src->anon_vma - We simply use dst->anon_vma, no-op.
3. !dst->anon_vma, src->anon_vma - The case in question here.
Case 3 is the instance addressed here - we duplicate the AVC connections
from src and place them into dst.
However, in practice, we very often do NOT do this.
This appears to be due to an inadvertent consequence of the change
introduced by commit 965f55dea0e3 ("mmap: avoid merging cloned VMAs"),
introduced in May 2011.
This implies that this merge case was functional only for a little over a
year, and has since been broken for ~15 years.
Here, lock scalability concerns led to us restricting anonymous merges
only to those VMAs with 1 entry in their vma->anon_vma_chain, that is, a
VMA that is not connected to any parent process's anon_vma.
The mergeability test looks like this:
static inline bool is_mergeable_anon_vma(struct anon_vma *anon_vma1,
                struct anon_vma *anon_vma2, struct vm_area_struct *vma)
{
        if ((!anon_vma1 || !anon_vma2) && (!vma ||
             !vma->anon_vma || list_is_singular(&vma->anon_vma_chain)))
                return true;

        return anon_vma1 == anon_vma2;
}
However, we have a problem here - typically the vma passed here is the
destination VMA.
For instance in vma_merge_existing_range() we invoke:
can_vma_merge_left()
-> [ check that there is an immediately adjacent prior VMA ]
-> can_vma_merge_after()
   -> is_mergeable_vma() for general attribute check
   -> is_mergeable_anon_vma([ proposed anon_vma ], prev->anon_vma, prev)
So if we were considering a target unfaulted 'prev':
  unfaulted    faulted
|-----------|-----------|
|   prev    |    vma    |
|-----------|-----------|
This would call is_mergeable_anon_vma(NULL, vma->anon_vma, prev).
The list_is_singular() check for vma->anon_vma_chain, an empty list on
fault, would cause this merge to _fail_ even though all else indicates a
merge.
Equally a simple merge into a next VMA would hit the same problem:
   faulted    unfaulted
|-----------|-----------|
|    vma    |   next    |
|-----------|-----------|
can_vma_merge_right()
-> [ check that there is an immediately adjacent succeeding VMA ]
-> can_vma_merge_before()
   -> is_mergeable_vma() for general attribute check
   -> is_mergeable_anon_vma([ proposed anon_vma ], next->anon_vma, next)
For a 3-way merge, we'd also hit the same problem if it was configured like
this for instance:
  unfaulted    faulted    unfaulted
|-----------|-----------|-----------|
|   prev    |    vma    |   next    |
|-----------|-----------|-----------|
As we'd call can_vma_merge_left() for prev, and can_vma_merge_right() for
next, both of which would fail.
vma_merge_new_range() (and relatedly, vma_expand()) are not impacted, as
the new VMA would never already be faulted (it is a proposed new range).
Because we already handle each of the aforementioned merge cases, and can
absolutely therefore deal with an existing VMA merge with !dst->anon_vma,
src->anon_vma, there is absolutely no reason to disallow this kind of
merge.
It seems that the intention of this patch is to ensure that, in the
instance of merging unfaulted VMAs with faulted ones, we never wish to do
so with those with multiple AVCs due to the fact that anon_vma locks are
held across both parent and child anon_vmas (actually, the 'root' parent
anon_vma's lock is used).
In fact, the original commit alludes to this - "find_mergeable_anon_vma()
already considers this case".
In find_mergeable_anon_vma() however, we check the anon_vma which will be
merged from, if it is set, then we check
list_is_singular(vma->anon_vma_chain).
So to match this logic, update is_mergeable_anon_vma() to perform this
scalability check on the VMA whose anon_vma we ultimately merge into.
This matches existing behaviour with forked VMAs, only we no longer
wrongly disallow ALL empty target merges.
So we both allow merge cases and ensure the scalability check is correctly
applied.
We may wish to revisit these lock scalability concerns at a later date and
ensure they are still valid.
Additionally, correct the userland VMA tests, which previously were
mistakenly not asserting these cases correctly, so that they now do, and
ensure vmg->anon_vma state is always consistent, to account for the newly
introduced asserts.
Link: https://lkml.kernel.org/r/cover.1744104124.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/18c756fc9eaf7ad082a710c91133b8346f8cd9a8.1744104124.git.lorenzo.stoakes@oracle.com
Fixes: 965f55dea0e3 ("mmap: avoid merging cloned VMAs")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We have introduced Rust bindings for core mm abstractions as part of this
series, so add an entry in MAINTAINERS to be explicit about who maintains
this.
Patches are anticipated to be taken through the mm tree as usual with
other mm code.
Link: https://rust-for-linux.com/rust-kernel-policy#how-is-rust-introduced-in-a-subsystem
Link: https://lore.kernel.org/all/33e64b12-aa07-4e78-933a-b07c37ff1d84@lucifer.local/
Link: https://lkml.kernel.org/r/20250408-vma-v16-9-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce a new type called `CurrentTask` that lets you perform various
operations that are only safe on the `current` task. Use the new type to
provide a way to access the current mm without incrementing its refcount.
With this change, you can write stuff such as
let vma = current!().mm().lock_vma_under_rcu(addr);
without incrementing any refcounts.
This replaces the existing abstractions for accessing the current pid
namespace. With the old approach, every field access to current involves
both a macro and an unsafe helper function. The new approach simplifies
that to a single safe function on the `CurrentTask` type. This makes it
less heavy-weight to add additional current accessors in the future.
That said, creating a `CurrentTask` type like the one in this patch
requires that we are careful to ensure that it cannot escape the current
task or otherwise access things after they are freed. To do this, I
declared that it cannot escape the current "task context" where I defined
a "task context" as essentially the region in which `current` remains
unchanged. So e.g., release_task() or begin_new_exec() would leave the
task context.
If a userspace thread returns to userspace and later makes another
syscall, then I consider the two syscalls to be different task contexts.
This allows values stored in that task to be modified between syscalls,
even if they're guaranteed to be immutable during a syscall.
Ensuring correctness of `CurrentTask` is slightly tricky if we also want
the ability to have a safe `kthread_use_mm()` implementation in Rust. To
support that safely, there are two patterns we need to ensure are safe:
// Case 1: current!() called inside the scope.
let mm;
kthread_use_mm(some_mm, || {
    mm = current!().mm();
});
drop(some_mm);
mm.do_something(); // UAF
and:
// Case 2: current!() called before the scope.
let mm;
let task = current!();
kthread_use_mm(some_mm, || {
    mm = task.mm();
});
drop(some_mm);
mm.do_something(); // UAF
The existing `current!()` abstraction already natively prevents the first
case: The `&CurrentTask` would be tied to the inner scope, so the
borrow-checker ensures that no reference derived from it can escape the
scope.
Fixing the second case is a bit more tricky. The solution is to
essentially pretend that the contents of the scope execute on an different
thread, which means that only thread-safe types can cross the boundary.
Since `CurrentTask` is marked `NotThreadSafe`, attempts to move it to
another thread will fail, and this includes our fake pretend thread
boundary.
This has the disadvantage that other types that aren't thread-safe for
reasons unrelated to `current` also cannot be moved across the
`kthread_use_mm()` boundary. I consider this an acceptable tradeoff.
Link: https://lkml.kernel.org/r/20250408-vma-v16-8-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add the ability to write a file_operations->mmap hook in Rust when using
the miscdevice abstraction. The `vma` argument to the `mmap` hook uses
the `VmaNew` type from the previous commit; this type provides the correct
set of operations for a file_operations->mmap hook.
Link: https://lkml.kernel.org/r/20250408-vma-v16-7-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Jann Horn <jannh@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This type will be used when setting up a new vma in an f_ops->mmap() hook.
Using a separate type from VmaRef allows us to have a separate set of
operations that you are only able to use during the mmap() hook. For
example, the VM_MIXEDMAP flag must not be changed after the initial setup
that happens during the f_ops->mmap() hook.
To avoid setting invalid flag values, the methods for clearing VM_MAYWRITE
and similar involve a check of VM_WRITE, and return an error if VM_WRITE
is set. Trying to use `try_clear_maywrite` without checking the return
value results in a compilation error because the `Result` type is marked
#[must_use].
For now, there's only a method for VM_MIXEDMAP and not VM_PFNMAP. When we
add a VM_PFNMAP method, we will need some way to prevent you from setting
both VM_MIXEDMAP and VM_PFNMAP on the same vma.
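The invariant being enforced can be expressed in C roughly as below (a
sketch only; the hook name is made up and this is not the binding code
itself):

  #include <linux/fs.h>
  #include <linux/mm.h>

  static int sample_mmap(struct file *file, struct vm_area_struct *vma)
  {
          /* Clearing VM_MAYWRITE only makes sense while VM_WRITE is clear. */
          if (vma->vm_flags & VM_WRITE)
                  return -EINVAL;         /* try_clear_maywrite() returns Err */
          vm_flags_clear(vma, VM_MAYWRITE);
          return 0;
  }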
Link: https://lkml.kernel.org/r/20250408-vma-v16-6-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Adds an MmWithUserAsync type that uses mmput_async when dropped but is
otherwise identical to MmWithUser. This has to be done using a separate
type because the thing we are changing is the destructor.
Rust Binder needs this to avoid a certain deadlock. See commit
9a9ab0d96362 ("binder: fix race between mmput() and do_exit()") for
details. It's also needed in the shrinker to avoid cleaning up the mm in
the shrinker's context.
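In C terms, the only difference between the two destructors is which put
is used (sketch; helper names made up):

  #include <linux/sched/mm.h>

  static void put_mm_now(struct mm_struct *mm)
  {
          mmput(mm);              /* may tear the mm down synchronously */
  }

  static void put_mm_deferred(struct mm_struct *mm)
  {
          mmput_async(mm);        /* teardown runs from a workqueue instead */
  }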
Link: https://lkml.kernel.org/r/20250408-vma-v16-5-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, the Binder driver always takes the mmap lock to make changes to
its vma. Because the mmap lock is global to the process, this can involve
significant contention. However, the kernel has a feature called per-vma
locks, which can significantly reduce contention; for example, a vma read
lock can often be taken even while another thread holds the mmap write
lock, as long as that particular vma has not been write-locked. This
matters because contention on the mmap lock has been a long-standing,
recurring challenge for the Binder driver.
This patch introduces support for using `lock_vma_under_rcu` from Rust.
The Rust Binder driver will be able to use this to reduce contention on
the mmap lock.
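The C pattern being wrapped is roughly the following (sketch; the helper
name is made up and CONFIG_PER_VMA_LOCK is assumed):

  #include <linux/mm.h>

  static bool touch_vma_fast(struct mm_struct *mm, unsigned long addr)
  {
          struct vm_area_struct *vma;

          vma = lock_vma_under_rcu(mm, addr);     /* vma read lock, or NULL */
          if (!vma)
                  return false;   /* caller falls back to mmap_read_lock() */
          /* ... operate on the vma without taking the mmap lock ... */
          vma_end_read(vma);
          return true;
  }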
Link: https://lkml.kernel.org/r/20250408-vma-v16-4-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The vm_insert_page method is only usable on vmas with the VM_MIXEDMAP
flag, so we introduce a new type to keep track of such vmas.
The approach used in this patch assumes that we will not need to encode
many flag combinations in the type. I don't think we need to encode more
than VM_MIXEDMAP and VM_PFNMAP as things are now. However, if that
becomes necessary, using generic parameters in a single type would scale
better as the number of flags increases.
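Roughly the C pattern the new type encodes (sketch; the helper name and
the page argument are illustrative, and the hook is assumed to run with
the mmap write lock held as usual for f_ops->mmap):

  #include <linux/fs.h>
  #include <linux/mm.h>

  static int sample_mmap_one_page(struct vm_area_struct *vma, struct page *page)
  {
          /* vm_insert_page() is only valid on a VM_MIXEDMAP vma. */
          vm_flags_set(vma, VM_MIXEDMAP);
          return vm_insert_page(vma, vma->vm_start, page);
  }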
Link: https://lkml.kernel.org/r/20250408-vma-v16-3-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This adds a type called VmaRef which is used when referencing a vma that
you have read access to. Here, read access means that you hold either the
mmap read lock or the vma read lock (or stronger).
Additionally, a vma_lookup method is added to the mmap read guard, which
enables you to obtain a &VmaRef in safe Rust code.
This patch only provides a way to take the mmap read lock; a follow-up
patch adds a way to take just the vma read lock.
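The underlying C pattern is the familiar one below (sketch; the helper
name is made up):

  #include <linux/mm.h>

  static bool vma_present(struct mm_struct *mm, unsigned long addr)
  {
          struct vm_area_struct *vma;
          bool present;

          mmap_read_lock(mm);
          vma = vma_lookup(mm, addr);     /* only valid while the lock is held */
          present = vma != NULL;
          mmap_read_unlock(mm);
          return present;
  }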
Link: https://lkml.kernel.org/r/20250408-vma-v16-2-d8b446e885d9@google.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Rust support for mm_struct, vm_area_struct, and mmap", v16.
This updates the vm_area_struct support to use the approach we discussed
at LPC where there are several different Rust wrappers for vm_area_struct
depending on the kind of access you have to the vma. Each case allows a
different set of operations on the vma.
This includes an MM MAINTAINERS entry as proposed by Lorenzo:
https://lore.kernel.org/all/33e64b12-aa07-4e78-933a-b07c37ff1d84@lucifer.local/
This patch (of 9):
These abstractions allow you to reference a `struct mm_struct` using both
mmgrab and mmget refcounts. This is done using two Rust types:
* Mm - represents an mm_struct where you don't know anything about the
value of mm_users.
* MmWithUser - represents an mm_struct where you know at compile time
that mm_users is non-zero.
This allows us to encode in the type system whether a method requires that
mm_users is non-zero or not. For instance, you can always call
`mmget_not_zero` but you can only call `mmap_read_lock` when mm_users is
non-zero.
The struct is called Mm to keep consistency with the C side.
The ability to obtain `current->mm` is added later in this series.
The mm module is defined to only exist when CONFIG_MMU is set. This
avoids various errors due to missing types and functions when CONFIG_MMU
is disabled. More fine-grained cfgs can be considered in the future. See
the thread at [1] for more info.
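For reference, the two C refcounts that the Mm/MmWithUser split models
(sketch; helper names are made up):

  #include <linux/mm.h>
  #include <linux/sched/mm.h>

  /* "Mm": mm_count keeps the mm_struct itself from being freed. */
  static void hold_struct_only(struct mm_struct *mm)
  {
          mmgrab(mm);
          /* ... may not touch the address space here ... */
          mmdrop(mm);
  }

  /* "MmWithUser": a non-zero mm_users also keeps the address space usable. */
  static void hold_address_space(struct mm_struct *mm)
  {
          if (!mmget_not_zero(mm))        /* fails once mm_users hit zero */
                  return;
          mmap_read_lock(mm);
          /* ... vma_lookup() etc. are now fair game ... */
          mmap_read_unlock(mm);
          mmput(mm);
  }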
Link: https://lkml.kernel.org/r/20250408-vma-v16-9-d8b446e885d9@google.com
Link: https://lkml.kernel.org/r/20250408-vma-v16-1-d8b446e885d9@google.com
Link: https://lore.kernel.org/all/202503091916.QousmtcY-lkp@intel.com/
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Constructors for PUD/P4D-level pgtables were recently introduced. They
should be called for all pgtables; make sure they are called for special
kernel mappings created by create_pgd_mapping() too.
While at it, also switch to using pagetable_alloc(), as is done in
alloc_{pte,pmd}_late().
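The late-allocation pattern then looks roughly like this (sketch; the
helper name is made up and the gfp flags are approximated):

  #include <linux/mm.h>
  #include <asm/pgalloc.h>

  static pud_t *alloc_pud_late_sketch(void)
  {
          struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL, 0);

          if (!ptdesc)
                  return NULL;
          pagetable_pud_ctor(ptdesc);     /* ctor now runs for PUD level too */
          return (pud_t *)ptdesc_address(ptdesc);
  }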
Link: https://lkml.kernel.org/r/20250408095222.860601-13-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Constructors for PUD/P4D-level pgtables were recently introduced. They
should be called for all pgtables; make sure they are called for special
kernel mappings created by __create_pgd_mapping() too.
Link: https://lkml.kernel.org/r/20250408095222.860601-12-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
pagetable_{pte,pmd}_ctor(mm, ptdesc) skip the ptlock initialisation if mm
is &init_mm. To avoid unnecessary overhead, it is therefore preferable to
pass the actual mm associated with the PTE/PMD.
Unfortunately, this proves challenging for alloc_{pte,pmd}_late() as the
associated mm is not available at the point where they are called - in
fact not even top-level functions like create_pgd_mapping() are passed the
mm. As a result they both call the ctor with NULL as mm; this is safe but
potentially wasteful.
This is not a new situation, but let's add a couple of comments to clarify
it.
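Concretely, the two kinds of call site look like this (sketch of the
mm-aware ctor usage described above):

  #include <linux/mm.h>

  /* Kernel pgtable with a known mm: ptlock initialisation is skipped. */
  static bool ctor_for_init_mm(struct ptdesc *ptdesc)
  {
          return pagetable_pte_ctor(&init_mm, ptdesc);
  }

  /* mm unknown at the call site: NULL is safe, but a ptlock is set up. */
  static bool ctor_with_unknown_mm(struct ptdesc *ptdesc)
  {
          return pagetable_pte_ctor(NULL, ptdesc);
  }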
Link: https://lkml.kernel.org/r/20250408095222.860601-11-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
TL;DR: always call the PTE/PMD ctor, passing the appropriate mm to skip
ptlock_init() if unneeded.
__create_pgd_mapping() is used for creating different kinds of mappings,
and may allocate page table pages if passed an allocator callback. There
are currently three such cases:
1. create_pgd_mapping(), which is used to create the EFI mapping
2. arch_add_memory()
3. map_entry_trampoline()
1. uses pgd_pgtable_alloc() as allocator callback, which calls the
PTE/PMD ctor, while 2. and 3. use __pgd_pgtable_alloc(), which does not.
The rationale is most likely that pgtables associated with init_mm do not
make use of split page table locks, and it is therefore unnecessary to
initialise them by calling the ctor. Case 2. operates on swapper_pg_dir,
so the allocated pgtables are clearly associated with init_mm; this is
arguably true for case 3. too (the trampoline mapping is never modified,
so ptlocks are irrelevant there anyway). Case 1. corresponds to efi_mm,
so ptlocks do need to be initialised in that case.
We are now moving towards calling the ctor for all page tables, even those
associated with init_mm. pagetable_{pte,pmd}_ctor() have become aware of
the associated mm so that the ptlock initialisation can be skipped for
init_mm. This patch therefore amends the allocator callbacks so that the
PTE/PMD ctor are always called, with an appropriate mm pointer to avoid
unnecessary ptlock overhead.
Modifying the prototype of the allocator callbacks to take the mm and
propagating that pointer all the way down would be pretty invasive.
Instead:
* __pgd_pgtable_alloc() (cases 2. and 3. above) is replaced with
pgd_pgtable_alloc_init_mm(), resulting in the ctors being called with
&init_mm. This is the main functional change in this patch; the ptlock
still isn't initialised, but other ctor actions (e.g.
accounting-related) are now carried out for those allocated pgtables.
* pgd_pgtable_alloc() (case 1. above) is replaced with
pgd_pgtable_alloc_special_mm(), resulting in the ctors being called with
NULL as mm. No functional change here; NULL essentially means "not
init_mm", and the ptlock is still initialised.
__pgd_pgtable_alloc() is now the common implementation of those two
helpers. While at it we switch it to using pagetable_alloc() like
standard pgtable allocator functions and remove the comment regarding ctor
calls (ctors are now always expected to be called).
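In outline, the resulting helpers have the following shape (sketch; the
return type, level parameter and gfp flags are approximated, and the
PTE-level ctor stands in for the real per-level dispatch):

  #include <linux/mm.h>
  #include <asm/pgalloc.h>

  static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, int level)
  {
          struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL, 0);

          BUG_ON(!ptdesc);
          /* PTE level shown; the real helper dispatches on the level. */
          BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
          return __pa(ptdesc_address(ptdesc));
  }

  static phys_addr_t pgd_pgtable_alloc_init_mm(int level)
  {
          /* ctor runs, ptlock initialisation skipped for init_mm */
          return __pgd_pgtable_alloc(&init_mm, level);
  }

  static phys_addr_t pgd_pgtable_alloc_special_mm(int level)
  {
          /* NULL means "not init_mm": ctor runs and the ptlock is set up */
          return __pgd_pgtable_alloc(NULL, level);
  }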
Link: https://lkml.kernel.org/r/20250408095222.860601-10-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 90292aca9854 ("arm64: mm: use appropriate ctors for page tables")
introduced pgtable ctor calls in pgd_pgtable_alloc(). To identify the
pgtable level and call the appropriate ctor, the *_SHIFT value associated
with the pgtable level is used. However, those values do not
unambiguously identify a level, because if a given level is folded, the
*_SHIFT value will be equal to that of the upper level (e.g. PMD_SHIFT ==
PUD_SHIFT if PMD is folded).
As things stand, there is probably not much damage done by calling the
ctor for a different level, and ARCH_ENABLE_SPLIT_PMD_PTLOCK is only
selected if PMD isn't folded (so we don't needlessly initialise
pmd_ptlock). Still, this is pretty confusing, and it would get even more
confusing when adding ctor calls for the remaining levels.
Let's simplify all this by using an enum to identify the pgtable level
instead; this way folding becomes irrelevant. This is inspired by one of
the m68k pgtable allocators (arch/m68k/include/asm/motorola_pgalloc.h).
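The idea, with made-up enum and helper names and ctor signatures as used
elsewhere in this series, is roughly:

  #include <linux/mm.h>

  /* Made-up names: the level is explicit, so folding cannot confuse it. */
  enum pgtable_level { LEVEL_PTE, LEVEL_PMD, LEVEL_PUD, LEVEL_P4D };

  static void call_level_ctor(struct mm_struct *mm, struct ptdesc *ptdesc,
                              enum pgtable_level level)
  {
          switch (level) {
          case LEVEL_PTE:
                  BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
                  break;
          case LEVEL_PMD:
                  BUG_ON(!pagetable_pmd_ctor(mm, ptdesc));
                  break;
          case LEVEL_PUD:
                  pagetable_pud_ctor(ptdesc);
                  break;
          case LEVEL_P4D:
                  pagetable_p4d_ctor(ptdesc);
                  break;
          }
  }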
Link: https://lkml.kernel.org/r/20250408095222.860601-9-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Split page table locks are not used for pgtables associated with init_mm, at
any level. pte_alloc_kernel() does not call ptlock_init() as a result.
There is however no separate alloc/free functions for kernel PMDs, and
pmd_ptlock_init() is called unconditionally. When ALLOC_SPLIT_PTLOCKS is
true (e.g. 32-bit architectures or if CONFIG_PREEMPT_RT is selected),
this results in unnecessary dynamic memory allocation every time a kernel
PMD is allocated.
Now that pagetable_pmd_ctor() is passed the associated mm, we can easily
remove this overhead by skipping pmd_ptlock_init() if the pgtable is
associated with init_mm. No special-casing is needed on the dtor path, as
ptlock_free() is already called unconditionally for all levels.
(ptlock_free() is a no-op unless a ptlock was allocated for the given
PTP.)
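Sketch of the resulting ctor logic (not the exact mm/ code):

  #include <linux/mm.h>

  static bool pmd_ctor_sketch(struct mm_struct *mm, struct ptdesc *ptdesc)
  {
          /* init_mm never uses split PMD locks, so skip the allocation. */
          if (mm != &init_mm && !pmd_ptlock_init(ptdesc))
                  return false;
          /* ... common ctor work (page type, accounting) continues ... */
          return true;
  }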
Link: https://lkml.kernel.org/r/20250408095222.860601-8-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The generic implementation of pte_{alloc_one,free}_kernel now calls the
[cd]tor, without initialising the ptlock needlessly as
pagetable_pte_ctor() skips it for init_mm.
Align sparc64 with the generic implementation by ensuring
pagetable_pte_[cd]tor() are called for kernel PTEs. As a result the
kernel and user alloc/free functions have the same implementation, and
since pgtable_t is defined as pte_t *, we can have both call a common
helper.
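The shared shape is roughly the following (sketch; the helper name is
made up and the gfp flags are approximated):

  #include <linux/mm.h>
  #include <asm/pgalloc.h>

  static pte_t *__pte_alloc_common(struct mm_struct *mm)
  {
          struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL, 0);

          if (!ptdesc)
                  return NULL;
          if (!pagetable_pte_ctor(mm, ptdesc)) {
                  pagetable_free(ptdesc);
                  return NULL;
          }
          return (pte_t *)ptdesc_address(ptdesc);
  }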
Link: https://lkml.kernel.org/r/20250408095222.860601-7-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The generic implementation of pte_{alloc_one,free}_kernel now calls the
[cd]tor, without initialising the ptlock needlessly as
pagetable_pte_ctor() skips it for init_mm.
On powerpc, all functions related to PTE allocation are implemented by
common helpers, which are passed a boolean to differentiate user from
kernel pgtables. This patch aligns the powerpc implementation with the
generic one by calling pagetable_pte_[cd]tor() unconditionally in those
helpers.
Link: https://lkml.kernel.org/r/20250408095222.860601-6-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The generic implementation of pte_{alloc_one,free}_kernel now calls the
[cd]tor. Align the m68k/ColdFire implementation of those functions by
calling the [cd]tor explicitly.
Link: https://lkml.kernel.org/r/20250408095222.860601-5-kevin.brodsky@arm.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <x86@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|