path: root/mm
2025-03-16alloc_tag: uninline code gated by mem_alloc_profiling_key in slab allocatorSuren Baghdasaryan
When a sizable code section is protected by a disabled static key, that code gets into the instruction cache even though it's not executed and consumes the cache, increasing cache misses. This can be remedied by moving such code into a separate uninlined function.

On a Pixel6 phone, slab allocation profiling overhead measured with CONFIG_MEM_ALLOC_PROFILING=y and profiling disabled is:

             baseline  modified
Big core       3.31%     0.17%
Medium core    3.79%     0.57%
Little core    6.68%     1.28%

This improvement comes at the expense of the configuration when profiling gets enabled, since there is now an additional function call. The overhead from this additional call on Pixel6 is:

Big core       0.66%
Medium core    1.23%
Little core    2.42%

However this is negligible when compared with the overall overhead of the memory allocation profiling when it is enabled.

On x86 this patch does not make a noticeable difference because the overhead with mem_alloc_profiling_key disabled is much lower (under 1%) to start with, so any improvement is less visible and hard to distinguish from the noise. The overhead from the additional call when profiling is enabled is also within noise levels.

Link: https://lkml.kernel.org/r/20250201231803.2661189-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: David Wang <00107082@163.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Minchan Kim <minchan@google.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sourav Panda <souravpanda@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
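A hedged sketch of the pattern described above; the hook names with a leading "__" are illustrative assumptions, only the static key name comes from the text. The body gated by the normally disabled key moves into a noinline function so its instructions stay out of the hot path's i-cache:

	DECLARE_STATIC_KEY_FALSE(mem_alloc_profiling_key);

	/* Uninlined slow path: only reached when profiling is enabled. */
	static noinline void __profiling_slab_alloc_hook(struct kmem_cache *s,
							 void *object, gfp_t flags)
	{
		/* allocation-tag accounting would live here */
	}

	static __always_inline void profiling_slab_alloc_hook(struct kmem_cache *s,
							      void *object, gfp_t flags)
	{
		if (static_branch_unlikely(&mem_alloc_profiling_key))
			__profiling_slab_alloc_hook(s, object, flags);
	}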
2025-03-16mm: avoid extra mem_alloc_profiling_enabled() checksSuren Baghdasaryan
Refactor code to avoid extra mem_alloc_profiling_enabled() checks inside pgalloc_tag_get() function which is often called after that check was already done. Link: https://lkml.kernel.org/r/20250201231803.2661189-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: David Wang <00107082@163.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Minchan Kim <minchan@google.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
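A hedged sketch of the kind of refactoring meant here; the "__pgalloc_tag_get()" unchecked variant is an assumption for illustration. The enabled check stays in the wrapper, while callers that have already performed the check call the unchecked variant directly:

	/* Checked wrapper for callers that have not tested the key themselves. */
	static inline struct alloc_tag *pgalloc_tag_get(struct page *page)
	{
		if (!mem_alloc_profiling_enabled())
			return NULL;
		return __pgalloc_tag_get(page);	/* hypothetical unchecked variant */
	}

	/* Callers that already tested mem_alloc_profiling_enabled() use
	 * __pgalloc_tag_get(page) directly, avoiding the redundant check. */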
2025-03-16mm: completely abstract unnecessary adj_start calculationLorenzo Stoakes
The adj_start calculation has been a constant source of confusion in the VMA merge code.

There are two cases to consider, one where we adjust the start of the vmg->middle VMA (i.e. the vmg->__adjust_middle_start merge flag is set), in which case adj_start is calculated as:

(1) adj_start = vmg->end - vmg->middle->vm_start

And the case where we adjust the start of the vmg->next VMA (i.e. the vmg->__adjust_next_start merge flag is set), in which case adj_start is calculated as:

(2) adj_start = -(vmg->middle->vm_end - vmg->end)

We apply (1) thusly:

vmg->middle->vm_start = vmg->middle->vm_start + vmg->end - vmg->middle->vm_start

Which simplifies to:

vmg->middle->vm_start = vmg->end

Similarly, we apply (2) as:

vmg->next->vm_start = vmg->next->vm_start + -(vmg->middle->vm_end - vmg->end)

Noting that for these VMAs to be mergeable vmg->middle->vm_end == vmg->next->vm_start and so this simplifies to:

vmg->next->vm_start = vmg->next->vm_start + -(vmg->next->vm_start - vmg->end)

Which simplifies to:

vmg->next->vm_start = vmg->end

Therefore in each case, we simply need to adjust the start of the VMA to vmg->end (!) and can do away with this adj_start calculation. The only caveat is that we must ensure we update the vm_pgoff field correctly.

We therefore abstract this entire calculation to a new function vmg_adjust_set_range() which performs this calculation and sets the adjusted VMA's new range using the general vma_set_range() function.

We also must update vma_adjust_trans_huge() which expects the now-abstracted adj_start parameter. It turns out this is wholly unnecessary. In vma_adjust_trans_huge() the relevant code is:

	if (adjust_next > 0) {
		struct vm_area_struct *next = find_vma(vma->vm_mm, vma->vm_end);
		unsigned long nstart = next->vm_start;
		nstart += adjust_next;
		split_huge_pmd_if_needed(next, nstart);
	}

The only case where this is relevant is when vmg->__adjust_middle_start is specified (in which case adj_next would have been positive), i.e. the one in which the vma specified is vmg->prev and thus the sought 'next' VMA would be vmg->middle.

We can therefore eliminate the find_vma() invocation altogether and simply provide the vmg->middle VMA in this instance, or NULL otherwise.

Again we have an adj_next offset calculation:

next->vm_start + vmg->end - vmg->middle->vm_start

Where next == vmg->middle this simplifies to vmg->end as previously demonstrated. Therefore nstart is equal to vmg->end, which is already passed to vma_adjust_trans_huge() via the 'end' parameter and so this code (rather delightfully) simplifies to:

	if (next)
		split_huge_pmd_if_needed(next, end);

With these changes in place, it becomes silly for commit_merge() to return vmg->target, as it is always the same and threaded through vmg, so we finally change commit_merge() to return an error value once again.

This patch has no change in functional behaviour.

Link: https://lkml.kernel.org/r/7bce2cd4b5afb56211822835d145471280c3dccc.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
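A hedged sketch of the resulting helper; the name vmg_adjust_set_range() comes from the commit text, but this body is an assumption rather than the actual kernel implementation. Both adjust cases collapse to moving the adjusted VMA's start to vmg->end while recomputing vm_pgoff:

	/* Hedged sketch -- not the kernel's exact code. */
	static void vmg_adjust_set_range(struct vma_merge_struct *vmg)
	{
		struct vm_area_struct *adjust;
		pgoff_t pgoff;

		if (vmg->__adjust_middle_start)
			adjust = vmg->middle;
		else
			adjust = vmg->next;

		/* Both cases reduce to: new start == vmg->end. */
		pgoff = adjust->vm_pgoff + PHYS_PFN(vmg->end - adjust->vm_start);
		vma_set_range(adjust, vmg->end, adjust->vm_end, pgoff);
	}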
2025-03-16mm: make vmg->target consistent and further simplify commit_merge()Lorenzo Stoakes
It is confusing for vmg->target to sometimes be the target merged VMA and in one case not. Fix this by having commit_merge() use its awareness of the vmg->__adjust_next_start case to know that it is manipulating a separate vma, abstracted in the 'vma' local variable.

Place removal and adjust VMA determination logic into init_multi_vma_prep(), as the flags give us enough information to do so, and since this is the function that sets up the vma_prepare struct it makes sense to do so here.

Doing this significantly simplifies commit_merge(), allowing us to eliminate the 'merge_target' handling, initialise the VMA iterator in a more sensible place and simply return vmg->target consistently.

This also allows us to simplify setting vmg->target in vma_merge_existing_range() since we are then left with only two cases - merge left (or both) where the target is vmg->prev, or merge right in which the target is vmg->next.

This makes it easy for somebody reading the code to know what VMA will actually be the one returned and merged into and removes a great deal of the confusing 'adjust' nonsense.

This patch has no change in functional behaviour.

Link: https://lkml.kernel.org/r/50f96e31ab1980eaaf1006e34a4f6e6dad9320b8.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: eliminate adj_start parameter from commit_merge()Lorenzo Stoakes
Introduce internal vmg->__adjust_middle_start and vmg->__adjust_next_start merge flags, enabling us to indicate to commit_merge() that we are performing a merge which either spans only part of vmg->middle, or part of vmg->next respectively.

In the former instance, we change the start of vmg->middle to match the attributes of vmg->prev, without spanning all of vmg->middle. This implies that vmg->prev->vm_end and vmg->middle->vm_start are both increased to form the new merged VMA (vmg->prev) and the new subsequent VMA (vmg->middle).

In the latter case, we change the end of vmg->middle to match the attributes of vmg->next, without spanning all of vmg->next. This implies that vmg->middle->vm_end and vmg->next->vm_start are both decreased to form the new merged VMA (vmg->next) and the new prior VMA (vmg->middle).

Since we now have a stable set of prev, middle, next VMAs threaded through vmg and, with these flags set, know what is happening, we can perform the calculation in commit_merge() instead. This allows us to drop the confusing adj_start parameter and instead pass semantic information to commit_merge().

In the latter case the -(middle->vm_end - start) calculation becomes -(middle->vm_end - vmg->end), however this is correct as vmg->end is set to the start parameter. This is because in this case (rather confusingly), we manipulate vmg->middle, but ultimately return vmg->next, whose range will be correctly specified. At this point vmg->start, end is the new range for the prior VMA rather than the merged one.

This patch has no change in functional behaviour.

Link: https://lkml.kernel.org/r/bcec0cd980b373a5eb02236cb033034ce1effe42.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: further refactor commit_merge()Lorenzo Stoakes
The current VMA merge mechanism contains a number of confusing mechanisms around removal of VMAs on merge and the shrinking of the VMA adjacent to vmg->target in the case of merges which result in a partial merge with that adjacent VMA.

Since we now have a STABLE set of VMAs - prev, middle, next - we are now able to have the caller of commit_merge() explicitly tell us which VMAs need deleting, using newly introduced internal VMA merge flags. Doing so allows us to embed this state within the VMG and remove the confusing remove, remove2 parameters from commit_merge().

We are additionally able to eliminate the highly confusing and misleading 'expanded' parameter - a parameter that in reality refers to whether or not the returned VMA is the target one or the one immediately adjacent. We can infer which is the case from whether or not the adj_start parameter is negative. This also allows us to simplify further logic around iterator configuration and VMA iterator stores.

Doing so means we can also eliminate the adjust parameter, as we are able to infer which VMA ought to be adjusted from adj_start - a positive value implies we adjust the start of 'middle', a negative one implies we adjust the start of 'next'.

We are then able to have commit_merge() explicitly return the target VMA, or NULL on inability to pre-allocate memory. Errors were previously filtered so behaviour does not change.

We additionally move from the slightly odd use of a bitwise-flag enum vmg->merge_flags field to vmg bitfields.

This patch has no change in functional behaviour.

Link: https://lkml.kernel.org/r/7bf2ed24af68aac18672b7acebbd9102f48c5b03.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: simplify vma merge structure and expand commentsLorenzo Stoakes
Patch series "mm: further simplify VMA merge operation", v3. While significant efforts have been made to improve the VMA merge operation, there remains remnants of the bad (or rather confusing) old days, which make the code difficult to understand, more bug prone and thus harder to modify. This series attempts to significantly improve matters in a number of respects - with a focus on simplifying the commit_merge() function which actually actions the merge operation - and importantly, adjusting the two most confusing merge cases - those in which we 'adjust' the VMA immediately adjacent to the one being merged. One source of confusion are the VMAs being threaded through the operation themselves - vmg->prev, vmg->vma and vmg->next. At the start of the operation, vmg->vma is either NULL if a new VMA is propose to be added, or if not then a pointer to an existing VMA being modified, and prev/next are (perhaps not present) VMAs sat immediately before and after the range specified in vmg->start, end, respectively. However, during the VMA merge operation, we change vmg->start, end and pgoff to span the newly merged range and vmg->vma to either be: a. The ultimately returned VMA (in most cases) or b. A VMA which we will manipulate, but ultimately instead return vmg->next. Case b. especially here is confusing for somebody reading this code, but the fact we update this state, along with vmg->start, end, pgoff only makes matters worse. We simplify things by replacing vmg->vma with vmg->middle and never changing it - this is always either NULL (for a new VMA) or the VMA being modified between vmg->prev and vmg->next. We further simplify by placing the merged VMA in a new vmg->target field - whether case b. above is the case or not. The reader of the code can now simply rely on vmg->middle being the middle VMA and vmg->target being the ultimately merged VMA. We additionally tackle the confusing cases where we 'adjust' VMAs other than the one we ultimately return as the merged VMA (this includes case b. above). These are: (1) merge <-----------> |------||--------| |------------|---| | prev || middle | -> | target | m | |------||--------| |------------|---| In which case middle must be adjusted so middle->vm_start is increased as well as performing the merge. (2) (equivalent to case b. above) <-------------> |---------||------| |---|-------------| | middle || next | -> | m | target | |---------||------| |---|-------------| In which case next must be adjusted so next->vm_start is decreased as well as performing the merge. This cases have previously been performed by calculating and passing around a dubious and confusing 'adj_start' parameter along side a pointer to an 'adjust' VMA indicating which VMA requires additional adjustment (middle in case 1 and next in case 2). With the VMG structure in place we are able to avoid this by simply setting a merge flag to describe each case: (1) Sets the vmg->__adjust_middle_start flag (2) Sets the vmg->__adjust_next_start flag By doing so it turns out we can vastly simplify the logic and calculate what is required to perform the operation. Taken together the refactorings make it far easier to understand what is being done even in these more confusing cases, make the code far more maintainable, debuggable, and testable, providing more internal state indicating what is happening in the merge operation. The changes have no functional net impact on the merge operation and everything should still behave as it did before. 
This patch (of 5):

The merge code, while much improved, still has a number of points of confusion. As part of a broader series cleaning this up to make this more maintainable, we start by addressing some confusion around vma_merge_struct fields.

So far, the caller either provides no vmg->vma (a new VMA) or supplies the existing VMA which is being altered, setting vmg->start,end,pgoff to the proposed VMA dimensions. vmg->vma is then updated, as are vmg->start,end,pgoff as the merge process proceeds and the appropriate merge strategy is determined.

This is rather confusing, as vmg->vma starts off as the 'middle' VMA between vmg->prev,next, but becomes the 'target' VMA, except in one specific edge case (merge next, shrink middle).

In this patch we introduce vmg->middle to describe the VMA that is between vmg->prev and vmg->next, and does NOT change during the merge operation.

We replace vmg->vma with vmg->target, and use this only during the merge operation itself. Aside from the merge right, shrink middle case, this becomes the VMA that forms the basis of the VMA that is returned. This edge case can be addressed in a future commit.

We also add a number of comments to explain what is going on.

Finally, we adjust the ASCII diagrams showing each merge case in vma_merge_existing_range() to be clearer - the arrow range previously showed the vmg->start, end spanned area, but it is clearer to change this to show the final merged VMA.

This patch has no change in functional behaviour.

Link: https://lkml.kernel.org/r/cover.1738326519.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/4dfe60f1419d55e5d0516f56349695d73a57184c.1738326519.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
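A hedged sketch of the reworked descriptor may help picture the change; the exact field list and types here are assumptions, only prev/middle/next/target and the start/end/pgoff span are taken from the commit text:

	struct vma_merge_struct {
		struct vm_area_struct *prev;	/* VMA immediately before the range, if any */
		struct vm_area_struct *middle;	/* VMA being modified; NULL for a new VMA; never changes */
		struct vm_area_struct *next;	/* VMA immediately after the range, if any */
		struct vm_area_struct *target;	/* VMA the range is merged into and ultimately returned */
		unsigned long start, end;	/* proposed range */
		pgoff_t pgoff;			/* proposed page offset */
		/* ... merge state flags, e.g. __adjust_middle_start, __adjust_next_start ... */
	};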
2025-03-16mm: zbud: remove zbudYosry Ahmed
The zbud compressed pages allocator is rarely used; most users use zsmalloc. zbud consumes much more memory (it only stores 1 or 2 compressed pages per physical page). The only advantage of zbud is a marginal performance improvement that by no means justifies the memory overhead.

Historically, zsmalloc had significantly worse latency than zbud and z3fold but offered better memory savings. This is no longer the case as shown by a simple recent analysis [1]. In a kernel build test on tmpfs in a limited cgroup, zbud took 2-3% less time than zsmalloc, but at the cost of using ~32% more memory (1.5G vs 1.13G). The tradeoff does not make sense for zbud in any practical scenario.

The only alleged advantage of zbud is not having the dependency on CONFIG_MMU, but CONFIG_SWAP already depends on CONFIG_MMU anyway, and zbud is only used by zswap.

Remove zbud after z3fold's removal, leaving zsmalloc as the one and only zpool allocator. Leave the removal of the zpool API (and its associated config options) to a followup cleanup after no more allocators show up.

Deprecating zbud for a few cycles before removing it was initially proposed [2], like z3fold was marked as deprecated for 2 cycles [3]. However, Johannes rightfully pointed out that the 2 cycles is too short for most downstream consumers, and z3fold was deprecated first only as a courtesy anyway.

[1] https://lore.kernel.org/lkml/CAJD7tkbRF6od-2x_L8-A1QL3=2Ww13sCj4S3i4bNndqF+3+_Vg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/Z5gdnSX5Lv-nfjQL@google.com/
[3] https://lore.kernel.org/lkml/20240904233343.933462-1-yosryahmed@google.com/

Link: https://lkml.kernel.org/r/20250129180633.3501650-3-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: z3fold: remove z3foldYosry Ahmed
Patch series "mm: zswap: remove z3fold and zbud", v2. After 2 cycles of deprecating z3fold, remove it as well as zbud (rationale in specific patches). This patch (of 2): Z3fold has been marked as deprecated for 2 cycles and no one complained, as expected. As there are no known users, remove the code now. Link: https://lkml.kernel.org/r/20250129180633.3501650-1-yosry.ahmed@linux.dev Link: https://lkml.kernel.org/r/20250129180633.3501650-2-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/vmscan: extract calculated pressure balance as a functionHao Zhang
Extract the pressure balance calculation into a function. This doesn't change current behaviour.

[akpm@linux-foundation.org: 80-col wrapping]
Link: https://lkml.kernel.org/r/tencent_735DB36A2306C08B8568049E4C0B99716C07@qq.com Signed-off-by: Hao Zhang <zhanghao1@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
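As a rough illustration of the kind of extraction meant here (the function name and parameter list are assumptions for the sketch, not the exact kernel signature), the anon/file scan pressure weights computed in get_scan_count() move into a dedicated helper:

	/* Hedged sketch: weights are proportional to swappiness and inversely
	 * proportional to the recently observed reclaim cost of each LRU type. */
	static void calculate_pressure_balance(int swappiness,
					       unsigned long anon_cost,
					       unsigned long file_cost,
					       u64 fraction[2], u64 *denominator)
	{
		unsigned long total_cost = anon_cost + file_cost;
		u64 ap, fp;

		ap = swappiness * (total_cost + 1);
		ap /= anon_cost + 1;

		fp = (200 - swappiness) * (total_cost + 1);
		fp /= file_cost + 1;

		fraction[0] = ap;		/* anon share */
		fraction[1] = fp;		/* file share */
		*denominator = ap + fp + 1;
	}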
2025-03-16mm/rmap: avoid -EBUSY from make_device_exclusive()David Hildenbrand
Failing to obtain the folio lock, for example because the folio is concurrently getting migrated or swapped out, can easily make the callers fail: for example, the hmm selftest can sometimes be observed to fail because of this. Instead of forcing the caller to retry, let's simply retry in this to-be-expected case. Similarly, avoid spurious failures simply because we raced with someone (e.g., swapout) modifying the page table such that our folio_walk fails. Simply unconditionally lock the folio, and retry GUP if our folio_walk fails. Note that the folio_walk repeatedly failing is not something we expect. Note that we might want to avoid grabbing the folio lock at some point; for now, keep that as is and only unconditionally lock the folio. With this change, the hmm selftests don't fail simply because the folio is already locked. While this fixes the selftests in some cases, it's likely not something that deserves a "Fixes:". Link: https://lkml.kernel.org/r/20250210193801.781278-18-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
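A minimal control-flow sketch of the retry behaviour described above; the GUP flags mirror what such a lookup typically uses, and the surrounding function's exact signature and return convention are assumptions:

	struct vm_area_struct *vma;
	struct folio_walk fw;
	struct folio *folio;
	struct page *page;

retry:
	page = get_user_page_vma_remote(mm, addr,
					FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
					&vma);
	if (IS_ERR(page))
		return page;
	folio = page_folio(page);

	folio_lock(folio);		/* unconditionally lock; may sleep */
	if (!folio_walk_start(&fw, vma, addr, 0)) {
		/* Raced with e.g. swapout or migration: drop and retry GUP. */
		folio_unlock(folio);
		folio_put(folio);
		goto retry;
	}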
2025-03-16mm/rmap: keep mapcount untouched for device-exclusive entriesDavid Hildenbrand
Now that conversion to device-exclusive no longer performs an rmap walk and all page_vma_mapped_walk() users were taught to properly handle device-exclusive entries, let's treat device-exclusive entries just as if they were present, similar to how we handle device-private entries already.

This fixes swapout/migration/split/hwpoison of folios with device-exclusive entries.

We only had to take care of page_vma_mapped_walk() users, because these traditionally assume pte_present(). Other page table walkers already have to handle !pte_present(), and some of them might simply skip them (e.g., MADV_PAGEOUT) if they are not specialized on them. This change doesn't modify the latter.

Note that while folios with device-exclusive PTEs can now get migrated, khugepaged will not collapse a THP if there is a device-exclusive PTE. Doing so might also not be desired if the device frequently performs atomics to the same page. Similarly, KSM will never merge order-0 folios that are device-exclusive.

Link: https://lkml.kernel.org/r/20250210193801.781278-17-david@redhat.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/damon: handle device-exclusive entries correctly in damon_folio_mkold_one()David Hildenbrand
Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk(). damon_folio_mkold_one() is not prepared for that and calls damon_ptep_mkold() with PFN swap PTEs. Teach damon_ptep_mkold() to deal with these PFN swap PTEs. Note that device-private entries are so far not applicable on that path, as damon_get_folio() filters out non-lru folios. Should we just skip PFN swap PTEs completely? Possible, but it seems straight forward to just handle it correctly. Note that we could currently only run into this case with device-exclusive entries on THPs. We still adjust the mapcount on conversion to device-exclusive; this makes the rmap walk abort early for small folios, because we'll always have !folio_mapped() with a single device-exclusive entry. We'll adjust the mapcount logic once all page_vma_mapped_walk() users can properly handle device-exclusive entries. Link: https://lkml.kernel.org/r/20250210193801.781278-16-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: SeongJae Park <sj@kernel.org> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/damon: handle device-exclusive entries correctly in damon_folio_young_one()David Hildenbrand
Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk(). damon_folio_young_one() is not prepared for that, so teach it about these PFN swap PTEs. Note that device-private entries are so far not applicable on that path, as we expect ZONE_DEVICE pages so far only in migration code when it comes to the RMAP. The impact is rather small: we'd be calling pte_young() on a non-present PTE, which is not really defined to have semantic. Note that we could currently only run into this case with device-exclusive entries on THPs. We still adjust the mapcount on conversion to device-exclusive; this makes the rmap walk abort early for small folios, because we'll always have !folio_mapped() with a single device-exclusive entry. We'll adjust the mapcount logic once all page_vma_mapped_walk() users can properly handle device-exclusive entries. Link: https://lkml.kernel.org/r/20250210193801.781278-15-david@redhat.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: SeongJae Park <sj@kernel.org> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/page_idle: handle device-exclusive entries correctly in page_idle_clear_pte_refs_one()David Hildenbrand
Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk(). page_idle_clear_pte_refs_one() is not prepared for that, so let's teach it what to do with these PFN swap PTEs. Note that device-private entries are so far not applicable on that path, as page_idle_get_folio() filters out non-lru folios. Should we just skip PFN swap PTEs completely? Possible, but it seems straightforward to just handle them correctly. Note that we could currently only run into this case with device-exclusive entries on THPs. We still adjust the mapcount on conversion to device-exclusive; this makes the rmap walk abort early for small folios, because we'll always have !folio_mapped() with a single device-exclusive entry. We'll adjust the mapcount logic once all page_vma_mapped_walk() users can properly handle device-exclusive entries. Link: https://lkml.kernel.org/r/20250210193801.781278-14-david@redhat.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/rmap: handle device-exclusive entries correctly in page_vma_mkclean_one()David Hildenbrand
Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk(). page_vma_mkclean_one() is not prepared for that, so teach it about these PFN swap PTEs. Note that device-private entries are so far not applicable on that path, as we expect ZONE_DEVICE pages so far only in migration code when it comes to the RMAP. Note that we could currently only run into this case with device-exclusive entries on THPs. We still adjust the mapcount on conversion to device-exclusive; this makes the rmap walk abort early for small folios, because we'll always have !folio_mapped() with a single device-exclusive entry. We'll adjust the mapcount logic once all page_vma_mapped_walk() users can properly handle device-exclusive entries. Link: https://lkml.kernel.org/r/20250210193801.781278-13-david@redhat.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one()David Hildenbrand
Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk(). try_to_migrate_one() is not prepared for that, so teach it about these PFN swap PTEs. We already handle device-private entries by specializing on the folio, so we can reshuffle that code to make it work on the PFN swap PTEs instead. Get rid of the folio_is_device_private() handling. Note that we never currently expect device-private folios with HWPoison flag set at that point, so add a warning in case that ever changes and we can figure out what the right thing to do is. Note that we could currently only run into this case with device-exclusive entries on THPs. We still adjust the mapcount on conversion to device-exclusive; this makes the rmap walk abort early for small folios, because we'll always have !folio_mapped() with a single device-exclusive entry. We'll adjust the mapcount logic once all page_vma_mapped_walk() users can properly handle device-exclusive entries. Further note that try_to_migrate() calls MMU notifiers and holds the folio lock, so any device-exclusive users should be properly prepared for a device-exclusive PTE to "vanish". Link: https://lkml.kernel.org/r/20250210193801.781278-12-david@redhat.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/rmap: handle device-exclusive entries correctly in try_to_unmap_one()David Hildenbrand
Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk(). try_to_unmap_one() is not prepared for that, so teach it about these PFN swap PTEs. Note that device-private entries are so far not applicable on that path, as we expect ZONE_DEVICE pages so far only in migration code when it comes to the RMAP. Note that we could currently only run into this case with device-exclusive entries on THPs. We still adjust the mapcount on conversion to device-exclusive; this makes the rmap walk abort early for small folios, because we'll always have !folio_mapped() with a single device-exclusive entry. We'll adjust the mapcount logic once all page_vma_mapped_walk() users can properly handle device-exclusive entries. Further note that try_to_unmap() calls MMU notifiers and holds the folio lock, so any device-exclusive users should be properly prepared for a device-exclusive PTE to "vanish". Link: https://lkml.kernel.org/r/20250210193801.781278-11-david@redhat.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
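The recurring change in this group of patches follows the same pattern: a page_vma_mapped_walk() user can no longer assume pte_present() and must also accept a device-exclusive (non-present PFN swap) PTE. A hedged, generic sketch of that pattern (variable names are illustrative, taken from the typical shape of such loops rather than any one function):

	pte_t pteval = ptep_get(pvmw.pte);
	unsigned long pfn;

	if (pte_present(pteval)) {
		pfn = pte_pfn(pteval);
	} else {
		/* Non-present PFN swap PTE, e.g. device-exclusive. */
		swp_entry_t entry = pte_to_swp_entry(pteval);

		if (!is_device_exclusive_entry(entry))
			break;			/* unexpected on this path */
		pfn = swp_offset_pfn(entry);
	}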
2025-03-16mm/ksm: handle device-exclusive entries correctly in write_protect_page()David Hildenbrand
Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk(). write_protect_page() is not prepared for that, so teach it about these PFN swap PTEs. Note that device-private entries are so far not applicable on that path, because GUP would never have returned such folios (conversion to device-private happens by page migration, not in-place conversion of the PTE). There is a race between performing the folio_walk (which fails on non-present PTEs) and locking the folio to look it up using page_vma_mapped_walk() again, so this is likely a fix (unless something else could prevent that race, but it doesn't look like). In the future it could be handled if ever required, for now just give up and ignore them like folio_walk would. Link: https://lkml.kernel.org/r/20250210193801.781278-10-david@redhat.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/page_vma_mapped: device-exclusive entries are not migration entriesDavid Hildenbrand
It's unclear why they would be considered migration entries; they are not. Likely we'll never really trigger that case in practice, because migration (including folio split) of a folio that has device-exclusive entries is never started, as we would detect "additional references": device-exclusive entries adjust the mapcount, but not the refcount. Link: https://lkml.kernel.org/r/20250210193801.781278-8-david@redhat.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: use single SWP_DEVICE_EXCLUSIVE entry typeDavid Hildenbrand
There is no need for the distinction anymore; let's merge the readable and writable device-exclusive entries into a single device-exclusive entry type. Link: https://lkml.kernel.org/r/20250210193801.781278-7-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Simona Vetter <simona.vetter@ffwll.ch> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
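A hedged sketch of what the consolidation looks like at the swap-entry level; the surviving type name comes from the subject, while the helper bodies shown here are assumptions modelled on the existing device-private helpers:

	/* Single device-exclusive type replaces the old readable/writable pair. */
	static inline swp_entry_t make_device_exclusive_entry(pgoff_t offset)
	{
		return swp_entry(SWP_DEVICE_EXCLUSIVE, offset);
	}

	static inline bool is_device_exclusive_entry(swp_entry_t entry)
	{
		return swp_type(entry) == SWP_DEVICE_EXCLUSIVE;
	}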
2025-03-16mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()David Hildenbrand
Let's do it just like mprotect write-upgrade or during NUMA-hinting faults on PROT_NONE PTEs: detect if the PTE can be writable by using can_change_pte_writable(). Set the PTE only dirty if the folio is dirty: we might not necessarily have a write access, and setting the PTE writable doesn't require setting the PTE dirty. From a CPU perspective, these entries are clean. So only set the PTE dirty if the folio is dirty. With this change in place, there is no need to have separate readable and writable device-exclusive entry types, and we'll merge them next separately. Note that, during fork(), we first convert the device-exclusive entries back to ordinary PTEs, and we only ever allow conversion of writable PTEs to device-exclusive -- only mprotect can currently change them to readable-device-exclusive. Consequently, we always expect PageAnonExclusive(page)==true and can_change_pte_writable()==true, unless we are dealing with soft-dirty tracking or uffd-wp. But reusing can_change_pte_writable() for now is cleaner. Link: https://lkml.kernel.org/r/20250210193801.781278-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
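A hedged sketch of the restore path described above (simplified; the real restore_exclusive_pte() does more bookkeeping around rmap and MMU notifiers):

	/* Hedged sketch: rebuild a present PTE from a device-exclusive entry. */
	pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
	if (folio_test_dirty(folio))
		pte = pte_mkdirty(pte);		/* only dirty if the folio is dirty */
	if (can_change_pte_writable(vma, address, pte))
		pte = pte_mkwrite(pte, vma);	/* upgrade to writable if permitted */
	set_pte_at(vma->vm_mm, address, ptep, pte);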
2025-03-16mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walkDavid Hildenbrand
We require a writable PTE and only support anonymous folios: we can only have exactly one PTE pointing at that page, which we can just look up using a folio walk, avoiding the rmap walk and the anon VMA lock.

So let's stop doing an rmap walk and perform a folio walk instead, so we can easily just modify a single PTE and avoid relying on rmap/mapcounts.

We now effectively work on a single PTE instead of multiple PTEs of a large folio, allowing for conversion of individual PTEs from non-exclusive to device-exclusive -- note that the opposite direction always works on single PTEs: restore_exclusive_pte().

With this change, device-exclusive handling is fully compatible with THPs / large folios. We still require PMD-sized THPs to get PTE-mapped, and supporting PMD-mapped THP (without the PTE-remapping) is a different endeavour that might not be worth it at this point: it might even have negative side-effects [1].

This gets rid of the "folio_mapcount()" usage and lets us fix ordinary rmap walks (migration/swapout) next. Spell out that messing with the mapcount is wrong and must be fixed.

[1] https://lkml.kernel.org/r/Z5tI-cOSyzdLjoe_@phenom.ffwll.local

Link: https://lkml.kernel.org/r/20250210193801.781278-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
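A hedged sketch of what "perform a folio walk instead" means here (simplified; locking, PTL handling and error paths of the real code are omitted, and the surrounding variables are assumed from context):

	struct folio_walk fw;
	struct folio *walked;

	mmap_assert_locked(mm);
	walked = folio_walk_start(&fw, vma, addr, 0);
	if (walked == folio) {
		/* fw.ptep points at the single PTE mapping this anon folio;
		 * replace it with a device-exclusive swap entry here. */
		folio_walk_end(&fw, vma);
	} else if (walked) {
		/* Found a different folio mapped at addr: bail out. */
		folio_walk_end(&fw, vma);
	}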
2025-03-16mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()David Hildenbrand
The single "real" user in the tree of make_device_exclusive_range() always requests making only a single address exclusive. The current implementation is hard to fix for properly supporting anonymous THP / large folios and for avoiding messing with rmap walks in weird ways. So let's always process a single address/page and return folio + page to minimize page -> folio lookups. This is a preparation for further changes. Reject any non-anonymous or hugetlb folios early, directly after GUP. While at it, extend the documentation of make_device_exclusive() to clarify some things. Link: https://lkml.kernel.org/r/20250210193801.781278-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Simona Vetter <simona.vetter@ffwll.ch> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/rmap: reject hugetlb folios in folio_make_device_exclusive()David Hildenbrand
Even though FOLL_SPLIT_PMD on hugetlb now always fails with -EOPNOTSUPP, let's add a safety net in case FOLL_SPLIT_PMD usage would ever be reworked. In particular, before commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code"), GUP(FOLL_SPLIT_PMD) would just have returned a page. In particular, hugetlb folios that are not PMD-sized would never have been prone to FOLL_SPLIT_PMD. hugetlb folios can be anonymous, and page_make_device_exclusive_one() is not really prepared for handling them at all. So let's spell that out. Link: https://lkml.kernel.org/r/20250210193801.781278-3-david@redhat.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
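The safety net itself is tiny; a hedged sketch of the check (exact placement and the surrounding function's error-return convention are assumptions):

	/* Hedged sketch: bail out early if GUP handed us a hugetlb folio. */
	if (folio_test_hugetlb(folio)) {
		folio_put(folio);
		return ERR_PTR(-EOPNOTSUPP);	/* hypothetical error path */
	}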
2025-03-16mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAsDavid Hildenbrand
Patch series "mm: fixes for device-exclusive entries (hmm)", v2. Discussing the PageTail() call in make_device_exclusive_range() with Willy, I recently discovered [1] that device-exclusive handling does not properly work with THP, making the hmm-tests selftests fail if THPs are enabled on the system. Looking into more details, I found that hugetlb is not properly fenced, and I realized that something that was bugging me for longer -- how device-exclusive entries interact with mapcounts -- completely breaks migration/swapout/split/hwpoison handling of these folios while they have device-exclusive PTEs. The program below can be used to allocate 1 GiB worth of pages and making them device-exclusive on a kernel with CONFIG_TEST_HMM. Once they are device-exclusive, these folios cannot get swapped out (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how much one forces memory reclaim), and when having a memory block onlined to ZONE_MOVABLE, trying to offline it will loop forever and complain about failed migration of a page that should be movable. # echo offline > /sys/devices/system/memory/memory136/state # echo online_movable > /sys/devices/system/memory/memory136/state # ./hmm-swap & ... wait until everything is device-exclusive # echo offline > /sys/devices/system/memory/memory136/state [ 285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x7f20671f7 pfn:0x442b6a [ 285.196618][T14882] memcg:ffff888179298000 [ 285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate| dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff) [ 285.201734][T14882] raw: ... [ 285.204464][T14882] raw: ... [ 285.207196][T14882] page dumped because: migration failure [ 285.209072][T14882] page_owner tracks the page as allocated [ 285.210915][T14882] page last allocated via order 0, migratetype Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774 [ 285.216765][T14882] post_alloc_hook+0x197/0x1b0 [ 285.218874][T14882] get_page_from_freelist+0x76e/0x3280 [ 285.220864][T14882] __alloc_frozen_pages_noprof+0x38e/0x2740 [ 285.223302][T14882] alloc_pages_mpol+0x1fc/0x540 [ 285.225130][T14882] folio_alloc_mpol_noprof+0x36/0x340 [ 285.227222][T14882] vma_alloc_folio_noprof+0xee/0x1a0 [ 285.229074][T14882] __handle_mm_fault+0x2b38/0x56a0 [ 285.230822][T14882] handle_mm_fault+0x368/0x9f0 ... This series fixes all issues I found so far. There is no easy way to fix without a bigger rework/cleanup. I have a bunch of cleanups on top (some previous sent, some the result of the discussion in v1) that I will send out separately once this landed and I get to it. I wish we could just use some special present PROT_NONE PTEs instead of these (non-present, non-none) fake-swap entries; but that just results in the same problem we keep having (lack of spare PTE bits), and staring at other similar fake-swap entries, that ship has sailed. With this series, make_device_exclusive() doesn't actually belong into mm/rmap.c anymore, but I'll leave moving that for another day. I only tested this series with the hmm-tests selftests due to lack of HW, so I'd appreciate some testing, especially if the interaction between two GPUs wanting a device-exclusive entry works as expected. 
<program>
#include <stdio.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/ioctl.h>

#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd)

struct hmm_dmirror_cmd {
	__u64 addr;
	__u64 ptr;
	__u64 npages;
	__u64 cpages;
	__u64 faults;
};

const size_t size = 1 * 1024 * 1024 * 1024ul;
const size_t chunk_size = 2 * 1024 * 1024ul;

int main(void)
{
	struct hmm_dmirror_cmd cmd;
	size_t cur_size;
	int fd, ret;
	char *addr, *mirror;

	fd = open("/dev/hmm_dmirror1", O_RDWR, 0);
	if (fd < 0) {
		perror("open failed\n");
		exit(1);
	}

	addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap failed\n");
		exit(1);
	}
	madvise(addr, size, MADV_NOHUGEPAGE);
	memset(addr, 1, size);

	mirror = malloc(chunk_size);

	for (cur_size = 0; cur_size < size; cur_size += chunk_size) {
		cmd.addr = (uintptr_t)addr + cur_size;
		cmd.ptr = (uintptr_t)mirror;
		cmd.npages = chunk_size / getpagesize();
		ret = ioctl(fd, HMM_DMIRROR_EXCLUSIVE, &cmd);
		if (ret) {
			perror("ioctl failed\n");
			exit(1);
		}
	}
	pause();
	return 0;
}
</program>

[1] https://lkml.kernel.org/r/25e02685-4f1d-47fa-be5b-01ff85bb0ce2@redhat.com

This patch (of 17):

We only have two FOLL_SPLIT_PMD users. While uprobe refuses hugetlb early, make_device_exclusive_range() can end up getting called on hugetlb VMAs. Right now, this means that with a PMD-sized hugetlb page, we can end up calling split_huge_pmd(), because pmd_trans_huge() also succeeds with hugetlb PMDs. For example, using a modified hmm-test selftest one can trigger:

[  207.017134][T14945] ------------[ cut here ]------------
[  207.018614][T14945] kernel BUG at mm/page_table_check.c:87!
[  207.019716][T14945] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[  207.021072][T14945] CPU: 3 UID: 0 PID: ...
[  207.023036][T14945] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[  207.024834][T14945] RIP: 0010:page_table_check_clear.part.0+0x488/0x510
[  207.026128][T14945] Code: ...
[  207.029965][T14945] RSP: 0018:ffffc9000cb8f348 EFLAGS: 00010293
[  207.031139][T14945] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffffffff8249a0cd
[  207.032649][T14945] RDX: ffff88811e883c80 RSI: ffffffff8249a357 RDI: ffff88811e883c80
[  207.034183][T14945] RBP: ffff888105c0a050 R08: 0000000000000005 R09: 0000000000000000
[  207.035688][T14945] R10: 00000000ffffffff R11: 0000000000000003 R12: 0000000000000001
[  207.037203][T14945] R13: 0000000000000200 R14: 0000000000000001 R15: dffffc0000000000
[  207.038711][T14945] FS: 00007f2783275740(0000) GS:ffff8881f4980000(0000) knlGS:0000000000000000
[  207.040407][T14945] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  207.041660][T14945] CR2: 00007f2782c00000 CR3: 0000000132356000 CR4: 0000000000750ef0
[  207.043196][T14945] PKRU: 55555554
[  207.043880][T14945] Call Trace:
[  207.044506][T14945]  <TASK>
[  207.045086][T14945]  ? __die+0x51/0x92
[  207.045864][T14945]  ? die+0x29/0x50
[  207.046596][T14945]  ? do_trap+0x250/0x320
[  207.047430][T14945]  ? do_error_trap+0xe7/0x220
[  207.048346][T14945]  ? page_table_check_clear.part.0+0x488/0x510
[  207.049535][T14945]  ? handle_invalid_op+0x34/0x40
[  207.050494][T14945]  ? page_table_check_clear.part.0+0x488/0x510
[  207.051681][T14945]  ? exc_invalid_op+0x2e/0x50
[  207.052589][T14945]  ? asm_exc_invalid_op+0x1a/0x20
[  207.053596][T14945]  ? page_table_check_clear.part.0+0x1fd/0x510
[  207.054790][T14945]  ? page_table_check_clear.part.0+0x487/0x510
[  207.055993][T14945]  ? page_table_check_clear.part.0+0x488/0x510
[  207.057195][T14945]  ? page_table_check_clear.part.0+0x487/0x510
[  207.058384][T14945]  __page_table_check_pmd_clear+0x34b/0x5a0
[  207.059524][T14945]  ? __pfx___page_table_check_pmd_clear+0x10/0x10
[  207.060775][T14945]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
[  207.061940][T14945]  ? __pfx___lock_acquire+0x10/0x10
[  207.062967][T14945]  pmdp_huge_clear_flush+0x279/0x360
[  207.064024][T14945]  split_huge_pmd_locked+0x82b/0x3750
...

Before commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code"), we would have ignored the flag; instead, let's simply refuse the combination completely in check_vma_flags(): the caller is likely not prepared to handle any hugetlb folios. We'll teach make_device_exclusive_range() separately to ignore any hugetlb folios as a future-proof safety net.

Link: https://lkml.kernel.org/r/20250210193801.781278-1-david@redhat.com Link: https://lkml.kernel.org/r/20250210193801.781278-2-david@redhat.com Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Barry Song <v-songbaohua@oppo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
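For illustration, a minimal sketch of the kind of rejection described above inside mm/gup.c's check_vma_flags() (a sketch of the idea, not necessarily the exact upstream diff):

	/* sketch: early in check_vma_flags(), alongside the existing flag checks */
	if ((gup_flags & FOLL_SPLIT_PMD) && is_vm_hugetlb_page(vma))
		return -EOPNOTSUPP;	/* callers cannot handle hugetlb folios */

With this in place, a FOLL_SPLIT_PMD GUP walk over a hugetlb VMA fails cleanly instead of reaching split_huge_pmd() on a hugetlb PMD.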
2025-03-16mm/huge_memory: allow split shmem large folio to any lower orderZi Yan
Commit 4d684b5f92ba ("mm: shmem: add large folio support for tmpfs") has added large folio support to shmem. Remove the restriction in split_huge_page*(). Link: https://lkml.kernel.org/r/20250122161928.1240637-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Yang Shi <yang@os.amperecomputing.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
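Assuming the pre-existing guard looked roughly like the following, this is the kind of restriction being dropped from the non-anonymous branch of split_huge_page_to_list_to_order() (a sketch, not the literal diff):

	/* sketch of the removed check */
	if (new_order && shmem_mapping(folio->mapping))
		return -EINVAL;	/* no longer needed now that tmpfs supports large folios */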
2025-03-16mm/oom_kill: fix trivial typo in commentCarlos Llamas
Update 'give' -> 'given' in the description of oom_reap_task_mm(). Link: https://lkml.kernel.org/r/20250123193523.1496909-1-cmllamas@google.com Signed-off-by: Carlos Llamas <cmllamas@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: memcontrol: move memsw charge callbacks to v1Johannes Weiner
The interweaving of two entirely different swap accounting strategies has been one of the more confusing parts of the memcg code. Split out the v1 code to clarify the implementation and a handful of callsites, and to avoid building the v1 bits when !CONFIG_MEMCG_V1.

   text    data     bss     dec     hex filename
  39253    6446    4160   49859    c2c3 mm/memcontrol.o.old
  38877    6382    4160   49419    c10b mm/memcontrol.o

Link: https://lkml.kernel.org/r/20250124054132.45643-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: memcontrol: move stray ratelimit bits to v1Johannes Weiner
41213dd0f816 ("memcg: move mem_cgroup_event_ratelimit to v1 code") left this one behind. There are no v2 references. Link: https://lkml.kernel.org/r/20250124043859.18808-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm: memcontrol: unshare v2-only charge API bits againJohannes Weiner
6b611388b626 ("memcg-v1: remove charge move code") removed the remaining v1 callers. Link: https://lkml.kernel.org/r/20250124043859.18808-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16memcg: add CONFIG_MEMCG_V1 for 'local' functionsChen Ridong
Add CONFIG_MEMCG_V1 for the 'local' functions, which are only used in memcg v1, so that they won't be built for v2. Link: https://lkml.kernel.org/r/20250124073514.2375622-5-chenridong@huaweicloud.com Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: David Finkel <davidf@vimeo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wang Weiyang <wangweiyang2@huawei.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16memcg: factor out the replace_stock_objcg functionChen Ridong
Factor out the 'replace_stock_objcg' function to make the code more cohesive. Link: https://lkml.kernel.org/r/20250124073514.2375622-4-chenridong@huaweicloud.com Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Finkel <davidf@vimeo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wang Weiyang <wangweiyang2@huawei.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16memcg: call the free function when allocation of pn failsChen Ridong
The 'free_mem_cgroup_per_node_info' function is used to free the 'mem_cgroup_per_node' struct. Using 'pn' as the input for the free_mem_cgroup_per_node_info function will be much clearer. Call 'free_mem_cgroup_per_node_info' when 'alloc_mem_cgroup_per_node_info' fails, to free 'pn' as a whole, which makes the code more cohesive. Link: https://lkml.kernel.org/r/20250124073514.2375622-3-chenridong@huaweicloud.com Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: David Finkel <davidf@vimeo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wang Weiyang <wangweiyang2@huawei.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
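A minimal sketch of the resulting error path (field names and surrounding allocations abbreviated; not the exact upstream code):

	pn = kzalloc_node(sizeof(*pn), GFP_KERNEL, node);
	if (!pn)
		return false;

	pn->lruvec_stats = kzalloc_node(sizeof(struct lruvec_stats), GFP_KERNEL, node);
	if (!pn->lruvec_stats)
		goto fail;
	/* ... remaining per-node initialization ... */
	return true;
fail:
	free_mem_cgroup_per_node_info(pn);	/* free 'pn' as a whole on any failure */
	return false;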
2025-03-16memcg: use OFP_PEAK_UNSET instead of -1Chen Ridong
Patch series "Some cleanup for memcg", v4. This patch (of 4): The 'OFP_PEAK_UNSET' has been defined, use it instead of '-1'. Link: https://lkml.kernel.org/r/20250124073514.2375622-1-chenridong@huaweicloud.com Link: https://lkml.kernel.org/r/20250124073514.2375622-2-chenridong@huaweicloud.com Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: David Finkel <davidf@vimeo.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wang Weiyang <wangweiyang2@huawei.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16mm/page_alloc: fix memory accept before watermarks gets initializedKirill A. Shutemov
Watermarks are initialized during the postcore initcall. Until then, all watermarks are set to zero. This causes cond_accept_memory() to incorrectly skip memory acceptance because a watermark of 0 is always met. This can lead to a premature OOM on boot. To ensure progress, accept one MAX_ORDER page if the watermark is zero. Link: https://lkml.kernel.org/r/20250310082855.2587122-1-kirill.shutemov@linux.intel.com Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Reported-by: Farrah Chen <farrah.chen@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
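A hedged sketch of the behavior described above in cond_accept_memory(); the helper name accept_one_max_order_chunk() is hypothetical and stands in for the existing acceptance path:

	/* sketch: watermarks are still zero during early boot */
	if (!high_wmark_pages(zone)) {
		/*
		 * A watermark of 0 is always met, so without this we would skip
		 * acceptance entirely. Accept one MAX_ORDER chunk so allocations
		 * can make forward progress until watermarks are initialized.
		 */
		return accept_one_max_order_chunk(zone);	/* hypothetical helper */
	}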
2025-03-16memcg: drain obj stock on cpu hotplug teardownShakeel Butt
Currently on cpu hotplug teardown, only the memcg stock is drained, but we need to drain the obj stock as well; otherwise we will miss the stats accumulated on the target cpu as well as the cached nr_bytes. The stats include MEMCG_KMEM, NR_SLAB_RECLAIMABLE_B & NR_SLAB_UNRECLAIMABLE_B. In addition, we are leaking a reference to the struct obj_cgroup object. Link: https://lkml.kernel.org/r/20250310230934.2913113-1-shakeel.butt@linux.dev Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
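A hedged sketch of the hotplug teardown path after the fix (locking and return conventions approximated; not the exact upstream code):

	static int memcg_hotplug_cpu_dead(unsigned int cpu)
	{
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		struct obj_cgroup *old;

		/* flush cached objcg nr_bytes and vmstats, collect the objcg ref */
		old = drain_obj_stock(stock);

		/* then drain the plain memcg stock as before */
		drain_stock(stock);

		/* drop the reference that was previously being leaked */
		obj_cgroup_put(old);
		return 0;
	}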
2025-03-16mm/huge_memory: drop beyond-EOF folios with the right number of refsZi Yan
When an after-split folio is large and needs to be dropped due to EOF, folio_put_refs(folio, folio_nr_pages(folio)) should be used to drop all page cache refs. Otherwise, the folio will not be freed, causing a memory leak. This leak happens on filesystems with blocksize > page_size when a truncate is performed: the blocksize makes folios split into >0 order ones, so the truncated folios are never freed. Link: https://lkml.kernel.org/r/20250310155727.472846-1-ziy@nvidia.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Hugh Dickins <hughd@google.com> Closes: https://lore.kernel.org/all/fcbadb7f-dd3e-21df-f9a7-2853b53183c4@google.com/ Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
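Roughly, the change amounts to the following in the beyond-EOF handling after the split (a sketch using the commit's own description; variable names illustrative):

	/* the after-split folio may itself be large: drop all page cache refs */
	if (new_folio->index >= end)	/* 'end' being the file's end index */
		folio_put_refs(new_folio, folio_nr_pages(new_folio));
		/* previously a single folio_put(), which left large folios pinned */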
2025-03-16mm: fix error handling in __filemap_get_folio() with FGP_NOWAITRaphael S. Carvalho
original report: https://lore.kernel.org/all/CAKhLTr1UL3ePTpYjXOx2AJfNk8Ku2EdcEfu+CH1sf3Asr=B-Dw@mail.gmail.com/T/ When doing buffered writes with FGP_NOWAIT, under memory pressure, the system returned ENOMEM despite there being plenty of available memory that could be reclaimed from the page cache. The user space used the io_uring interface, which in turn submits I/O with FGP_NOWAIT (the fast path). retsnoop pointed to iomap_get_folio:

00:34:16.180612 -> 00:34:16.180651 TID/PID 253786/253721 (reactor-1/combined_tests):
    entry_SYSCALL_64_after_hwframe+0x76
    do_syscall_64+0x82
    __do_sys_io_uring_enter+0x265
    io_submit_sqes+0x209
    io_issue_sqe+0x5b
    io_write+0xdd
    xfs_file_buffered_write+0x84
    iomap_file_buffered_write+0x1a6
      32us [-ENOMEM] iomap_write_begin+0x408
                     iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… pos=0 len=4096 foliop=0xffffb32c296b7b80
    !  4us [-ENOMEM] iomap_get_folio
                     iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… pos=0 len=4096

This is likely a regression caused by 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio"), which moved error handling from iomap_get_folio() to __filemap_get_folio(), but broke FGP_NOWAIT handling, so ENOMEM escapes to user space. Had it correctly returned -EAGAIN with NOWAIT, either io_uring or user space itself would be able to retry the request. It's not enough to patch io_uring since the iomap interface is the one responsible for it, and pwritev2(RWF_NOWAIT) and AIO interfaces must return the proper error too. The patch was tested with the scylladb test suite (its original reproducer), and the tests all pass now when memory is pressured. Link: https://lkml.kernel.org/r/20250224143700.23035-1-raphaelsc@scylladb.com Fixes: 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio") Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
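A hedged sketch of the corrected error handling when folio allocation fails in __filemap_get_folio() (structure simplified; not the literal upstream diff):

	folio = filemap_alloc_folio(gfp, order);
	if (!folio)
		/* with FGP_NOWAIT the caller can retry; don't leak ENOMEM to it */
		return ERR_PTR((fgp_flags & FGP_NOWAIT) ? -EAGAIN : -ENOMEM);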
2025-03-16mm: memcontrol: fix swap counter leak from offline cgroupMuchun Song
Commit 6769183166b3 removed the id parameter from swap_cgroup_record() and instead derived the memcg id from mem_cgroup_id(folio_memcg(folio)). However, the caller may be updating a different memcg's counter than folio_memcg(folio)'s. E.g. in the case of mem_cgroup_swapout(), @swap_memcg could differ from @memcg and the counter of @swap_memcg is updated, but swap_cgroup_record() records the wrong memcg's ID. When the entry is uncharged from __mem_cgroup_uncharge_swap(), the swap counter leaks because of the wrongly recorded ID. Fix it by bringing the id parameter back. Link: https://lkml.kernel.org/r/20250306023133.44838-1-songmuchun@bytedance.com Fixes: 6769183166b3 ("mm/swap_cgroup: decouple swap cgroup recording and clearing") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
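A sketch of the intended call site in mem_cgroup_swapout() with the id parameter restored (argument order approximated from the description above):

	/* record the memcg that is actually charged for the swap entry */
	swap_cgroup_record(folio, mem_cgroup_id(swap_memcg), entry);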
2025-03-16mm/vma: do not register private-anon mappings with khugepaged during mmapDev Jain
We are already registering private-anon VMAs with khugepaged at fault time, in do_huge_pmd_anonymous_page(). Commit "register suitable readonly file vmas for khugepaged" moved the khugepaged registration logic from shmem_mmap to the generic mmap path. The userspace-visible effect is that khugepaged will unnecessarily scan mm's whose memory hasn't been faulted in yet. Note that it won't actually collapse anything because all PTEs are none. In addition, the mm is going to have a file VMA anyway during fork+exec, so it most likely already gets registered at mmap time via the non-anon case, making at least one of the mmap-time or fault-time registrations redundant. Make this logic specific to non-anon mappings. Link: https://lkml.kernel.org/r/20250306063037.16299-1-dev.jain@arm.com Fixes: 613bec092fe7 ("mm: mmap: register suitable readonly file vmas for khugepaged") Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
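A sketch of the intended mmap-path behavior (not the literal diff):

	/*
	 * Only file-backed (non-anon) VMAs need mmap-time registration;
	 * private-anon VMAs are registered at fault time in
	 * do_huge_pmd_anonymous_page().
	 */
	if (!vma_is_anonymous(vma))
		khugepaged_enter_vma(vma, vma->vm_flags);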
2025-03-16mm/migrate: fix shmem xarray update during migrationZi Yan
A shmem folio can be either in page cache or in swap cache, but not at the same time. Namely, once it is in swap cache, folio->mapping should be NULL, and the folio is no longer in a shmem mapping. In __folio_migrate_mapping(), to determine the number of xarray entries to update, folio_test_swapbacked() is used, but that conflates the shmem-in-page-cache case with the shmem-in-swap-cache case. It leads to xarray multi-index entry corruption, since it turns a sibling entry into a normal entry during xas_store() (see [1] for a userspace reproduction). Fix it by using only folio_test_swapcache() to determine whether the xarray is storing swap cache entries, and so choose the right number of xarray entries to update. [1] https://lore.kernel.org/linux-mm/Z8idPCkaJW1IChjT@casper.infradead.org/ Note: In __split_huge_page(), folio_test_anon() && folio_test_swapcache() is used to get the swap_cache address space, but that misses the case of a shmem folio in the swap cache. It could lead to NULL pointer dereferencing when an in-swap-cache shmem folio is split at __xa_store(), since !folio_test_anon() is true and folio->mapping is NULL. But fortunately, its caller split_huge_page_to_list_to_order() bails out early with EBUSY when folio->mapping is NULL. So no need to take care of it here. Link: https://lkml.kernel.org/r/20250305200403.2822855-1-ziy@nvidia.com Fixes: fc346d0a70a1 ("mm: migrate high-order folios in swap cache correctly") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Liu Shixin <liushixin2@huawei.com> Closes: https://lore.kernel.org/all/28546fb4-5210-bf75-16d6-43e1f8646080@huawei.com/ Suggested-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
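Conceptually, the fix in __folio_migrate_mapping() boils down to the following (a sketch, not the literal diff):

	/*
	 * The swap cache stores one xarray entry per page, whereas the page
	 * cache uses a single multi-index entry for the whole large folio.
	 */
	if (folio_test_swapcache(folio))
		entries = folio_nr_pages(folio);
	else
		entries = 1;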
2025-03-16mm/hugetlb: fix surplus pages in dissolve_free_huge_page()Jinjiang Tu
In dissolve_free_huge_page(), free huge pages are dissolved without adjusting the surplus count. However, free huge pages may be accounted as surplus pages, which leads to a wrong surplus count. I reproduced this issue on qemu. The steps are:

1) Node1 is memory-less at first. Hot-add memory to node1 by executing the two commands in the qemu monitor:
   object_add memory-backend-ram,id=mem1,size=1G
   device_add pc-dimm,id=dimm1,memdev=mem1,node=1
2) online one memory block of Node1 with:
   echo online_movable > /sys/devices/system/node/node1/memoryX/state
3) create 64 huge pages for node1
4) run a program to reserve (don't consume) all the huge pages
5) echo 0 > nr_huge_pages for node1. After this step, the free huge pages in Node1 are surplus.
6) create 80 huge pages for node0
7) offline the memory of node1. The memory range to offline contains the free surplus huge pages created in steps 3) to 5):
   echo offline > /sys/devices/system/node/node1/memoryX/state
8) kill the program from step 4)

The result:
           Node0    Node1
total         80        0
free          80        0
surplus        0       61

To fix it, adjust the surplus count when destroying huge pages in dissolve_free_hugetlb_folio() if the node has surplus pages. The result with this patch:

           Node0    Node1
total         80        0
free          80        0
surplus        0        0

Link: https://lkml.kernel.org/r/20250304132106.2872754-1-tujinjiang@huawei.com Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Jinjiang Tu <tujinjiang@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
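A hedged sketch of the adjustment described above inside dissolve_free_hugetlb_folio() (not the literal diff):

	/* the dissolved free folio may have been accounted as a surplus page */
	if (h->surplus_huge_pages_node[nid]) {
		h->surplus_huge_pages_node[nid]--;
		h->surplus_huge_pages--;
	}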
2025-03-16mm/damon/core: initialize damos->walk_completed in damon_new_scheme()SeongJae Park
The function for allocating and initializing a 'struct damos' object, damon_new_scheme(), does not initialize the damos->walk_completed field. Only damos_walk_complete() sets the field. Hence the field will eventually be set and used correctly from the second damos_walk() call for the scheme onwards, but the first damos_walk() could mistakenly not walk the regions. Actually, a common usage of DAMOS for taking an access pattern snapshot is installing a monitoring-purpose DAMOS scheme, doing damos_walk() to retrieve the snapshot, and then removing the scheme. The DAMON user-space tool (damo) also gets runtime snapshots this way. Hence the problem can happen continuously in such use cases. Initialize the field properly in the allocation function. Link: https://lkml.kernel.org/r/20250228174450.41472-1-sj@kernel.org Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
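The fix is essentially a one-line initialization in damon_new_scheme() (sketch):

	scheme->walk_completed = false;	/* previously left uninitialized */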
2025-03-16mm/damon: respect core layer filters' allowance decision on ops layerSeongJae Park
Filtering decisions are made in filter evaluation order. Once a decision is made by a filter, filters scheduled to be evaluated after the deciding filter should just respect it. This is the intended and documented behavior. Since core layer-handled filters are evaluated before operations layer-handled filters, decisions made in the core layer should be respected by the ops layer. In the case of reject filters, the decision is respected, since core layer-rejected regions are not passed to the ops layer. But in the case of allow filters, ops layer filters don't know whether a region was passed to them because it was allowed by a core filter or simply because it didn't match any core layer filter. The current, incorrect implementation assumes the latter, so the allow decision is not respected. Pass the missing information to the ops layer using a new field in 'struct damos', and make the ops layer filters respect it. Link: https://lkml.kernel.org/r/20250228175336.42781-1-sj@kernel.org Fixes: 491fee286e56 ("mm/damon/core: support damos_filter->allow") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
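A sketch of the idea, assuming the new field is named core_filters_allowed and using an illustrative matching helper (both names are assumptions, not confirmed from the source):

	/* core layer (sketch): remember that an allow filter explicitly matched */
	if (filter->allow && damos_filter_matches(filter, region))	/* illustrative helper */
		s->core_filters_allowed = true;

	/* ops layer (sketch): respect the core layer's allow decision */
	if (s->core_filters_allowed)
		return false;	/* do not filter this region out */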
2025-03-16filemap: move prefaulting out of hot write pathDave Hansen
There is a generic anti-pattern that shows up in the VFS and several filesystems where the hot write paths touch userspace twice when they could get away with doing it once. Dave Chinner suggested that they should all be fixed up[1]. I agree[2]. But, the series to do that fixup spans a bunch of filesystems and a lot of people. This patch fixes common code that absolutely everyone uses. It has measurable performance benefits[3]. I think this patch can go in and not be held up by the others. I will post them separately to their separate maintainers for consideration. But, honestly, I'm not going to lose any sleep if the maintainers don't pick those up. 1. https://lore.kernel.org/all/Z5f-x278Z3wTIugL@dread.disaster.area/ 2. https://lore.kernel.org/all/20250129181749.C229F6F3@davehans-spike.ostc.intel.com/ 3. https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/ This patch: There is a bit of a sordid history here. I originally wrote 998ef75ddb57 ("fs: do not prefault sys_write() user buffer pages") to fix a performance issue that showed up on early SMAP hardware. But that was reverted with 00a3d660cbac because it exposed an underlying filesystem bug. This is a reimplementation of the original commit along with some simplification and comment improvements. The basic problem is that the generic write path has two userspace accesses: one to prefault the write source buffer and then another to perform the actual write. On x86, this means an extra STAC/CLAC pair. These are relatively expensive instructions because they function as barriers. Keep the prefaulting behavior but move it into the slow path that gets run when the write did not make any progress. This avoids livelocks that can happen when the write's source and destination target the same folio. Contrary to the existing comments, the fault-in does not prevent deadlocks. That's accomplished by using an "atomic" usercopy that disables page faults. The end result is that the generic write fast path now touches userspace once instead of twice. 0day has shown some improvements on a couple of microbenchmarks: https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250228203722.CAEB63AC@davehans-spike.ostc.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/yxyuijjfd6yknryji2q64j3keq2ygw6ca6fs5jwyolklzvo45s@4u63qqqyosy2/ Cc: Ted Ts'o <tytso@mit.edu> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
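A hedged sketch of the reshaped inner loop of generic_perform_write() (simplified; not the literal upstream code):

	/*
	 * Fast path: copy without prefaulting. The atomic copy runs with page
	 * faults disabled, so it cannot deadlock on the locked destination folio.
	 */
	copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);

	/* ... ->write_end() and progress accounting ... */

	if (unlikely(copied == 0)) {
		/*
		 * Slow path: no progress, most likely because the source buffer
		 * was not resident. Prefault it here and retry the copy. Doing the
		 * fault-in only on failure keeps a single userspace access (one
		 * STAC/CLAC pair on x86) in the common case, and retrying rather
		 * than faulting up front avoids the source == destination livelock.
		 */
		if (fault_in_iov_iter_readable(i, bytes) == bytes) {
			status = -EFAULT;
			break;
		}
	}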
2025-03-13Revert "fanotify: disable readahead if we have pre-content watches"Amir Goldstein
This reverts commit fac84846a28c0950d4433118b3dffd44306df62d. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250312073852.2123409-7-amir73il@gmail.com
2025-03-13Revert "mm: don't allow huge faults for files with pre content watches"Amir Goldstein
This reverts commit 20bf82a898b65c129af76deb96a1b415d3098a28. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250312073852.2123409-6-amir73il@gmail.com
2025-03-13Revert "fsnotify: generate pre-content permission event on page fault"Amir Goldstein
This reverts commit 8392bc2ff8c8bf7c4c5e6dfa71ccd893a3c046f6. In the use case of a buffered write whose input buffer is an mmapped file on a filesystem with a pre-content mark, the prefaulting of the buffer can happen under the filesystem freeze protection (obtained in vfs_write()). This breaks the assumptions of the pre-content hook and introduces a potential deadlock between an HSM handler in userspace and filesystem freezing. Now that we have pre-content hooks at file mmap() time, disable the pre-content event hooks on page fault to avoid the potential deadlock. Reported-by: syzbot+7229071b47908b19d5b7@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-fsdevel/7ehxrhbvehlrjwvrduoxsao5k3x4aw275patsb3krkwuq573yv@o2hskrfawbnc/ Fixes: 8392bc2ff8c8 ("fsnotify: generate pre-content permission event on page fault") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250312073852.2123409-5-amir73il@gmail.com
2025-03-12fsnotify: add pre-content hooks on mmap()Amir Goldstein
Pre-content hooks in page faults introduce a potential deadlock between the HSM handler in userspace and filesystem freezing. The requirement for pre-content events is that, for every accessed file range, an event covering at least that range is generated at least once before the file data is accessed. In preparation for disabling pre-content event hooks on page faults, add pre-content hooks to the mmap() variants for the entire mmapped range, so the HSM can fill in content when the user requests to map a portion of the file. Note that the exec() path also calls vm_mmap_pgoff() internally to map code sections, so pre-content hooks are generated in this case as well. Link: https://lore.kernel.org/linux-fsdevel/7ehxrhbvehlrjwvrduoxsao5k3x4aw275patsb3krkwuq573yv@o2hskrfawbnc/ Suggested-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250312073852.2123409-2-amir73il@gmail.com