summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-11-06memcg: workingset: remove folio_memcg_rcu usageShakeel Butt
The function workingset_activation() is called from folio_mark_accessed() with the guarantee that the given folio can not be freed under us in workingset_activation(). In addition, the association of the folio and its memcg can not be broken here because charge migration is no more. There is no need to use folio_memcg_rcu. Simply use folio_memcg_charged() because that is what this function cares about. [akpm@linux-foundation.org: provide folio_memcg_charged stub for CONFIG_MEMCG=n] Link: https://lkml.kernel.org/r/20241026163707.2479526-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Suggested-by: Yu Zhao <yuzhao@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/vma: the pgoff is correct if can_merge_rightWei Yang
By this point can_vma_merge_right() must have returned true, which implies can_vma_merge_before() also returned true, which already asserts that the pgoff is as expected for a merge with the following VMA, thus this assignment is redundant. Below is a more detail explanation. Current definition of can_vma_merge_right() is: static bool can_vma_merge_right(struct vma_merge_struct *vmg, bool can_merge_left) { if (!vmg->next || vmg->end != vmg->next->vm_start || !can_vma_merge_before(vmg)) return false; ... } And: static bool can_vma_merge_before(struct vma_merge_struct *vmg) { pgoff_t pglen = PHYS_PFN(vmg->end - vmg->start); ... if (vmg->next->vm_pgoff == vmg->pgoff + pglen) return true; ... } Which implies vmg->pgoff == vmg->next->vm_pgoff - pglen. None of these values are changed between the check and prior assignment, so this was an entirely redundant assignment. [akpm@linux-foundation.org: remove now-unused local] [lorenzo.stoakes@oracle.com: rephrase the changelog] Link: https://lkml.kernel.org/r/20241024093347.18057-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: defer second attempt at merge on mmap()Lorenzo Stoakes
Rather than trying to merge again when ostensibly allocating a new VMA, instead defer until the VMA is added and attempt to merge the existing range. This way we have no complicated unwinding logic midway through the process of mapping the VMA. In addition this removes limitations on the VMA not being able to be the first in the virtual memory address space which was previously implicitly required. In theory, for this very same reason, we should unconditionally attempt merge here, however this is likely to have a performance impact so it is better to avoid this given the unlikely outcome of a merge. [lorenzo.stoakes@oracle.com: remove unnecessary indirection] Link: https://lkml.kernel.org/r/5106696d-e625-4d8a-8545-9d1430301730@lucifer.local Link: https://lkml.kernel.org/r/d4f84502605d7651ac114587f507395c0fc76004.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: remove unnecessary reset state logic on merge new VMALorenzo Stoakes
The only place where this was used was in mmap_region(), which we have now adjusted to not require this to be performed (we reset ourselves in effect). It also created a dangerous assumption that VMG state could be safely reused after a merge, at which point it may have been mutated in unexpected ways, leading to subtle bugs. Note that it was discovered by Wei Yang that there was also an error in this code - we are comparing vmg->vma with prev after setting it to NULL. This however had no impact, as we previously reset VMA iterator state before attempting merge again, but it was useless effort. In any case, this patch removes all of the logic so also eliminates this wasted effort. Link: https://lkml.kernel.org/r/5d9a59eee6498ae017cc87d89aa723de7179f75d.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: refactor __mmap_region()Lorenzo Stoakes
We have seen bugs and resource leaks arise from the complexity of the __mmap_region() function. This, and the generally deeply fragile error handling logic and complexity which makes understanding the function difficult make it highly desirable to refactor it into something readable. Achieve this by separating the function into smaller logical parts which are easier to understand and follow, and which importantly very significantly simplify the error handling. Note that we now call vms_abort_munmap_vmas() in more error paths than we used to, however in cases where no abort need occur, vms->nr_pages will be equal to zero and we simply exit this function without doing more than we would have done previously. Importantly, the invocation of the driver mmap hook via mmap_file() now has very simple and obvious handling (this was previously the most problematic part of the mmap() operation). Use a generalised stack-based 'mmap state' to thread through values and also retrieve state as needed. Also avoid ever relying on vma merge (vmg) state after a merge is attempted, instead maintain meaningful state in the mmap state and establish vmg state as and when required. This avoids any subtle bugs arising from merge logic mutating this state and mmap_region() logic later relying upon it. Link: https://lkml.kernel.org/r/25bd2edc3275450f448cbfe0756ce2a7cd06810f.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: isolate mmap internal logic to mm/vma.cLorenzo Stoakes
In previous commits we effected improvements to the mmap() logic in mmap_region() and its newly introduced internal implementation function __mmap_region(). However as these changes are intended to be backported, we kept the delta as small as is possible and made as few changes as possible to the newly introduced mm/vma.* files. Take the opportunity to move this logic to mm/vma.c which not only isolates it, but also makes it available for later userland testing which can help us catch such logic errors far earlier. Link: https://lkml.kernel.org/r/93fc2c3aa37dd30590b7e4ee067dfd832007bf7e.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06tools: testing: add additional vma_internal.h stubsLorenzo Stoakes
Patch series "fix error handling in mmap_region() and refactor", v3. The mmap_region() function is somewhat terrifying, with spaghetti-like control flow and numerous means by which issues can arise and incomplete state, memory leaks and other unpleasantness can occur. This series goes to great lengths to simplify how mmap_region() works and to avoid unwinding errors late on in the process of setting up the VMA for the new mapping, and equally avoids such operations occurring while the VMA is in an inconsistent state. This series builds on the previously submitted hotfix patches (see link to v2 below) which addresses the most critical issues around mmap_region(), and further works to improve mmap_region() complexity, stability, and testability. This series moves the code to mm/vma.c to render it userland testable, refactors and simplifies it into smaller functions that are significantly more readable. It additionally avoids performing an attempt at a second merge mid-way through allocating a new VMA, a dubious proposition at best and one that is highly subject to subtle bugs. Rather than do this, we simply note that we ought to retry the merge and do this as a final step. This patch (of 3): Add some additional vma_internal.h stubs in preparation for __mmap_region() being moved to mm/vma.c. Without these the move would result in the tests no longer compiling. Link: https://lkml.kernel.org/r/cover.1729858176.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/74b27e159e261d2ac1fe66a130edad1d61fdc176.1729858176.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06memcg-v1: remove memcg move locking codeShakeel Butt
The memcg v1's charge move feature has been deprecated. All the places using the memcg move lock, have stopped using it as they don't need the protection any more. Let's proceed to remove all the locking code related to charge moving. Link: https://lkml.kernel.org/r/20241025012304.2473312-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06memcg-v1: no need for memcg locking for MGLRUShakeel Butt
While updating the generation of the folios, MGLRU requires that the folio's memcg association remains stable. With the charge migration deprecated, there is no need for MGLRU to acquire locks to keep the folio and memcg association stable. [yuzhao@google.com: remove !rcu_read_lock_held() assertion] Link: https://lkml.kernel.org/r/ZykEtcHrQRq-KrBC@google.com Link: https://syzkaller.appspot.com/bug?extid=24f45b8beab9788e467e Link: https://lore.kernel.org/lkml/67294349.050a0220.701a.0010.GAE@google.com/ [akpm@linux-foundation.org: remove now-unused local] [shakeel.butt@linux.dev: folio_rcu() fixup, per Yu Zhao] Link: https://lkml.kernel.org/r/iwmabnye3nl4merealrawt3bdvfii2pwavwrddrqpraoveet7h@ezrsdhjwwej7 Link: https://lkml.kernel.org/r/20241025012304.2473312-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06memcg-v1: no need for memcg locking for writeback trackingShakeel Butt
During the era of memcg charge migration, the kernel has to be make sure that the writeback stat updates do not race with the charge migration. Otherwise it might update the writeback stats of the wrong memcg. Now with the memcg charge migration gone, there is no more race for writeback stat updates and the previous locking can be removed. Link: https://lkml.kernel.org/r/20241025012304.2473312-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06memcg-v1: no need for memcg locking for dirty trackingShakeel Butt
During the era of memcg charge migration, the kernel has to be make sure that the dirty stat updates do not race with the charge migration. Otherwise it might update the dirty stats of the wrong memcg. Now with the memcg charge migration gone, there is no more race for dirty stat updates and the previous locking can be removed. Link: https://lkml.kernel.org/r/20241025012304.2473312-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06memcg-v1: remove charge move codeShakeel Butt
The memcg-v1 charge move feature has been deprecated completely and let's remove the relevant code as well. Link: https://lkml.kernel.org/r/20241025012304.2473312-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06memcg-v1: fully deprecate move_charge_at_immigrateShakeel Butt
Patch series "memcg-v1: fully deprecate charge moving". The memcg v1's charge moving feature has been deprecated for almost 2 years and the kernel warns if someone try to use it. This warning has been backported to all stable kernel and there have not been any report of the warning or the request to support this feature anymore. Let's proceed to fully deprecate this feature. This patch (of 6): Proceed with the complete deprecation of memcg v1's charge moving feature. The deprecation warning has been in the kernel for almost two years and has been ported to all stable kernel since. Now is the time to fully deprecate this feature. Link: https://lkml.kernel.org/r/20241025012304.2473312-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20241025012304.2473312-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: shmem: fallback to page size splice if large folio has poisoned pagesBaolin Wang
The tmpfs has already supported the PMD-sized large folios, and splice() can not read any pages if the large folio has a poisoned page, which is not good as Matthew pointed out in a previous email[1]: "so if we have hwpoison set on one page in a folio, we now can't read bytes from any page in the folio? That seems like we've made a bad situation worse." Thus add a fallback to the PAGE_SIZE splice() still allows reading normal pages if the large folio has hwpoisoned pages. [1] https://lore.kernel.org/all/Zw_d0EVAJkpNJEbA@casper.infradead.org/ [baolin.wang@linux.alibaba.com: code layout cleaup, per dhowells] Link: https://lkml.kernel.org/r/32dd938c-3531-49f7-93e4-b7ff21fec569@linux.alibaba.com Link: https://lkml.kernel.org/r/e3737fbd5366c4de4337bf5f2044817e77a5235b.1729915173.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/damon/vaddr: add 'nr_piece == 1' check in damon_va_evenly_split_region()Zheng Yejian
As discussed in [1], damon_va_evenly_split_region() is called to size-evenly split a region into 'nr_pieces' small regions, when nr_pieces == 1, no actual split is required. Check that case for better code readability and add a simple kunit testcase. [1] https://lore.kernel.org/all/20241021163316.12443-1-sj@kernel.org/ Link: https://lkml.kernel.org/r/20241022083927.3592237-3-zhengyejian@huaweicloud.com Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Fernand Sieber <sieberf@amazon.com> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Ye Weihua <yeweihua4@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/damon/vaddr: fix issue in damon_va_evenly_split_region()Zheng Yejian
Patch series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()". v2. According to the logic of damon_va_evenly_split_region(), currently following split case would not meet the expectation: Suppose DAMON_MIN_REGION=0x1000, Case: Split [0x0, 0x3000) into 2 pieces, then the result would be acutually 3 regions: [0x0, 0x1000), [0x1000, 0x2000), [0x2000, 0x3000) but NOT the expected 2 regions: [0x0, 0x1000), [0x1000, 0x3000) !!! The root cause is that when calculating size of each split piece in damon_va_evenly_split_region(): `sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, DAMON_MIN_REGION);` both the dividing and the ALIGN_DOWN may cause loss of precision, then each time split one piece of size 'sz_piece' from origin 'start' to 'end' would cause more pieces are split out than expected!!! To fix it, count for each piece split and make sure no more than 'nr_pieces'. In addition, add above case into damon_test_split_evenly(). And add 'nr_piece == 1' check in damon_va_evenly_split_region() for better code readability and add a corresponding kunit testcase. This patch (of 2): According to the logic of damon_va_evenly_split_region(), currently following split case would not meet the expectation: Suppose DAMON_MIN_REGION=0x1000, Case: Split [0x0, 0x3000) into 2 pieces, then the result would be acutually 3 regions: [0x0, 0x1000), [0x1000, 0x2000), [0x2000, 0x3000) but NOT the expected 2 regions: [0x0, 0x1000), [0x1000, 0x3000) !!! The root cause is that when calculating size of each split piece in damon_va_evenly_split_region(): `sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, DAMON_MIN_REGION);` both the dividing and the ALIGN_DOWN may cause loss of precision, then each time split one piece of size 'sz_piece' from origin 'start' to 'end' would cause more pieces are split out than expected!!! To fix it, count for each piece split and make sure no more than 'nr_pieces'. In addition, add above case into damon_test_split_evenly(). After this patch, damon-operations test passed: # ./tools/testing/kunit/kunit.py run damon-operations [...] ============== damon-operations (6 subtests) =============== [PASSED] damon_test_three_regions_in_vmas [PASSED] damon_test_apply_three_regions1 [PASSED] damon_test_apply_three_regions2 [PASSED] damon_test_apply_three_regions3 [PASSED] damon_test_apply_three_regions4 [PASSED] damon_test_split_evenly ================ [PASSED] damon-operations ================= Link: https://lkml.kernel.org/r/20241022083927.3592237-1-zhengyejian@huaweicloud.com Link: https://lkml.kernel.org/r/20241022083927.3592237-2-zhengyejian@huaweicloud.com Fixes: 3f49584b262c ("mm/damon: implement primitives for the virtual memory address spaces") Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Fernand Sieber <sieberf@amazon.com> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Ye Weihua <yeweihua4@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/page_alloc: use str_off_on() helper in build_all_zonelists()Thorsten Blum
Remove hard-coded strings by using the str_off_on() helper function. Link: https://lkml.kernel.org/r/20241021091340.5243-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/memcontrol: fix seq_buf size to save memory when PAGE_SIZE is largeRyan Roberts
Previously the seq_buf used for accumulating the memory.stat output was sized at PAGE_SIZE. But the amount of output is invariant to PAGE_SIZE; If 4K is enough on a 4K page system, then it should also be enough on a 64K page system, so we can save 60K on the static buffer used in mem_cgroup_print_oom_meminfo(). Let's make it so. This also has the beneficial side effect of removing a place in the code that assumed PAGE_SIZE is a compile-time constant. So this helps our quest towards supporting boot-time page size selection. Link: https://lkml.kernel.org/r/20241021130027.3615969-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: add missing mmu_notifier_clear_young for !MMU_NOTIFIERJames Houghton
Remove the now unnecessary ifdef in mm/damon/vaddr.c as well. Link: https://lkml.kernel.org/r/20241021160212.9935-1-jthoughton@google.com Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06tools/mm: free the allocated memoryLiu Jing
The comm_str memory needs to be freed if the search_pattern function call fails in get_comm [akpm@linux-foundation.org: fix whitespace] Link: https://lkml.kernel.org/r/20241022012526.7597-1-liujing@cmss.chinamobile.com Signed-off-by: Liu Jing <liujing@cmss.chinamobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/page-writeback: raise wb_thresh to prevent write blocking with strictlimitJim Zhao
With the strictlimit flag, wb_thresh acts as a hard limit in balance_dirty_pages() and wb_position_ratio(). When device write operations are inactive, wb_thresh can drop to 0, causing writes to be blocked. The issue occasionally occurs in fuse fs, particularly with network backends, the write thread is blocked frequently during a period. To address it, this patch raises the minimum wb_thresh to a controllable level, similar to the non-strictlimit case. Link: https://lkml.kernel.org/r/20241023100032.62952-1-jimzhao.ai@gmail.com Signed-off-by: Jim Zhao <jimzhao.ai@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/memory.c: simplify pfnmap_lockdep_assertManas
Use local `mapping' to reduce the pointer chasing. akpm: extracted from a bugfix which Linus fixed with b1b46751671be ("mm: fix follow_pfnmap API lockdep assert"). Link: https://lkml.kernel.org/r/20241004-fix-null-deref-v4-1-d0a8ec01ac85@iiitd.ac.in Signed-off-by: Manas <manas18244@iiitd.ac.in> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Anup Sharma <anupnewsmail@gmail.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/codetag: fix arg in pgalloc_tag_copy alloc_tag_subSourav Panda
alloc_tag_sub() takes bytes as opposed to number of pages as argument. Currently pgalloc_tag_copy() passes the number of pages. This fix passes the correct unit, which is the number of bytes allocated. Link: https://lkml.kernel.org/r/20241022232440.334820-1-souravpanda@google.com Fixes: e0a955bf7f61 ("mm/codetag: add pgalloc_tag_copy()") Signed-off-by: Sourav Panda <souravpanda@google.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Wei Xu <weixugc@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06maple_tree: fix outdated flag name in commentJann Horn
MAPLE_USE_RCU was renamed to MT_FLAGS_USE_RCU at some point, fix up the comment. Link: https://lkml.kernel.org/r/20241007-maple-tree-doc-fix-v1-1-6bbf89c1153d@google.com Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: shmem: improve the tmpfs large folio read performanceBaolin Wang
tmpfs already supports PMD-sized large folios, but the tmpfs read operation still performs copying at PAGE_SIZE granularity, which is unreasonable. This patch changes tmpfs to copy data at folio granularity, which can improve the read performance, as well as changing to use folio related functions. Moreover, if a large folio has a subpage that is hwpoisoned, it will still fall back to page granularity copying. Use 'fio bs=64k' to read a 1G tmpfs file populated with 2M THPs, and I can see about 20% performance improvement, and no regression with bs=4k. Before the patch: READ: bw=10.0GiB/s After the patch: READ: bw=12.0GiB/s Link: https://lkml.kernel.org/r/2129a21a5b9f77d3bb7ddec152c009ce7c5653c4.1729218573.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: shmem: update iocb->ki_pos directly to simplify tmpfs read logicBaolin Wang
Patch series "Improve the tmpfs large folio read performance", v2. tmpfs already supports PMD-sized large folios, but the tmpfs read operation still performs copying at PAGE_SIZE granularity, which is not perfect. This patchset changes tmpfs to copy data at the folio granularity, which can improve the read performance. Use 'fio bs=64k' to read a 1G tmpfs file populated with 2M THPs, and I can see about 20% performance improvement, and no regression with bs=4k. I also did some functional testing with the xfstests suite, and I did not find any regressions with the following xfstests config: FSTYP=tmpfs export TEST_DIR=/mnt/tempfs_mnt export TEST_DEV=/mnt/tempfs_mnt export SCRATCH_MNT=/mnt/scratchdir export SCRATCH_DEV=/mnt/scratchdir This patch (of 2): Using iocb->ki_pos to check if the read bytes exceeds the file size and to calculate the bytes to be read can help simplify the code logic. Meanwhile, this is also a preparation for improving tmpfs large folios read performance in the following patch. Link: https://lkml.kernel.org/r/cover.1729218573.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/e8863e289577e0dc1e365b5419bf2d1c9a24ae3d.1729218573.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: remove unused has_isolate_pageblockLuoxi Li
has_isolate_pageblock() has been unused since commit 55612e80e722 ("mm: page_alloc: close migratetype race between freeing and stealing") Remove it. Link: https://lkml.kernel.org/r/20241018092235.2764859-1-kaixa@kiloview.com Signed-off-by: Luoxi Li <kaixa@kiloview.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: remove redundant condition for THP folioDev Jain
folio_test_pmd_mappable() implies folio_test_large(), therefore, simplify the expression for is_thp. Link: https://lkml.kernel.org/r/20241018094151.3458-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/mremap: remove goto from mremap_to()Liam R. Howlett
mremap_to() has a goto label at the end that doesn't unwind anything. Removing the label makes the code cleaner. This commit also adds documentation to the function. Link: https://lkml.kernel.org/r/20241018174114.2871880-3-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Pedro Falcato <pedro.falcato@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/mremap: cleanup vma_to_resize()Liam R. Howlett
Patch series "mm/mremap: Remove extra vma tree walk", v2. An extra vma tree walk was discovered in some mremap call paths during the discussion on mseal() changes. This patch set removes the extra vma tree walk and further cleans up mremap_to(). This patch (of 2): vma_to_resize() is used in two locations to find and validate the vma for the mremap location. One of the two locations already has the vma, which is then re-found to validate the same vma. This code can be simplified by moving the vma_lookup() from vma_to_resize() to mremap_to() and changing the return type to an int error. Since the function now just validates the vma, the function is renamed to resize_is_valid() to better reflect what it is doing. This commit also adds documentation about the function. Link: https://lkml.kernel.org/r/20241018174114.2871880-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20241018174114.2871880-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Pedro Falcato <pedro.falcato@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06maple_tree: remove sanity check from mas_wr_slot_store()Wei Yang
After commit 5d659bbb52a2 ("maple_tree: introduce mas_wr_store_type()"), the check here is redundant. Let's remove it. Link: https://lkml.kernel.org/r/20241017015809.23392-3-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06maple_tree: calculate new_end when neededWei Yang
Patch series "Following cleanup after introduce mas_wr_store_type()", v2. Patch 1 postpone new_end calculation when needed. Patch 2 removes a unnecessary sanity check in mas_wr_slot_store(). This patch (of 2): For wr_exact_fit/wr_new_root, we don't need to calculate new_end. Let's postpone it until necessary. Link: https://lkml.kernel.org/r/20241017015809.23392-1-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20241017015809.23392-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: don't set readahead flag on a folio when lookahead_size > nr_to_readPankaj Raghav
The readahead flag is set on a folio based on the lookahead_size and nr_to_read. For example, when the readahead happens from index to index + nr_to_read, then the readahead `mark` offset from index is set at nr_to_read - lookahead_size. There are some scenarios where the lookahead_size > nr_to_read. For example, readahead window was created, but the file was truncated before the readahead starts. do_page_cache_ra() will clamp the nr_to_read if the readahead window extends beyond EOF after truncation. If this happens, readahead flag should not be set on any folio on the current readahead window. The current calculation for `mark` with mapping_min_order > 0 gives incorrect results when lookahead_size > nr_to_read due to rounding up operation: index = 128 nr_to_read = 16 lookahead_size = 28 mapping_min_order = 4 (16 pages) ra_folio_index = round_up(128 + 16 - 28, 16) = 128; mark = 128 - 128 = 0; # offset from index to set RA flag In the above example, the lookahead_size is actually lying outside the current readahead window. Without this patch, RA flag will be set incorrectly on the folio at index 128. This can lead to marking the readahead flag on the wrong folio, therefore, triggering a readahead when it is not necessary. Explicitly initialize `mark` to be ULONG_MAX and only calculate it when lookahead_size is within the readahead window. Link: https://lkml.kernel.org/r/20241017062342.478973-1-kernel@pankajraghav.com Fixes: 26cfdb395eef ("readahead: allocate folios with mapping_min_order in readahead") Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: shmem: remove __shmem_huge_global_enabled()Kefeng Wang
Remove __shmem_huge_global_enabled() since it as only one caller, and remove repeated check of VM_NOHUGEPAGE/MMF_DISABLE_THP as they are checked in shmem_allowable_huge_orders(), also remove unnecessary vma parameter. Link: https://lkml.kernel.org/r/20241017141457.1169092-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: huge_memory: move file_thp_enabled() into huge_memory.cKefeng Wang
file_thp_enabled() is only used in __thp_vma_allowable_orders(), so move it into huge_memory.c, also check READ_ONLY_THP_FOR_FS ahead to avoid unnecessary code if config disabled. Link: https://lkml.kernel.org/r/20241017141457.1169092-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06tmpfs: don't enable large folios if not supportedKefeng Wang
tmpfs can support large folios, but there are some configurable options (mount options and runtime deny/force) to enable/disable large folio allocation, so there is a performance issue when performing writes without large folios. The issue is similar to commit 4e527d5841e2 ("iomap: fault in smaller chunks for non-large folio mappings"). Since 'deny' is for emergencies and 'force' is for testing, performance issues should not be a problem in real production environments, so don't call mapping_set_large_folios() in __shmem_get_inode() when large folio is disabled with mount huge=never option (default policy). Link: https://lkml.kernel.org/r/20241017141742.1169404-1-wangkefeng.wang@huawei.com Fixes: 9aac777aaf94 ("filemap: Convert generic_perform_write() to support large folios") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06tools: testing: fix phys_addr_t size on 64-bit systemsLorenzo Stoakes
The phys_addr_t size is predicated on whether CONFIG_PHYS_ADDR_T_64BIT is set or not. In the VMA tests, virt_to_phys() from tools/include/linux casts a volatile void * pointer to phys_addr_t, if CONFIG_PHYS_ADDR_T_64BIT is not set, this will be 32-bit and trigger a warning. Obviously this might also lead to truncation, which we would rather avoid. Fix this by adjusting the generation of generated/bit-length.h to generate a CONFIG_PHYS_ADDR_T{bits}BIT define. This does result in the generation of the useless CONFIG_PHYS_ADDR_T_32BIT define for 32-bit systems, but this should have no effect, and makes implementation of this easier. This resolves the issue and the warning. [lorenzo.stoakes@oracle.com: VMA tests not properly importing bit-length.h] Link: https://lkml.kernel.org/r/a6183df9-3108-4d59-8128-4fc6c14e22a5@lucifer.local Link: https://lkml.kernel.org/r/20241017165638.95602-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/mglru: reset page lru tier bits when activatingWei Xu
When a folio is activated, lru_gen_add_folio() moves the folio to the youngest generation. But unlike folio_update_gen()/folio_inc_gen(), lru_gen_add_folio() doesn't reset the folio lru tier bits (LRU_REFS_MASK | LRU_REFS_FLAGS). This inconsistency can affect how pages are aged via folio_mark_accessed() (e.g. fd accesses), though no user visible impact related to this has been detected yet. Note that lru_gen_add_folio() cannot clear PG_workingset if the activation is due to workingset refault, otherwise PSI accounting will be skipped. So fix lru_gen_add_folio() to clear the lru tier bits other than PG_workingset when activating a folio, and also clear all the lru tier bits when a folio is activated via folio_activate() in lru_gen_look_around(). Link: https://lkml.kernel.org/r/20241017181528.3358821-1-weixugc@google.com Fixes: 018ee47f1489 ("mm: multi-gen LRU: exploit locality in rmap") Signed-off-by: Wei Xu <weixugc@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens <heftig@archlinux.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm: swap: use str_true_false() helper functionThorsten Blum
Remove hard-coded strings by using the helper function str_true_false(). Link: https://lkml.kernel.org/r/20241016141040.79168-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06percpu: add a test case for the specific 64-bit value additionAndy Shevchenko
It might be a corner case when we add UINT_MAX as 64-bit unsigned value to the percpu variable as it's not the same as -1 (ULONG_LONG_MAX). Add a test case for that. Link: https://lkml.kernel.org/r/20241016182635.1156168-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06x86/percpu: fix clang warning when dealing with unsigned typesAndy Shevchenko
Patch series "percpu: Add a test case and fix for clang", v2. Add a test case to percpu to check a corner case with the specific 64-bit unsigned value. This test case shows why the first patch is done in the way it's done. The before and after has been tested with binary comparison of the percpu_test module and runnig it on the real Intel system. This patch (of 2): When percpu_add_op() is used with an unsigned argument, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: net/ipv4/tcp_output.c:187:3: error: result of comparison of constant -1 with expression of type 'u8' (aka 'unsigned char') is always false [-Werror,-Wtautological-constant-out-of-range-compare] 187 | NET_ADD_STATS(sock_net(sk), LINUX_MIB_TCPACKCOMPRESSED, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 188 | tp->compressed_ack); | ~~~~~~~~~~~~~~~~~~~ ... arch/x86/include/asm/percpu.h:238:31: note: expanded from macro 'percpu_add_op' 238 | ((val) == 1 || (val) == -1)) ? \ | ~~~~~ ^ ~~ Fix this by casting -1 to the type of the parameter and then compare. Link: https://lkml.kernel.org/r/20241016182635.1156168-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20241016182635.1156168-2-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm, kasan, kmsan: instrument copy_from/to_kernel_nofaultSabyrzhan Tasbolatov
Instrument copy_from_kernel_nofault() with KMSAN for uninitialized kernel memory check and copy_to_kernel_nofault() with KASAN, KCSAN to detect the memory corruption. syzbot reported that bpf_probe_read_kernel() kernel helper triggered KASAN report via kasan_check_range() which is not the expected behaviour as copy_from_kernel_nofault() is meant to be a non-faulting helper. Solution is, suggested by Marco Elver, to replace KASAN, KCSAN check in copy_from_kernel_nofault() with KMSAN detection of copying uninitilaized kernel memory. In copy_to_kernel_nofault() we can retain instrument_write() explicitly for the memory corruption instrumentation. copy_to_kernel_nofault() is tested on x86_64 and arm64 with CONFIG_KASAN_SW_TAGS. On arm64 with CONFIG_KASAN_HW_TAGS, kunit test currently fails. Need more clarification on it. [akpm@linux-foundation.org: fix comment layout, per checkpatch Link: https://lore.kernel.org/linux-mm/CANpmjNMAVFzqnCZhEity9cjiqQ9CVN1X7qeeeAp_6yKjwKo8iw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20241011035310.2982017-1-snovitoll@gmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Reported-by: syzbot+61123a5daeb9f7454599@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=61123a5daeb9f7454599 Reported-by: Andrey Konovalov <andreyknvl@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=210505 Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> [KASAN] Tested-by: Andrey Konovalov <andreyknvl@gmail.com> [KASAN] Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06maple_tree: simplify mas_push_node()Wei Yang
When count is not 0, we know head is valid. So we can put the assignment in if (count) instead of checking the head pointer again. Also count represents current total, we can assign the new total by increasing the count by one. Link: https://lkml.kernel.org/r/20241015120746.15850-4-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06maple_tree: total is not changed for nomem_one caseWei Yang
If it jumps to nomem_one, the total allocated number is not changed. So we don't need to adjust it. For the nomem_bulk case, we know there is a valid mas->alloc. So we don't need to do the check. Link: https://lkml.kernel.org/r/20241015120746.15850-3-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06maple_tree: clear request_count for new allocated oneWei Yang
Patch series "maple_tree: simplify mas_push_node()", v2. When count is not 0, we know head is valid. So we can put the assignment in if (count) instead of checking the head pointer again. Also count represents current total, we can assign the new total by increasing the count by one. This patch (of 3): If this is not a new allocated one, the request_count has already been cleared in mas_set_alloc_req(). Link: https://lkml.kernel.org/r/20241015120746.15850-1-richard.weiyang@gmail.com Link: https://lkml.kernel.org/r/20241015120746.15850-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06maple_tree: root node could be handled by !p_slot tooWei Yang
For a root node, mte_parent_slot() return 0, this exactly fits the following !p_slot check. So we can remove the special handling for root node. Link: https://lkml.kernel.org/r/20240913063128.27391-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06maple_tree: add some alloc node test caseJiazi Li
Add some maple_tree alloc node tese case. Link: https://lkml.kernel.org/r/20240626160631.3636515-2-Liam.Howlett@oracle.com Signed-off-by: Jiazi Li <jqqlijiazi@gmail.com> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06maple_tree: fix alloc node fail issueJiazi Li
In the following code, the second call to the mas_node_count will return -ENOMEM: mas_node_count(mas, MAPLE_ALLOC_SLOTS + 1); mas_node_count(mas, MAPLE_ALLOC_SLOTS * 2 + 2); This is because there may be some full maple_alloc node in current maple state. Use full maple_alloc node will make max_req equal to 0. And it leads to mt_alloc_bulk return 0. As a result, mas_node_count set mas.node to MA_ERROR(-ENOMEM). Find a non-full maple_alloc node, and if necessary, use this non-full node in the next while loop. Link: https://lkml.kernel.org/r/20240626160631.3636515-1-Liam.Howlett@oracle.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Jiazi Li <jqqlijiazi@gmail.com> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06mm/vmstat: defer the refresh_zone_stat_thresholds after all CPUs bringupSaurabh Sengar
refresh_zone_stat_thresholds function has two loops which is expensive for higher number of CPUs and NUMA nodes. Below is the rough estimation of total iterations done by these loops based on number of NUMA and CPUs. Total number of iterations: nCPU * 2 * Numa * mCPU Where: nCPU = total number of CPUs Numa = total number of NUMA nodes mCPU = mean value of total CPUs (e.g., 512 for 1024 total CPUs) For the system under test with 16 NUMA nodes and 1024 CPUs, this results in a substantial increase in the number of loop iterations during boot-up when NUMA is enabled: No NUMA = 1024*2*1*512 = 1,048,576 : Here refresh_zone_stat_thresholds takes around 224 ms total for all the CPUs in the system under test. 16 NUMA = 1024*2*16*512 = 16,777,216 : Here refresh_zone_stat_thresholds takes around 4.5 seconds total for all the CPUs in the system under test. Calling this for each CPU is expensive when there are large number of CPUs along with multiple NUMAs. Fix this by deferring refresh_zone_stat_thresholds to be called later at once when all the secondary CPUs are up. Also, register the DYN hooks to keep the existing hotplug functionality intact. Link: https://lkml.kernel.org/r/1723443220-20623-1-git-send-email-ssengar@linux.microsoft.com Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Srivatsa S. Bhat (Microsoft) <srivatsa@csail.mit.edu> Cc: Saurabh Singh Sengar <ssengar@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06vmscan: add a vmscan event for reclaim_pagesJaewon Kim
reclaim_folio_list uses a dummy reclaim_stat and is not being used. To know the memory stat, add a new trace event. This is useful how how many pages are not reclaimed or why. This is an example: mm_vmscan_reclaim_pages: nid=0 nr_scanned=112 nr_reclaimed=112 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 nr_activate_file=0 nr_ref_keep=0 nr_unmap_fail=0 Currently reclaim_folio_list is only called by reclaim_pages, and reclaim_pages is used by damon and madvise. In the latest Android, reclaim_pages is also used by shmem to reclaim all pages in a address_space. [jaewon31.kim@samsung.com: use sc.nr_scanned rather than new counting] Link: https://lkml.kernel.org/r/20241016143227.961162-1-jaewon31.kim@samsung.com Link: https://lkml.kernel.org/r/20241011124928.1224813-1-jaewon31.kim@samsung.com Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jaewon Kim <jaewon31.kim@samsung.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>