summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-01-25mm/hugetlb: clean up map/global resv accounting when allocatePeter Xu
alloc_hugetlb_folio() isn't a function easy to read, especially on reservation accountings for either VMA or globally (majorly, spool only). The 1st complexity lies in the special private CoW path, aka, cow_from_owner=true case. The 2nd complexity may be the confusing updates of gbl_chg after it's set once, which looks like they can change anytime on the fly. Logically, cow_from_user is only about vma reservation. We could already decouple the flag and consolidate it into map charge flag very early. Then we don't need to keep checking the CoW special flag every time. This patch does it by making map_chg a tri-state flag. Tri-state needed is unfortunate, and it's because currently vma_needs_reservation() has a side effect internally, that it must be followed by either a end() or commit(). We keep the same semantic as before on one thing: "if (map_chg)" means we need a separate per-vma resv count. It keeps most of the old code like before untouched with the new enum. After this patch, we take these steps to decide these variables, hopefully slightly easier to follow: - First, decide map_chg. This will take cow_from_owner into account, once and for all. It's about whether we could take a resv count from the vma, no matter it's shared, private, etc. - Then, decide gbl_chg. The only diff here is spool, comparing to map_chg. Now only update each flag once and for all, instead of keep any of them flipping which can be very hard to follow. With cow_from_owner merged into map_chg, we could remove quite a few such checks all over. Side benefit of such is that we can get rid of one more confusing flag, which is deferred_reserve. Cleanup the comments a bit too. E.g., MAP_NORESERVE may not need to check against spool limit, AFAIU, if it's on a shared mapping, and if the page cache folio has its inode's resv map available (in which case map_chg would have been set zero, hence the code should be correct, not the comment). There's one trivial detail that needs attention that this patch touched, which is this check right after vma_commit_reservation(): if (map_chg > map_commit) It changes to: if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) It should behave the same like before, because previously the only way to make "map_chg > map_commit" happen is map_chg=1 && map_commit=0. That's exactly the rewritten line. Meanwhile, either commit() or end() will need to be skipped if ENFORCE, to keep the old behavior. Even though it looks a lot changed, but no functional change expected. Link: https://lkml.kernel.org/r/20250107204002.2683356-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: rename avoid_reserve to cow_from_ownerPeter Xu
The old name "avoid_reserve" can be too generic and can be used wrongly in the new call sites that want to allocate a hugetlb folio. It's confusing on two things: (1) whether one can opt-in to avoid global reservation, and (2) whether it should take more than one count. In reality, this flag is only used in an extremely hacky path, in an extremely hacky way in hugetlb CoW path only, and always use with 1 saying "skip global reservation". Rename the flag to avoid future abuse of this flag, making it a boolean so as to reflect its true representation that it's not a counter. To make it even harder to abuse, add a comment above the function to explain it. Link: https://lkml.kernel.org/r/20250107204002.2683356-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: stop using avoid_reserve flag in fork()Peter Xu
When fork() and stumble on top of a dma-pinned hugetlb private page, CoW must happen during fork() to guarantee dma coherency. In this specific path, hugetlb pages need to be allocated for the child process. Stop using avoid_reserve=1 flag here: it's not required to be used here, as dest_vma (which is destined to be a MAP_PRIVATE hugetlb vma) will have no private vma resv map, and that will make sure it won't be able to use a vma reservation later. No functional change intended with this change. Said that, it's still wanted to do this, so as to reduce the usage of avoid_reserve to the only one user, which is also why this flag was introduced initially in commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"). I don't see whoever else should set it at all. Further patch will clean up resv accounting based on this. Link: https://lkml.kernel.org/r/20250107204002.2683356-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/hugetlb: fix avoid_reserve to allow taking folio from subpoolPeter Xu
Patch series "mm/hugetlb: Refactor hugetlb allocation resv accounting", v2. This is a follow up on Ackerley's series here as replacement: https://lore.kernel.org/r/cover.1728684491.git.ackerleytng@google.com The goal of this series is to cleanup hugetlb resv accounting, especially during folio allocation, to decouple a few things: - Hugetlb folios v.s. Hugetlbfs: IOW, the hope is in the future hugetlb folios can be allocated completely without hugetlbfs. - Decouple VMA v.s. hugetlb folio allocations: allocating a hugetlb folio should not always require a hugetlbfs VMA. For example, either it got allocated from the inode level (see hugetlbfs_fallocate() where it used a pesudo VMA for allocation), or it can be allocated by other kernel subsystems. It paves way for other users to allocate hugetlb folios out of either system reservations, or subpools (instead of hugetlbfs, as a file system). For longer term, this prepares hugetlb as a separate concept versus hugetlbfs, so that hugetlb folios can be allocated by not only hugetlbfs and other things. Tests I've done: - I had a reproducer in patch 1 for the bug I found, this will start to work after patch 1 or the whole set applied. - Hugetlb regression tests (on x86_64 2MBs), includes: - All vmtests on hugetlbfs - libhugetlbfs test suite (which may fail some tests, but no new failures will be introduced by this series, so all such failures happen before this series so shouldn't be relevant). This patch (of 7): Since commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"), avoid_reserve was introduced for a special case of CoW on hugetlb private mappings, and only if the owner VMA is trying to allocate yet another hugetlb folio that is not reserved within the private vma reserved map. Later on, in commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate"), alloc_huge_page() enforced to not consume any global reservation as long as avoid_reserve=true. This operation doesn't look correct, because even if it will enforce the allocation to not use global reservation at all, it will still try to take one reservation from the spool (if the subpool existed). Then since the spool reserved pages take from global reservation, it'll also take one reservation globally. Logically it can cause global reservation to go wrong. I wrote a reproducer below, trigger this special path, and every run of such program will cause global reservation count to increment by one, until it hits the number of free pages: #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <stdio.h> #include <fcntl.h> #include <errno.h> #include <unistd.h> #include <stdlib.h> #include <sys/mman.h> #define MSIZE (2UL << 20) int main(int argc, char *argv[]) { const char *path; int *buf; int fd, ret; pid_t child; if (argc < 2) { printf("usage: %s <hugetlb_file>\n", argv[0]); return -1; } path = argv[1]; fd = open(path, O_RDWR | O_CREAT, 0666); if (fd < 0) { perror("open failed"); return -1; } ret = fallocate(fd, 0, 0, MSIZE); if (ret != 0) { perror("fallocate"); return -1; } buf = mmap(NULL, MSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0); if (buf == MAP_FAILED) { perror("mmap() failed"); return -1; } /* Allocate a page */ *buf = 1; child = fork(); if (child == 0) { /* child doesn't need to do anything */ exit(0); } /* Trigger CoW from owner */ *buf = 2; munmap(buf, MSIZE); close(fd); unlink(path); return 0; } It can only reproduce with a sub-mount when there're reserved pages on the spool, like: # sysctl vm.nr_hugepages=128 # mkdir ./hugetlb-pool # mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool Then run the reproducer on the mountpoint: # ./reproducer ./hugetlb-pool/test Fix it by taking the reservation from spool if available. In general, avoid_reserve is IMHO more about "avoid vma resv map", not spool's. I copied stable, however I have no intention for backporting if it's not a clean cherry-pick, because private hugetlb mapping, and then fork() on top is too rare to hit. Link: https://lkml.kernel.org/r/20250107204002.2683356-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20250107204002.2683356-2-peterx@redhat.com Fixes: d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate") Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: shmem: skip swapcache for swapin of synchronous swap deviceBaolin Wang
With fast swap devices (such as zram), swapin latency is crucial to applications. For shmem swapin, similar to anonymous memory swapin, we can skip the swapcache operation to improve swapin latency. Testing 1G shmem sequential swapin without THP enabled, I observed approximately a 6% performance improvement: (Note: I repeated 5 times and took the mean data for each test) w/o patch w/ patch changes 534.8ms 501ms +6.3% In addition, currently, we always split the large swap entry stored in the shmem mapping during shmem large folio swapin, which is not perfect, especially with a fast swap device. We should swap in the whole large folio instead of splitting the precious large folios to take advantage of the large folios and improve the swapin latency if the swap device is synchronous device, which is similar to anonymous memory mTHP swapin. Testing 1G shmem sequential swapin with 64K mTHP and 2M mTHP, I observed obvious performance improvement: mTHP=64K w/o patch w/ patch changes 550.4ms 169.6ms +69% mTHP=2M w/o patch w/ patch changes 542.8ms 126.8ms +77% Note that skipping swapcache requires attention to concurrent swapin scenarios. Fortunately the swapcache_prepare() and shmem_add_to_page_cache() can help identify concurrent swapin and large swap entry split scenarios, and return -EEXIST for retry. [akpm@linux-foundation.org: use IS_ENABLED(), tweak comment grammar] Link: https://lkml.kernel.org/r/3d9f3bd3bc6ec953054baff5134f66feeaae7c1e.1736301701.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/memmap: prevent double scanning of memmap by kmemleakGuo Weikang
kmemleak explicitly scans the mem_map through the valid struct page objects. However, memmap_alloc() was also adding this memory to the gray object list, causing it to be scanned twice. Remove memmap_alloc() from the scan list and add a comment to clarify the behavior. Link: https://lore.kernel.org/lkml/CAOm6qn=FVeTpH54wGDFMHuCOeYtvoTx30ktnv9-w3Nh8RMofEA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20250106021126.1678334-1-guoweikang.kernel@gmail.com Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/fake-numa: allow later numa node hotplugBruno Faccini
Current fake-numa implementation prevents new Numa nodes to be later hot-plugged by drivers. A common symptom of this limitation is the "node <X> was absent from the node_possible_map" message by associated warning in mm/memory_hotplug.c: add_memory_resource(). This comes from the lack of remapping in both pxm_to_node_map[] and node_to_pxm_map[] tables to take fake-numa nodes into account and thus triggers collisions with original and physical nodes only-mapping that had been determined from BIOS tables. This patch fixes this by doing the necessary node-ids translation in both pxm_to_node_map[]/node_to_pxm_map[] tables. node_distance[] table has also been fixed accordingly. Details: When trying to use fake-numa feature on our system where new Numa nodes are being "hot-plugged" upon driver load, this fails with the following type of message and warning with stack : node 8 was absent from the node_possible_map WARNING: CPU: 61 PID: 4259 at mm/memory_hotplug.c:1506 add_memory_resource+0x3dc/0x418 This issue prevents the use of the fake-NUMA debug feature with the system's full configuration, when it has proven to be sometimes extremely useful for performance testing of multi-tasked, memory-bound applications, as it enables better isolation of processes/ranks compared to fat NUMA nodes. Usual numactl output after driver has “hot-plugged”/unveiled some new Numa nodes with and without memory : $ numactl --hardware available: 9 nodes (0-8) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 0 size: 490037 MB node 0 free: 484432 MB node 1 cpus: node 1 size: 97280 MB node 1 free: 97279 MB node 2 cpus: node 2 size: 0 MB node 2 free: 0 MB node 3 cpus: node 3 size: 0 MB node 3 free: 0 MB node 4 cpus: node 4 size: 0 MB node 4 free: 0 MB node 5 cpus: node 5 size: 0 MB node 5 free: 0 MB node 6 cpus: node 6 size: 0 MB node 6 free: 0 MB node 7 cpus: node 7 size: 0 MB node 7 free: 0 MB node 8 cpus: node 8 size: 0 MB node 8 free: 0 MB node distances: node 0 1 2 3 4 5 6 7 8 0: 10 80 80 80 80 80 80 80 80 1: 80 10 255 255 255 255 255 255 255 2: 80 255 10 255 255 255 255 255 255 3: 80 255 255 10 255 255 255 255 255 4: 80 255 255 255 10 255 255 255 255 5: 80 255 255 255 255 10 255 255 255 6: 80 255 255 255 255 255 10 255 255 7: 80 255 255 255 255 255 255 10 255 8: 80 255 255 255 255 255 255 255 10 With recent M.Rapoport set of fake-numa patches in mm-everything and using numa=fake=4 boot parameter : $ numactl --hardware available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 0 size: 122518 MB node 0 free: 117141 MB node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 1 size: 219911 MB node 1 free: 219751 MB node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 2 size: 122599 MB node 2 free: 122541 MB node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 3 size: 122479 MB node 3 free: 122408 MB node distances: node 0 1 2 3 0: 10 10 10 10 1: 10 10 10 10 2: 10 10 10 10 3: 10 10 10 10 With recent M.Rapoport set of fake-numa patches in mm-everything, this patch on top, using numa=fake=4 boot parameter : # numactl —hardware available: 12 nodes (0-11) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 0 size: 122518 MB node 0 free: 116429 MB node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 1 size: 122631 MB node 1 free: 122576 MB node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 2 size: 122599 MB node 2 free: 122544 MB node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 3 size: 122479 MB node 3 free: 122419 MB node 4 cpus: node 4 size: 97280 MB node 4 free: 97279 MB node 5 cpus: node 5 size: 0 MB node 5 free: 0 MB node 6 cpus: node 6 size: 0 MB node 6 free: 0 MB node 7 cpus: node 7 size: 0 MB node 7 free: 0 MB node 8 cpus: node 8 size: 0 MB node 8 free: 0 MB node 9 cpus: node 9 size: 0 MB node 9 free: 0 MB node 10 cpus: node 10 size: 0 MB node 10 free: 0 MB node 11 cpus: node 11 size: 0 MB node 11 free: 0 MB node distances: node 0 1 2 3 4 5 6 7 8 9 10 11 0: 10 10 10 10 80 80 80 80 80 80 80 80 1: 10 10 10 10 80 80 80 80 80 80 80 80 2: 10 10 10 10 80 80 80 80 80 80 80 80 3: 10 10 10 10 80 80 80 80 80 80 80 80 4: 80 80 80 80 10 255 255 255 255 255 255 255 5: 80 80 80 80 255 10 255 255 255 255 255 255 6: 80 80 80 80 255 255 10 255 255 255 255 255 7: 80 80 80 80 255 255 255 10 255 255 255 255 8: 80 80 80 80 255 255 255 255 10 255 255 255 9: 80 80 80 80 255 255 255 255 255 10 255 255 10: 80 80 80 80 255 255 255 255 255 255 10 255 11: 80 80 80 80 255 255 255 255 255 255 255 10 Link: https://lkml.kernel.org/r/20250106120659.359610-2-bfaccini@nvidia.com Signed-off-by: Bruno Faccini <bfaccini@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon: remove DAMON debugfs interfaceSeongJae Park
It's time to remove DAMON debugfs interface, which has deprecated long before in February 2023. Read the cover letter of this patch series for more details. All documents and related tests are also removed. Finally remove the interface. Link: https://lkml.kernel.org/r/20250106191941.107070-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon: remove DAMON debugfs interface kunit testsSeongJae Park
It's time to remove DAMON debugfs interface, which has deprecated long before in February 2023. Read the cover letter of this patch series for more details. Remove kunit tests for the interface, to prevent unnecessary test failures. Link: https://lkml.kernel.org/r/20250106191941.107070-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25kunit: configs: remove configs for DAMON debugfs interface testsSeongJae Park
It's time to remove DAMON debugfs interface, which has deprecated long before in February 2023. Read the cover letter of this patch series for more details. Remove kernel configs for running DAMON debugfs interface kunit tests from the kunit all_tests configuration, to prevent unnecessary noises from tests. Link: https://lkml.kernel.org/r/20250106191941.107070-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25selftests/damon: remove tests for DAMON debugfs interfaceSeongJae Park
It's time to remove DAMON debugfs interface, which has deprecated long before in February 2023. Read the cover letter of this patch series for more details. Remove selftests for the interface, to prevent causing unnecessary test failures. Link: https://lkml.kernel.org/r/20250106191941.107070-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25selftests/damon/config: remove configs for DAMON debugfs interface selftestsSeongJae Park
It's time to remove DAMON debugfs interface, which has deprecated long before in February 2023. Read the cover letter of this patch series for more details. Remove configs for selftests of it from DAMON selftests config file, to prevent unnecessary noises from the tests. [1] https://lore.kernel.org/20230209192009.7885-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250106191941.107070-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/mm/damon/design: update for removal of DAMON debugfs interfaceSeongJae Park
It's time to remove DAMON debugfs interface, which has deprecated long before in February 2023. Read the cover letter of this patch series for more details. Update DAMON design documentation to stop mentioning about the interface, to avoid unnecessary confuses. Link: https://lkml.kernel.org/r/20250106191941.107070-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/admin-guide/mm/damon/usage: remove DAMON debugfs interface documentationSeongJae Park
It's time to remove DAMON debugfs interface, which has deprecated long before in February 2023. Read the cover letter of this patch series for more details. Remove DAMON debugfs interface usage documentation, to avoid confusing users with documents for an already removed thing. Link: https://lkml.kernel.org/r/20250106191941.107070-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/translations/*/admin-guide/mm/damon/usage: remove DAMON debugfs ↵SeongJae Park
interface documentation Patch series "mm/damon: remove DAMON debugfs interface". DAMON debugfs interface was the only user interface of DAMON at the beginning[1]. However, it turned out the interface would be not good enough for long-term flexibility and stability. In Feb 2022[2], we therefore introduced DAMON sysfs interface as an alternative user interface that aims long-term flexibility and stability. With its introduction, DAMON debugfs interface has announced to be deprecated in near future. In Feb 2023[3], we announced the official deprecation of DAMON debugfs interface. In Jan 2024[4], we further made the deprecation difficult to be ignored. In Oct 2024[5], we posted an RFC version of this patch series as the last notice. And as of this writing, no problem or concerns about the removal plan have reported. Apparently users are already moved to the alternative, or made good plans for the change. Remove the DAMON debugfs interface code from the tree. Given the past timeline and the absence of reported problems or concerns, it is safe enough to be done. [1] https://lore.kernel.org/20210716081449.22187-1-sj38.park@gmail.com [2] https://lore.kernel.org/20220228081314.5770-1-sj@kernel.org [3] https://lore.kernel.org/20230209192009.7885-1-sj@kernel.org [4] https://lore.kernel.org/20240130013549.89538-1-sj@kernel.org [5] https://lore.kernel.org/20241015175412.60563-1-sj@kernel.org This patch (of 8): It's time to remove DAMON debugfs interface, which has deprecated long before in February 2023. Read the cover letter of this patch sereis for more details. Remove DAMON debugfs interface usage documentation and references to it from translations, to avoid confusing users with documents for already removed things. Link: https://lkml.kernel.org/r/20250106191941.107070-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250106191941.107070-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rae Moar <rmoar@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/ABI/damon: document per-region DAMOS filter-passed bytes stat fileSeongJae Park
Document the new ABI for per-region operations set layer-handled DAMOS filters passed bytes statistic. Link: https://lkml.kernel.org/r/20250106193401.109161-17-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/admin-guide/mm/damon/usage: document sz_filtered_out of scheme tried ↵SeongJae Park
region directories Document the newly added DAMON sysfs interface file for per-scheme-tried region's bytes that passed the operations set handling DAMOS filters. Link: https://lkml.kernel.org/r/20250106193401.109161-16-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/mm/damon/design: document per-region sz_filter_passed statSeongJae Park
Update 'Regions Walking' section of design document for the newly added per-region operations set handling DAMOS filters-passed bytes. Link: https://lkml.kernel.org/r/20250106193401.109161-15-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs-schemes: expose per-region filter-passed bytesSeongJae Park
Per-region operations set-handled DAMOS filters passed memory size information is provided to only DAMON core API users. Further expose it to the user space by adding a new DAMON sysfs interface file under each scheme tried region directory. Link: https://lkml.kernel.org/r/20250106193401.109161-14-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/core: pass per-region filter-passed bytes to ↵SeongJae Park
damos_walk_control->walk_fn() Total size of memory that passed DAMON operations set layer-handled DAMOS filters per scheme is provided to DAMON core API and ABI (sysfs interface) users. Having it per-region in non-accumulated way can provide it in finer granularity. Provide it to damos_walk() core API users, by passing the data to damos_walk_control->walk_fn(). Link: https://lkml.kernel.org/r/20250106193401.109161-13-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/ABI/damon: document per-scheme filter-passed bytes stat fileSeongJae Park
Document the new ABI for per-scheme operations set layer-handled DAMOS filters passed bytes statistic on the ABI document. Link: https://lkml.kernel.org/r/20250106193401.109161-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/admin-guide/mm/damon/usage: document sz_ops_filter_passedSeongJae Park
Document the new per-scheme operations set layer-handled DAMOS filters passed bytes statistic file on DAMON sysfs interface usage document. Link: https://lkml.kernel.org/r/20250106193401.109161-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/mm/damon/design: document sz_ops_filter_passedSeongJae Park
Document the new per-scheme accumulated stat for total bytes that passed the operations set layer-handled DAMOS filters on the design document. Link: https://lkml.kernel.org/r/20250106193401.109161-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/syfs-schemes: implement per-scheme filter-passed bytes statSeongJae Park
Add a new DAMON sysfs interface file under scheme stat directory, namely 'sz_ops_filter_passed'. It represents total bytes that passed region-internal DAMOS filters of the scheme that handled by the DAMON operations set layer. Link: https://lkml.kernel.org/r/20250106193401.109161-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/core: implement per-scheme ops-handled filter-passed bytes statSeongJae Park
Implement a new per-DAMOS scheme statistic field, namely sz_ops_filter_passed, using the changed damon_operations->apply_scheme() interface. It counts total bytes of memory that given DAMOS action tried to be applied, and passed the operations layer handled region-internal filters of the scheme. DAMON API users can access it using DAMON-internal safe access features such as damon_call() and/or damos_walk(). Link: https://lkml.kernel.org/r/20250106193401.109161-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/paddr: report filter-passed bytes back for DAMOS_STAT actionSeongJae Park
DAMOS_STAT action handling of paddr DAMON operations set implementation is simply ignoring the region-internal DAMOS filters, and therefore not reporting back the filter-passed bytes. Apply the filters and report back the information. Before this change, DAMOS_STAT was doing nothing for DAMOS filters. Hence users might see some performance regressions. Such regression for use cases where no region-internal DAMOS filter is added to the scheme will be negligible, since this change avoids unnecessary filtering works if no such filter is installed. For old users who are using DAMOS_STAT with the types of filters, the regression could be visible depending on the size of the region and the overhead of the installed DAMOS filters. But, because the filters were completely ignored before in the use case, no real users would really depend on such use case that makes no point. Link: https://lkml.kernel.org/r/20250106193401.109161-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/paddr: report filter-passed bytes back for normal actionsSeongJae Park
damon_operations->apply_scheme() implementations are requested to report back how many bytes of the given region has passed DAMOS filter. 'paddr' operations set implementation supports some of region-internal DAMOS filter handling for normal DAMOS actions except DAMOS_STAT action. But, those are not respecting the request. Report the region-internal DAMOS filter-passed bytes back for the actions. Link: https://lkml.kernel.org/r/20250106193401.109161-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon: ask apply_scheme() to report filter-passed region-internal bytesSeongJae Park
Some DAMOS filter types including those for young page, anon page, and belonging memcg are handled by underlying DAMON operations set implementation, via damon_operations->apply_scheme() interface. How many bytes of the region have passed the filter can be useful for DAMOS scheme tuning and access pattern monitoring. Modify the interface to let the callback implementation reports back the number if possible. Link: https://lkml.kernel.org/r/20250106193401.109161-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/admin-guide/mm/damon/usage: link damos stat design docSeongJae Park
DAMON sysfs usage document focuses on usage, rather than the detail of the stat metric itself. Add a link to the design document on DAMOS stat usage section. Link: https://lkml.kernel.org/r/20250106193401.109161-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/mm/damon/design: add 'statistics' sectionSeongJae Park
DAMOS stats are important feature for tuning of DAMOS-based access-aware system operation, and efficient access pattern monitoring. But not well documented on the design document. Add a section on the document. Link: https://lkml.kernel.org/r/20250106193401.109161-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon: clarify trying vs applying on damos_stat kernel-doc commentSeongJae Park
Patch series "mm/damon: enable page level properties based monitoring". TL; DR ====== This patch series enables access monitoring based on page level properties including their anonymousness, belonging cgroups and young-ness, by extending DAMOS stats and regions walk features with region-internal DAMOS filters. Background ========== DAMOS has initially developed for only access-aware system operations. But, efficient acces monitoring results querying is yet another major usage of today's DAMOS. DAMOS stats and regions walk, which exposes accumulated counts and per-region monitoring results that filtered by DAMOS parameters including target access pattern, quotas and DAMOS filters, are the key features for that usage. For tunings and investigations, it can be more useful if only the information can be exposed without making real system operational change. Special DAMOS action, DAMOS_STAT, was introduced for the purpose. DAMOS fundametally works with only access pattern information in region granularity. For some use cases, fixed and fine granularity information based on non access pattern properties can be useful, though. For example, on systems having swap devices that much faster than storage devices for files, DAMOS-based proactive reclaim need to be applied differently for anonymous pages and file-backed pages. DAMOS filters is a feature that makes it possible. It supports non access pattern information including page level properties such as anonymousness, belonging cgroups, and young-ness (whether the page has accessed since the last access check of it). The information can be useful for tuning and investigations. DAMOS stat exposes some of it via {nr,sz}_applied, but it is mixed with operation failures. Also, exposing the information without making system operation change is impossible, since DAMOS_STAT simply ignores the page level properties based DAMOS filters. Design ====== Expose the exact information for every DAMOS action including DAMOS_STAT by implementing below changes. Extend the interface for DAMON operations set layer, which contains the implementation of the page level filters, to report back the amount of memory that passed the region-internal DAMOS filters to the core layer. On the core layer, account the operations set layer reported stat with DAMOS stat for per-scheme monitoring. Also, pass the information to regions walk for per-region monitoring. In this way, DAMON API users can efficiently get the fine-grained information. For the user-space, make DAMON sysfs interface collects the information using the updated DAMON core API, and expose those to new per-scheme stats file and per-DAMOS-tried region properties file. Practical Usages ================ With this patch series, DAMON users can query how many bytes of regions of specific access temperature is backed by pages of specific type. The type can be any of DAMOS filter-supporting one, including anonymousness, belonging cgroups, and young-ness. For example, users can visualize access hotness-based page granulairty histogram for different cgroups, backing content type, or youngness. In future, it could be extended to more types such as whether it is THP, position on LRU lists, etc. This can be useful for estimating benefits of a new or an existing access-aware system optimizations without really committing the changes. Patches Sequence ================ The patches are constructed in four sub-sequences. First three patches (patches 1-3) update documents to have missing background knowledges and better structures for easily introducing followup changes. Following three patches (patches 4-6) change the operations set layer interface to report back the region-internal filter passed memory size, and make the operations set implementations support the changed symantic. Following five patches (patches 7-11) implement per-scheme accumulated stat for region-internal filter-passed memory size on core API (damos_stat) and DAMON sysfs interface. First two patches of those are for code change, and following three patches are for documentation. Finally, five patches (patches 12-16) implementing per-region region-internal filter-passed memory size follows. Similar to that for per-scheme stat, first two patches implement core-API and sysfs interface change. Then three patches for documentation update follow. This patch (of 16): DAMOS stat kernel-doc documentation is using terms that bit ambiguous. Without reading the code, understanding it correctly is not that easy. Add the clarification on the kernel-doc comment. Link: https://lkml.kernel.org/r/20250106193401.109161-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250106193401.109161-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs: remove unused code for schemes tried regions updateSeongJae Park
DAMON sysfs interface was using damon_callback with its own complicated synchronization logics to update DAMOS scheme applied regions directories and files. But it is replaced to use damos_walk(), and the additional synchronization logics are no more being used. Remove those. Link: https://lkml.kernel.org/r/20250103174400.54890-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs: use damos_walk() for update_schemes_tried_{bytes,regions}SeongJae Park
DAMON sysfs interface uses damon_callback with its own complicated synchronization facility to handle update_schemes_tried_bytes and update_schemes_tried_regions commands. But damos_walk() can support the use case without the additional synchronizations. Convert the code to use damos_walk() instead. Link: https://lkml.kernel.org/r/20250103174400.54890-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25Docs/mm/damon/design: document DAMOS regions walkingSeongJae Park
DAMOS' regions walking is a feature for efficiently retrieving monitoring results or DAMOS-internal behavior. It can be useful for multiple purposes including investigations and tuning. Add a section for it on the design document. Link: https://lkml.kernel.org/r/20250103174400.54890-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/core: implement damos_walk()SeongJae Park
Introduce a new core layer interface, damos_walk(). It aims to replace some damon_callback usages that access DAMOS schemes applied regions of ongoing kdamond with additional synchronizations. It receives a function pointer and asks kdamond to invoke it for any region that it tried to apply any DAMOS action within one scheme apply interval for every scheme of it. The function further waits until the kdamond finishes the invocations for every scheme, or cancels the request, and returns. The kdamond invokes the function as requested within the main loop. If it is deactivated by DAMOS watermarks or going out of the main loop, it marks the request as canceled, so that damos_walk() can wakeup and return. Link: https://lkml.kernel.org/r/20250103174400.54890-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs: use damon_call() for update_schemes_effective_quotasSeongJae Park
DAMON sysfs interface uses damon_callback with its own synchronization facility to handle update_schemes_effective_quotas command. But damon_call() can support the use case without the additional synchronizations. Convert the code to use damon_call() instead. Link: https://lkml.kernel.org/r/20250103174400.54890-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs: use damon_call() for commit_schemes_quota_goalsSeongJae Park
DAMON sysfs interface uses damon_callback with its own synchronization facility to handle commit_schemes_quota_goals command. But damon_call() can support the use case without the additional synchronizations. Convert the code to use damon_call() instead. Link: https://lkml.kernel.org/r/20250103174400.54890-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs: use damon_call() for update_schemes_statsSeongJae Park
DAMON sysfs interface uses damon_callback with its own synchronization facility to handle update_schemes_stats kdamond command. But damon_call() can support the use case without the additional synchronizations. Convert the code to use damon_call() instead. Link: https://lkml.kernel.org/r/20250103174400.54890-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/core: introduce damon_call()SeongJae Park
Introduce a new DAMON core API function, damon_call(). It aims to replace some damon_callback usages that access damon_ctx of ongoing kdamond with additional synchronizations. It receives a function pointer, let the parallel kdamond invokes the function, and returns after the invocation is finished, or canceled due to some races. kdamond invokes the function inside the main loop after sampling is done. If it is deactivated by DAMOS watermarks or already out of the main loop, mark the request as canceled so that damon_call() can wakeup and return. Link: https://lkml.kernel.org/r/20250103174400.54890-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs: handle clear_schemes_tried_regions from DAMON sysfs contextSeongJae Park
DAMON sysfs interface handles clear_schemes_tried_regions request from the DAMON callback context (damon_sysfs_cmd_request_callback()), which is designed to be used for safe access to the related DAMON context internal data. But no DAMON context internal data is accessed for the work. Directly handle it from DAMON sysfs interface context, namely damon_sysfs_handle_cmd(). Link: https://lkml.kernel.org/r/20250103174400.54890-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/damon/sysfs-schemes: remove unnecessary schemes existence check in ↵SeongJae Park
damon_sysfs_schemes_clear_regions() Patch series "mm/damon: replace most damon_callback usages in sysfs with new core functions". DAMON provides damon_callback API that notifies monitoring events and allows safe access to damon_ctx internal data. The usage is simple. Users register and deregister callback functions for different monitoring events in damon_ctx. Then the DAMON worker thread (kdamond) of the damon_ctx calls back the registered functions on the events. It is designed in such simple way because it was sufficient for usages of DAMON at the early days. We also wanted to make it flexible so that API user code can implement any required additional features on top of damon_callback on their demands. As expected, more sophisticated usages have invented. Online updates of DAMON parameters and DAMOS auto-tuning inputs, and online retrieval of DAMOS statistics and tried regions information are such usages. Because damon_callback doesn't provide any explicit synchronization mechanism, the user ABIs for exposing such functionalities are implemented in asynchronous ways (DAMON_RECLAIM and DAMON_LRU_SORT}), or synchronous ways (DAMON_SYSFS) with additional synchronization mechanisms that built inside the ABI implementation, on top of damon_callback. So damon_callback is working as expected. However, the additional mechanisms built inside ABI on top of damon_callback is becoming somewhat too big and not easy to maintain. The additional mechanisms can be smaller and easier to maintain when implemented inside the core logic layer. Introduce two new DAMON core API, namely 'damon_call()' and 'damos_walk()'. The two functions support synchronous access to - damon_ctx internal data including DAMON parameters and monitoring results, and - DAMOS-specific data such as regions that each DAMOS action is applied, respectively. And replace most of damon_callback usages in DAMON sysfs interface with the new core API functions. damon_callback usage for online DAMON parameters tuning is not replaced in this series, since it has specific callback timing assumptions that require more works. Patch sequence ============== First two patches are fixups for simplifying the following changes. Those remove a unnecessary condition check and a synchronization, respectively. Third patch implements one of the new DAMON core APIs, namely damon_call(). Three patches replacing damon_callback usages in DAMON sysfs interface using damon_call() follow. Then, seventh and eighth patches introduces the other new DAMON API, damos_walk(), and document it on the design doc. Ninth patch replaces two damon_callback usages in DAMON sysfs interface using damos_walk(). The tenth patch finally cleans up code that no more being used. This patch (of 10): damon_sysfs_schemes_clear_regions() skips removing the scheme tried region directories only if the matching scheme is still ongoing. It is unnecessary check, since what users want is just removing the entire region directories. Remove the unnecessary check. Link: https://lkml.kernel.org/r/20250103174400.54890-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250103174400.54890-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: introduce ctor/dtor at PGD levelKevin Brodsky
Following on from the introduction of P4D-level ctor/dtor, let's finish the job and introduce ctor/dtor at PGD level. The incurred improvement in page accounting is minimal - the main motivation is to create a single, generic place where construction/destruction hooks can be added for all page table pages. This patch should cover all architectures and all configurations where PGDs are one or more regular pages. This excludes any configuration where PGDs are allocated from a kmem_cache object. Link: https://lkml.kernel.org/r/20250103184415.2744423-7-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25asm-generic: pgalloc: provide generic __pgd_{alloc,free}Kevin Brodsky
We already have a generic implementation of alloc/free up to P4D level, as well as pgd_free(). Let's finish the work and add a generic PGD-level alloc helper as well. Unlike at lower levels, almost all architectures need some specific magic at PGD level (typically initialising PGD entries), so introducing a generic pgd_alloc() isn't worth it. Instead we introduce two new helpers, __pgd_alloc() and __pgd_free(), and make use of them in the arch-specific pgd_alloc() and pgd_free() wherever possible. To accommodate as many arch as possible, __pgd_alloc() takes a page allocation order. Because pagetable_alloc() allocates zeroed pages, explicit zeroing in pgd_alloc() becomes redundant and we can get rid of it. Some trivial implementations of pgd_free() also become unnecessary once __pgd_alloc() is used; remove them. Another small improvement is consistent accounting of PGD pages by using GFP_PGTABLE_{USER,KERNEL} as appropriate. Not all PGD allocations can be handled by the generic helpers. In particular, multiple architectures allocate PGDs from a kmem_cache, and those PGDs may not be page-sized. Link: https://lkml.kernel.org/r/20250103184415.2744423-6-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25ARM: mm: rename PGD helpersKevin Brodsky
Generic implementations of __pgd_alloc and __pgd_free are about to be introduced. Rename the macros in arch/arm/mm/pgd.c to avoid clashes. While we're at it, also pass down the mm as argument to those helpers, as it will be needed to call the generic __pgd_{alloc,free}. Link: https://lkml.kernel.org/r/20250103184415.2744423-5-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25m68k: mm: add calls to pagetable_pmd_[cd]torKevin Brodsky
get_pointer_table() and free_pointer_table() already special-case TABLE_PTE to call pagetable_pte_[cd]tor. Let's do the same at PMD level to improve accounting further. TABLE_PGD and TABLE_PMD are currently defined to the same value, so we first need to separate them. That also implies separating ptable_list for PMD/PGD levels. Link: https://lkml.kernel.org/r/20250103184415.2744423-4-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25parisc: mm: ensure pagetable_pmd_[cd]tor are calledKevin Brodsky
The implementation of pmd_{alloc_one,free} on parisc requires a non-zero allocation order, but is completely standard aside from that. Let's reuse the generic implementation of pmd_alloc_one(). Explicit zeroing is not needed as GFP_PGTABLE_KERNEL includes __GFP_ZERO. The generic pmd_free() can handle higher allocation orders so we don't need to define our own. These changes ensure that pagetable_pmd_[cd]tor are called, improving the accounting of page table pages. Link: https://lkml.kernel.org/r/20250103184415.2744423-3-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: move common part of pagetable_*_ctor to helperKevin Brodsky
Patch series "Account page tables at all levels". This series should be considered in conjunction with Qi's series [1]. Together, they ensure that page table ctor/dtor are called at all levels (PTE to PGD) and all architectures, where page tables are regular pages. Besides the improvement in accounting and general cleanup, this also create a single place where construction/destruction hooks can be called for all page tables, namely the now-generic pagetable_dtor() introduced by Qi, and __pagetable_ctor() introduced in this series. [1] https://lore.kernel.org/linux-mm/cover.1735549103.git.zhengqi.arch@bytedance.com/ This patch (of 6): pagetable_*_ctor all have the same basic implementation. Move the common part to a helper to reduce duplication. Link: https://lkml.kernel.org/r/20250103184415.2744423-1-kevin.brodsky@arm.com Link: https://lkml.kernel.org/r/20250103184415.2744423-2-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/debug: prefer VM_WARN_ON_VMG() to report VMG debug warningsLorenzo Stoakes
Now we have VM_WARN_ON_VMG() to provide us with considerably more debug output when a debug assert fails, utilise it everywhere we can. This allows us to have considerably more information to go on when things go wrong, especially when a non-repro issue occurs as reported by syzkaller or the like. Link: https://lkml.kernel.org/r/986e45e9549e71284ac7a7fa878688568a94d58b.1735932169.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm/debug: introduce VM_WARN_ON_VMG() to dump VMA merge stateLorenzo Stoakes
Patch series "mm/debug: introduce and use VM_WARN_ON_VMG()". We use a number of asserts, enabled only when CONFIG_DEBUG_VM is set, during VMA merge operations to ensure state is as expected. However, when syzkaller or the like encounters these asserts, often the information provided by the report is insufficient to narrow down what the problem is. We noticed this recently in [0], where a non-repro issue resisted debugging due to simply not having sufficient information to go on. This series improves the situation by providing VM_WARN_ON_VMG() which acts like VM_WARN_ON() (i.e. only actually being invoked if CONFIG_DEBUG_VM is set), while dumping significant information about the VMA merge state, the mm_struct describing the virtual address space, all associated VMAs and, if CONFIG_DEBUG_VM_MAPLE_TREE is set, the associated maple tree. [0]:https://lore.kernel.org/all/6774c98f.050a0220.25abdd.0991.GAE@google.com/ This patch (of 2): We use a number of asserts, enabled only when CONFIG_DEBUG_VM is set, during VMA merge operations to ensure state is as expected. However, when syzkaller or the like encounters these asserts, often the information provided by the report is insufficient to narrow down what the problem is. This might not be so much of an issue if the reported problem is reproducible, but if it is a rarely encountered race or some other case which precludes a repro, it is a very big problem (see [0] for the motivating case). It is therefore sensible to provide a means by which we can easily and conveniently dump a lot more information in these circumstances. The aggregation of merge state into a single struct threaded through the operation makes this trivial - we can simply introduce a variant on VM_WARN_ON() which takes the VMA merge state object (vmg) and use that to dump information. This patch therefore introduces VM_WARN_ON_VMG() which provides this functionality. It additionally dumps full mm state, VMA state for each of the three VMAs the vmg contains (prev, next, vma) and if CONFIG_DEBUG_VM_MAPLE_TREE is enabled, dumps the maple tree from the provided VMA iterator if non-NULL. This patch has no functional impact if CONFIG_DEBUG_VM is not set. [0]:https://lore.kernel.org/all/6774c98f.050a0220.25abdd.0991.GAE@google.com/ Link: https://lkml.kernel.org/r/cover.1735932169.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/13b09b52d4d103ee86acaf0ae612539648ae29e0.1735932169.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25lib/list_debug.c: add object information in case of invalid objectManinder Singh
As of now during link list corruption it prints about cluprit address and its wrong value, but sometime it is not enough to catch the actual issue point. If it prints allocation and free path of that corrupted node, it will be a lot easier to find and fix the issues. Adding the same information when data mismatch is found in link list debug data: [ 14.243055] slab kmalloc-32 start ffff0000cda19320 data offset 32 pointer offset 8 size 32 allocated at add_to_list+0x28/0xb0 [ 14.245259] __kmalloc_cache_noprof+0x1c4/0x358 [ 14.245572] add_to_list+0x28/0xb0 ... [ 14.248632] do_el0_svc_compat+0x1c/0x34 [ 14.249018] el0_svc_compat+0x2c/0x80 [ 14.249244] Free path: [ 14.249410] kfree+0x24c/0x2f0 [ 14.249724] do_force_corruption+0xbc/0x100 ... [ 14.252266] el0_svc_common.constprop.0+0x40/0xe0 [ 14.252540] do_el0_svc_compat+0x1c/0x34 [ 14.252763] el0_svc_compat+0x2c/0x80 [ 14.253071] ------------[ cut here ]------------ [ 14.253303] list_del corruption. next->prev should be ffff0000cda192a8, but was 6b6b6b6b6b6b6b6b. (next=ffff0000cda19348) [ 14.254255] WARNING: CPU: 3 PID: 84 at lib/list_debug.c:65 __list_del_entry_valid_or_report+0x158/0x164 Moved prototype of mem_dump_obj() to bug.h, as mm.h can not be included in bug.h. Link: https://lkml.kernel.org/r/20241230101043.53773-1-maninder1.s@samsung.com Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Rohit Thapliyal <r.thapliyal@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>