summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2023-10-25mempolicy: remove confusing MPOL_MF_LAZY dead codeHugh Dickins
v3.8 commit b24f53a0bea3 ("mm: mempolicy: Add MPOL_MF_LAZY") introduced MPOL_MF_LAZY, and included it in the MPOL_MF_VALID flags; but a720094ded8 ("mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now") immediately removed it from MPOL_MF_VALID flags, pending further review. "This will need to be revisited", but it has not been reinstated. The present state is confusing: there is dead code in mm/mempolicy.c to handle MPOL_MF_LAZY cases which can never occur. Remove that: it can be resurrected later if necessary. But keep the definition of MPOL_MF_LAZY, which must remain in the UAPI, even though it always fails with EINVAL. https://lore.kernel.org/linux-mm/1553041659-46787-1-git-send-email-yang.shi@linux.alibaba.com/ links to a previous request to remove MPOL_MF_LAZY. Link: https://lkml.kernel.org/r/80c9665c-1c3f-17ba-21a3-f6115cebf7d@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mempolicy: mpol_shared_policy_init() without pseudo-vmaHugh Dickins
mpol_shared_policy_init() does not need to use a pseudo-vma: it can use sp_alloc() and sp_insert() directly, since the object's shared policy tree is empty and inaccessible (needing no lock) at get_inode() time. Link: https://lkml.kernel.org/r/3bef62d8-ae78-4c2-533-56a44ae425c@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mempolicy trivia: use pgoff_t in shared mempolicy treeHugh Dickins
Prefer the more explicit "pgoff_t" to "unsigned long" when dealing with a shared mempolicy tree. Delete confusing comment about pseudo mm vmas. Link: https://lkml.kernel.org/r/5451157-3818-4af5-fd2c-5d26a5d1dc53@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mempolicy trivia: slightly more consistent namingHugh Dickins
Before getting down to work, do a little cleanup, mainly of inconsistent variable naming. I gave up trying to rationalize mpol versus pol versus policy, and node versus nid, but let's avoid p and nd. Remove a few superfluous blank lines, but add one; and here prefer vma->vm_policy to vma_policy(vma) - the latter being appropriate in other sources, which have to allow for !CONFIG_NUMA. That intriguing line about KERNEL_DS? should have gone in v2.6.15, when numa_policy_init() stopped using set_mempolicy(2)'s system call handler. Link: https://lkml.kernel.org/r/68287974-b6ae-7df-4ba-d19ddd69cbf@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mempolicy trivia: delete those ancient pr_debug()sHugh Dickins
Delete those ancient pr_debug()s - PDprintk()s in Andi Kleen's original submission of core NUMA API, and useful when debugging shared mempolicy lifetime back then, but not used recently. Link: https://lkml.kernel.org/r/f25135-ffb2-40d8-9577-720772b333@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mempolicy: fix migrate_pages(2) syscall return nr_failedHugh Dickins
"man 2 migrate_pages" says "On success migrate_pages() returns the number of pages that could not be moved". Although 5.3 and 5.4 commits fixed mbind(MPOL_MF_STRICT|MPOL_MF_MOVE*) to fail with EIO when not all pages could be moved (because some could not be isolated for migration), migrate_pages(2) was left still reporting only those pages failing at the migration stage, forgetting those failing at the earlier isolation stage. Fix that by accumulating a long nr_failed count in struct queue_pages, returned by queue_pages_range() when it's not returning an error, for adding on to the nr_failed count from migrate_pages() in mm/migrate.c. A count of pages? It's more a count of folios, but changing it to pages would entail more work (also in mm/migrate.c): does not seem justified. queue_pages_range() itself should only return -EIO in the "strictly unmovable" case (STRICT without any MOVEs): in that case it's best to break out as soon as nr_failed gets set; but otherwise it should continue to isolate pages for MOVing even when nr_failed - as the mbind(2) manpage promises. There's a case when nr_failed should be incremented when it was missed: queue_folios_pte_range() and queue_folios_hugetlb() count the transient migration entries, like queue_folios_pmd() already did. And there's a case when nr_failed should not be incremented when it would have been: in meeting later PTEs of the same large folio, which can only be isolated once: fixed by recording the current large folio in struct queue_pages. Clean up the affected functions, fixing or updating many comments. Bool migrate_folio_add(), without -EIO: true if adding, or if skipping shared (but its arguable folio_estimated_sharers() heuristic left unchanged). Use MPOL_MF_WRLOCK flag to queue_pages_range(), instead of bool lock_vma. Use explicit STRICT|MOVE* flags where queue_pages_test_walk() checks for skipping, instead of hiding them behind MPOL_MF_VALID. Link: https://lkml.kernel.org/r/9a6b0b9-3bb-dbef-8adf-efab4397b8d@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()SeongJae Park
damon_sysfs_set_targets() had a bug that can result in unexpected memory usage and monitoring overhead increase. The bug has fixed by a previous commit. Add a unit test for avoiding a similar bug of future. Link: https://lkml.kernel.org/r/20231022210735.46409-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/damon/core: avoid divide-by-zero from pseudo-moving window length calculationSeongJae Park
When calculating the pseudo-moving access rate, DAMON divides some values by the maximum nr_accesses. However, due to the type of the related variables, simple division-based calculation of the divisor can return zero. As a result, divide-by-zero is possible. Fix it by using damon_max_nr_accesses(), which handles the case. Note that this is a fix for a commit that not in the mainline but mm tree. Link: https://lkml.kernel.org/r/20231019194924.100347-6-sj@kernel.org Fixes: ace30fb21af5 ("mm/damon/core: use pseudo-moving sum for nr_accesses_bp") Reported-by: Jakub Acs <acsjakub@amazon.de> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/damon/lru_sort: avoid divide-by-zero in hot threshold calculationSeongJae Park
When calculating the hotness threshold for lru_prio scheme of DAMON_LRU_SORT, the module divides some values by the maximum nr_accesses. However, due to the type of the related variables, simple division-based calculation of the divisor can return zero. As a result, divide-by-zero is possible. Fix it by using damon_max_nr_accesses(), which handles the case. Link: https://lkml.kernel.org/r/20231019194924.100347-5-sj@kernel.org Fixes: 40e983cca927 ("mm/damon: introduce DAMON-based LRU-lists Sorting") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Jakub Acs <acsjakub@amazon.de> Cc: <stable@vger.kernel.org> [6.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/damon/ops-common: avoid divide-by-zero during region hotness calculationSeongJae Park
When calculating the hotness of each region for the under-quota regions prioritization, DAMON divides some values by the maximum nr_accesses. However, due to the type of the related variables, simple division-based calculation of the divisor can return zero. As a result, divide-by-zero is possible. Fix it by using damon_max_nr_accesses(), which handles the case. Link: https://lkml.kernel.org/r/20231019194924.100347-4-sj@kernel.org Fixes: 198f0f4c58b9 ("mm/damon/vaddr,paddr: support pageout prioritization") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Jakub Acs <acsjakub@amazon.de> Cc: <stable@vger.kernel.org> [5.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/damon/core: avoid divide-by-zero during monitoring results updateSeongJae Park
When monitoring attributes are changed, DAMON updates access rate of the monitoring results accordingly. For that, it divides some values by the maximum nr_accesses. However, due to the type of the related variables, simple division-based calculation of the divisor can return zero. As a result, divide-by-zero is possible. Fix it by using damon_max_nr_accesses(), which handles the case. Link: https://lkml.kernel.org/r/20231019194924.100347-3-sj@kernel.org Fixes: 2f5bef5a590b ("mm/damon/core: update monitoring results for new monitoring attributes") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Jakub Acs <acsjakub@amazon.de> Cc: <stable@vger.kernel.org> [6.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: mlock: avoid folio_within_range() on KSM pagesHugh Dickins
Since commit dc68badcede4 ("mm: mlock: update mlock_pte_range to handle large folio") I've just occasionally seen VM_WARN_ON_FOLIO(folio_test_ksm) warnings from folio_within_range(), in a splurge after testing with KSM hyperactive. folio_referenced_one()'s use of folio_within_vma() is safe because it checks folio_test_large() first; but allow_mlock_munlock() needs to do the same to avoid those warnings (or check !folio_test_ksm() itself? Or move either check into folio_within_range()? Hard to tell without more examples of its use). Link: https://lkml.kernel.org/r/23852f6a-5bfa-1ffd-30db-30c5560ad426@google.com Fixes: dc68badcede4 ("mm: mlock: update mlock_pte_range to handle large folio") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: migrate: record the mlocked page status to remove unnecessary lru drainBaolin Wang
When doing compaction, I found the lru_add_drain() is an obvious hotspot when migrating pages. The distribution of this hotspot is as follows: - 18.75% compact_zone - 17.39% migrate_pages - 13.79% migrate_pages_batch - 11.66% migrate_folio_move - 7.02% lru_add_drain + 7.02% lru_add_drain_cpu + 3.00% move_to_new_folio 1.23% rmap_walk + 1.92% migrate_folio_unmap + 3.20% migrate_pages_sync + 0.90% isolate_migratepages The lru_add_drain() was added by commit c3096e6782b7 ("mm/migrate: __unmap_and_move() push good newpage to LRU") to drain the newpage to LRU immediately, to help to build up the correct newpage->mlock_count in remove_migration_ptes() for mlocked pages. However, if there are no mlocked pages are migrating, then we can avoid this lru drain operation, especailly for the heavy concurrent scenarios. So we can record the source pages' mlocked status in migrate_folio_unmap(), and only drain the lru list when the mlocked status is set in migrate_folio_move(). In addition, the page was already isolated from lru when migrating, so checking the mlocked status is stable by folio_test_mlocked() in migrate_folio_unmap(). After this patch, I can see the hotpot of the lru_add_drain() is gone: - 9.41% migrate_pages_batch - 6.15% migrate_folio_move - 3.64% move_to_new_folio + 1.80% migrate_folio_extra + 1.70% buffer_migrate_folio + 1.41% rmap_walk + 0.62% folio_add_lru + 3.07% migrate_folio_unmap Meanwhile, the compaction latency shows some improvements when running thpscale: base patched Amean fault-both-1 1131.22 ( 0.00%) 1112.55 * 1.65%* Amean fault-both-3 2489.75 ( 0.00%) 2324.15 * 6.65%* Amean fault-both-5 3257.37 ( 0.00%) 3183.18 * 2.28%* Amean fault-both-7 4257.99 ( 0.00%) 4079.04 * 4.20%* Amean fault-both-12 6614.02 ( 0.00%) 6075.60 * 8.14%* Amean fault-both-18 10607.78 ( 0.00%) 8978.86 * 15.36%* Amean fault-both-24 14911.65 ( 0.00%) 11619.55 * 22.08%* Amean fault-both-30 14954.67 ( 0.00%) 14925.66 * 0.19%* Amean fault-both-32 16654.87 ( 0.00%) 15580.31 * 6.45%* Link: https://lkml.kernel.org/r/06e9153a7a4850352ec36602df3a3a844de45698.1697859741.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: hugetlb_vmemmap: fix reference to nonexistent fileVegard Nossum
The directory this file is in was renamed but the reference didn't get updated. Fix it. Link: https://lkml.kernel.org/r/20231022185619.919397-1-vegard.nossum@oracle.com Fixes: ee65728e103b ("docs: rename Documentation/vm to Documentation/mm") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Wu XiangCheng <bobwxc@email.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: page_alloc: check the order of compound page even when the order is zeroHyesoo Yu
For compound pages, the head sets the PG_head flag and the tail sets the compound_head to indicate the head page. If a user allocates a compound page and frees it with a different order, the compound page information will not be properly initialized. To detect this problem, compound_order(page) and the order argument are compared, but this is not checked when the order argument is zero. That error should be checked regardless of the order. Link: https://lkml.kernel.org/r/20231023083217.1866451-1-hyesoo.yu@samsung.com Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: fix multiple typos in multiple filesMuhammad Muzammil
Link: https://lkml.kernel.org/r/20231023124405.36981-1-m.muzzammilashraf@gmail.com Signed-off-by: Muhammad Muzammil <m.muzzammilashraf@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Muzammil <m.muzzammilashraf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/khugepaged: convert collapse_pte_mapped_thp() to use foliosVishal Moola (Oracle)
This removes 2 calls to compound_head() and helps convert khugepaged to use folios throughout. Previously, if the address passed to collapse_pte_mapped_thp() corresponded to a tail page, the scan would fail immediately. Using filemap_lock_folio() we get the corresponding folio back and try to operate on the folio instead. Link: https://lkml.kernel.org/r/20231020183331.10770-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/khugepaged: convert alloc_charge_hpage() to use foliosVishal Moola (Oracle)
Also remove count_memcg_page_event now that its last caller no longer uses it and reword hpage_collapse_alloc_page() to hpage_collapse_alloc_folio(). This removes 1 call to compound_head() and helps convert khugepaged to use folios throughout. Link: https://lkml.kernel.org/r/20231020183331.10770-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/khugepaged: convert is_refcount_suitable() to use foliosVishal Moola (Oracle)
Both callers of is_refcount_suitable() have been converted to use folios, so convert it to take in a folio. Both callers only operate on head pages of folios so mapcount/refcount conversions here are trivial. Removes 3 calls to compound head, and removes 315 bytes of kernel text. Link: https://lkml.kernel.org/r/20231020183331.10770-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/khugepaged: convert hpage_collapse_scan_pmd() to use foliosVishal Moola (Oracle)
Replaces 5 calls to compound_head(), and removes 1385 bytes of kernel text. Link: https://lkml.kernel.org/r/20231020183331.10770-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/khugepaged: convert __collapse_huge_page_isolate() to use foliosVishal Moola (Oracle)
Patch series "Some khugepaged folio conversions", v3. This patchset converts a number of functions to use folios. This cleans up some khugepaged code and removes a large number of hidden compound_head() calls. This patch (of 5): Replaces 11 calls to compound_head() with 1, and removes 1348 bytes of kernel text. Link: https://lkml.kernel.org/r/20231020183331.10770-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20231020183331.10770-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: memory_hotplug: drop memoryless node from fallback listsQi Zheng
In offline_pages(), if a node becomes memoryless, we will clear its N_MEMORY state by calling node_states_clear_node(). But we do this after rebuilding the zonelists by calling build_all_zonelists(), which will cause this memoryless node to still be in the fallback nodes (node_order[]) of other nodes. To drop memoryless nodes from fallback nodes in this case, just call node_states_clear_node() before calling build_all_zonelists(). In this way, we will not try to allocate pages from memoryless node0, then the panic mentioned in [1] will also be fixed. Even though this problem has been solved by dropping the NODE_MIN_SIZE constrain in x86 [2], it would be better to fix it in the core MM as well. https://lore.kernel.org/all/20230212110305.93670-1-zhengqi.arch@bytedance.com/ [1] https://lore.kernel.org/all/20231017062215.171670-1-rppt@kernel.org/ [2] Link: https://lkml.kernel.org/r/9f1dbe7ee1301c7163b2770e32954ff5e3ecf2c4.1697711415.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: page_alloc: skip memoryless nodes entirelyQi Zheng
Patch series "handle memoryless nodes more appropriately", v3. Currently, in the process of initialization or offline memory, memoryless nodes will still be built into the fallback list of itself or other nodes. This is not what we expected, so this patch series removes memoryless nodes from the fallback list entirely. This patch (of 2): In find_next_best_node(), we skipped the memoryless nodes when building the zonelists of other normal nodes (N_NORMAL), but did not skip the memoryless node itself when building the zonelist. This will cause it to be traversed at runtime. For example, say we have node0 and node1, node0 is memoryless node, then the fallback order of node0 and node1 as follows: [ 0.153005] Fallback order for Node 0: 0 1 [ 0.153564] Fallback order for Node 1: 1 After this patch, we skip memoryless node0 entirely, then the fallback order of node0 and node1 as follows: [ 0.155236] Fallback order for Node 0: 1 [ 0.155806] Fallback order for Node 1: 1 So it becomes completely invisible, which will reduce runtime overhead. And in this way, we will not try to allocate pages from memoryless node0, then the panic mentioned in [1] will also be fixed. Even though this problem has been solved by dropping the NODE_MIN_SIZE constrain in x86 [2], it would be better to fix it in core MM as well. [1]. https://lore.kernel.org/all/20230212110305.93670-1-zhengqi.arch@bytedance.com/ [2]. https://lore.kernel.org/all/20231017062215.171670-1-rppt@kernel.org/ [zhengqi.arch@bytedance.com: update comment, per Ingo] Link: https://lkml.kernel.org/r/7300fc00a057eefeb9a68c8ad28171c3f0ce66ce.1697799303.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/cover.1697799303.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/cover.1697711415.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/157013e978468241de4a4c05d5337a44638ecb0e.1697711415.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/migrate: add nr_split to trace_mm_migrate_pages stats.Zi Yan
Add nr_split to trace_mm_migrate_pages for large folio (including THP) split events. [akpm@linux-foundation.org: cleanup per Huang, Ying] Link: https://lkml.kernel.org/r/20231017163129.2025214-2-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/migrate: correct nr_failed in migrate_pages_sync()Zi Yan
nr_failed was missing the large folio splits from migrate_pages_batch() and can cause a mismatch between migrate_pages() return value and the number of not migrated pages, i.e., when the return value of migrate_pages() is 0, there are still pages left in the from page list. It will happen when a non-PMD THP large folio fails to migrate due to -ENOMEM and is split successfully but not all the split pages are not migrated, migrate_pages_batch() would return non-zero, but astats.nr_thp_split = 0. nr_failed would be 0 and returned to the caller of migrate_pages(), but the not migrated pages are left in the from page list without being added back to LRU lists. Fix it by adding a new nr_split counter for large folio splits and adding it to nr_failed in migrate_page_sync() after migrate_pages_batch() is done. Link: https://lkml.kernel.org/r/20231017163129.2025214-1-zi.yan@sent.com Fixes: 2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously firstly") Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: Huang Ying <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/kmemleak: move the initialisation of object to __link_objectLiu Shixin
In patch (mm: kmemleak: split __create_object into two functions), the initialisation of object has been splited in two places. Catalin said it feels a bit weird and error prone. So leave __alloc_object() to just do the actual allocation and let __link_object() do the full initialisation. Link: https://lkml.kernel.org/r/20231023025125.90972-1-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/kmemleak: fix partially freeing unknown object warningLiu Shixin
delete_object_part() can be called by multiple callers in the same time. If an object is found and removed by a caller, and then another caller try to find it too, it failed and return directly. It still be recorded by kmemleak even if it has already been freed to buddy. With DEBUG on, kmemleak will report the following warning, kmemleak: Partially freeing unknown object at 0xa1af86000 (size 4096) CPU: 0 PID: 742 Comm: test_huge Not tainted 6.6.0-rc3kmemleak+ #54 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x37/0x50 kmemleak_free_part_phys+0x50/0x60 hugetlb_vmemmap_optimize+0x172/0x290 ? __pfx_vmemmap_remap_pte+0x10/0x10 __prep_new_hugetlb_folio+0xe/0x30 prep_new_hugetlb_folio.isra.0+0xe/0x40 alloc_fresh_hugetlb_folio+0xc3/0xd0 alloc_surplus_hugetlb_folio.constprop.0+0x6e/0xd0 hugetlb_acct_memory.part.0+0xe6/0x2a0 hugetlb_reserve_pages+0x110/0x2c0 hugetlbfs_file_mmap+0x11d/0x1b0 mmap_region+0x248/0x9a0 ? hugetlb_get_unmapped_area+0x15c/0x2d0 do_mmap+0x38b/0x580 vm_mmap_pgoff+0xe6/0x190 ksys_mmap_pgoff+0x18a/0x1f0 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Expand __create_object() and move __alloc_object() to the beginning. Then use kmemleak_lock to protect __find_and_remove_object() and __link_object() as a whole, which can guarantee all objects are processed sequentialally. Link: https://lkml.kernel.org/r/20231018102952.3339837-8-liushixin2@huawei.com Fixes: 53238a60dd4a ("kmemleak: Allow partial freeing of memory blocks") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: kmemleak: add __find_and_remove_object()Liu Shixin
Add new __find_and_remove_object() without kmemleak_lock protect, it is in preparation for the next patch. Link: https://lkml.kernel.org/r/20231018102952.3339837-7-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: kmemleak: use mem_pool_free() to free objectLiu Shixin
The kmemleak object is allocated by mem_pool_alloc(), which could be from slab or mem_pool[], so it's not suitable using __kmem_cache_free() to free the object, use __mem_pool_free() instead. Link: https://lkml.kernel.org/r/20231018102952.3339837-6-liushixin2@huawei.com Fixes: 0647398a8c7b ("mm: kmemleak: simple memory allocation pool for kmemleak objects") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: kmemleak: split __create_object into two functionsLiu Shixin
__create_object() consists of two part, the first part allocate a kmemleak object and initialize it, the second part insert it into object tree. This function need kmemleak_lock but actually only the second part need lock. Split it into two functions, the first function __alloc_object only allocate a kmemleak object, and the second function __link_object() will initialize the object and insert it into object tree, use the kmemleak_lock to protect __link_object() only. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20231018102952.3339837-5-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/kmemleak: fix print format of pointer in pr_debug()Liu Shixin
With 0x%p, the pointer will be hashed and print (____ptrval____) instead. And with 0x%pa, the pointer can be successfully printed but with duplicate prefixes, which looks like: kmemleak: kmemleak_free(0x(____ptrval____)) kmemleak: kmemleak_free_percpu(0x(____ptrval____)) kmemleak: kmemleak_free_part_phys(0x0x0000000a1af86000) Use 0x%px instead of 0x%p or 0x%pa to print the pointer. Then the print will be like: kmemleak: kmemleak_free(0xffff9111c145b020) kmemleak: kmemleak_free_percpu(0x00000000000333b0) kmemleak: kmemleak_free_part_phys(0x0000000a1af80000) Link: https://lkml.kernel.org/r/20231018102952.3339837-4-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25bootmem: use kmemleak_free_part_phys in put_page_bootmemLiu Shixin
Patch series "Some bugfix about kmemleak", v3. Some bugfixes for kmemleak and the printed info from debug mode. This patch (of 7): Since kmemleak_alloc_phys() rather than kmemleak_alloc() was called from memblock_alloc_range_nid(), kmemleak_free_part_phys() should be used to delete kmemleak object in put_page_bootmem(). In debug mode, there are following warning: kmemleak: Partially freeing unknown object at 0xffff97345aff7000 (size 4096) Link: https://lkml.kernel.org/r/20231018102952.3339837-1-liushixin2@huawei.com Link: https://lkml.kernel.org/r/20231018102952.3339837-2-liushixin2@huawei.com Fixes: dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: remove page_cpupid_xchg_last()Kefeng Wang
Since all calls use folio_xchg_last_cpupid(), remove page_cpupid_xchg_last(). Link: https://lkml.kernel.org/r/20231018140806.2783514-20-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: use folio_xchg_last_cpupid() in wp_page_reuse()Kefeng Wang
Convert to use folio_xchg_last_cpupid() in wp_page_reuse(), and remove page variable. Since now only normal and PMD-mapped page is handled by numa balancing, it's enough to only update the entire folio's last cpupid. Link: https://lkml.kernel.org/r/20231018140806.2783514-19-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: convert wp_page_reuse() and finish_mkwrite_fault() to take a folioKefeng Wang
Saves one compound_head() call, also in preparation for page_cpupid_xchg_last() conversion. Link: https://lkml.kernel.org/r/20231018140806.2783514-18-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: make finish_mkwrite_fault() staticKefeng Wang
Make finish_mkwrite_fault static since it is not used outside of memory.c. Link: https://lkml.kernel.org/r/20231018140806.2783514-17-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: huge_memory: use folio_xchg_last_cpupid() in __split_huge_page_tail()Kefeng Wang
Convert to use folio_xchg_last_cpupid() in __split_huge_page_tail(). Link: https://lkml.kernel.org/r/20231018140806.2783514-16-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: migrate: use folio_xchg_last_cpupid() in folio_migrate_flags()Kefeng Wang
Convert to use folio_xchg_last_cpupid() in folio_migrate_flags(), also directly use folio_nid() instead of page_to_nid(&folio->page). Link: https://lkml.kernel.org/r/20231018140806.2783514-15-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: huge_memory: use a folio in change_huge_pmd()Kefeng Wang
Use a folio in change_huge_pmd(), which helps to remove last xchg_page_access_time() caller. Link: https://lkml.kernel.org/r/20231018140806.2783514-11-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: mprotect: use a folio in change_pte_range()Kefeng Wang
Use a folio in change_pte_range() to save three compound_head() calls. Since now only normal and PMD-mapped page is handled by numa balancing, it is enough to only update the entire folio's access time. Link: https://lkml.kernel.org/r/20231018140806.2783514-10-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: huge_memory: use folio_last_cpupid() in __split_huge_page_tail()Kefeng Wang
Convert to use folio_last_cpupid() in __split_huge_page_tail(). Link: https://lkml.kernel.org/r/20231018140806.2783514-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: huge_memory: use folio_last_cpupid() in do_huge_pmd_numa_page()Kefeng Wang
Convert to use folio_last_cpupid() in do_huge_pmd_numa_page(). Link: https://lkml.kernel.org/r/20231018140806.2783514-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: memory: use folio_last_cpupid() in do_numa_page()Kefeng Wang
Convert to use folio_last_cpupid() in do_numa_page(). Link: https://lkml.kernel.org/r/20231018140806.2783514-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm/swap: avoid a xa load for swapout pathKairui Song
A variable is never used for swapout path (shadowp is NULL) and compiler is unable to optimize out the unneeded load since it's a function call. The was introduced by 3852f6768ede ("mm/swapcache: support to handle the shadow entries"). Link: https://lkml.kernel.org/r/20231017011728.37508-1-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: kmem: reimplement get_obj_cgroup_from_current()Roman Gushchin
Reimplement get_obj_cgroup_from_current() using current_obj_cgroup(). get_obj_cgroup_from_current() and current_obj_cgroup() share 80% of the code, so the new implementation is almost trivial. get_obj_cgroup_from_current() is a convenient function used by the bpf subsystem, so there is no reason to get rid of it completely. Link: https://lkml.kernel.org/r/20231019225346.1822282-7-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25percpu: scoped objcg protectionRoman Gushchin
Similar to slab and kmem, switch to a scope-based protection of the objcg pointer to avoid. Link: https://lkml.kernel.org/r/20231019225346.1822282-6-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: kmem: scoped objcg protectionRoman Gushchin
Switch to a scope-based protection of the objcg pointer on slab/kmem allocation paths. Instead of using the get_() semantics in the pre-allocation hook and put the reference afterwards, let's rely on the fact that objcg is pinned by the scope. It's possible because: 1) if the objcg is received from the current task struct, the task is keeping a reference to the objcg. 2) if the objcg is received from an active memcg (remote charging), the memcg is pinned by the scope and has a reference to the corresponding objcg. Link: https://lkml.kernel.org/r/20231019225346.1822282-5-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: kmem: make memcg keep a reference to the original objcgRoman Gushchin
Keep a reference to the original objcg object for the entire life of a memcg structure. This allows to simplify the synchronization on the kernel memory allocation paths: pinning a (live) memcg will also pin the corresponding objcg. The memory overhead of this change is minimal because object cgroups usually outlive their corresponding memory cgroups even without this change, so it's only an additional pointer per memcg. Link: https://lkml.kernel.org/r/20231019225346.1822282-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: kmem: add direct objcg pointer to task_structRoman Gushchin
To charge a freshly allocated kernel object to a memory cgroup, the kernel needs to obtain an objcg pointer. Currently it does it indirectly by obtaining the memcg pointer first and then calling to __get_obj_cgroup_from_memcg(). Usually tasks spend their entire life belonging to the same object cgroup. So it makes sense to save the objcg pointer on task_struct directly, so it can be obtained faster. It requires some work on fork, exit and cgroup migrate paths, but these paths are way colder. To avoid any costly synchronization the following rules are applied: 1) A task sets it's objcg pointer itself. 2) If a task is being migrated to another cgroup, the least significant bit of the objcg pointer is set atomically. 3) On the allocation path the objcg pointer is obtained locklessly using the READ_ONCE() macro and the least significant bit is checked. If it's set, the following procedure is used to update it locklessly: - task->objcg is zeroed using cmpxcg - new objcg pointer is obtained - task->objcg is updated using try_cmpxchg - operation is repeated if try_cmpxcg fails It guarantees that no updates will be lost if task migration is racing against objcg pointer update. It also allows to keep both read and write paths fully lockless. Because the task is keeping a reference to the objcg, it can't go away while the task is alive. This commit doesn't change the way the remote memcg charging works. Link: https://lkml.kernel.org/r/20231019225346.1822282-3-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25mm: kmem: optimize get_obj_cgroup_from_current()Roman Gushchin
Patch series "mm: improve performance of accounted kernel memory allocations", v5. This patchset improves the performance of accounted kernel memory allocations by ~30% as measured by a micro-benchmark [1]. The benchmark is very straightforward: 1M of 64 bytes-large kmalloc() allocations. Below are results with the disabled kernel memory accounting, the original state and with this patchset applied. | | Kmem disabled | Original | Patched | Delta | |-------------+---------------+----------+---------+--------| | User cgroup | 29764 | 84548 | 59078 | -30.0% | | Root cgroup | 29742 | 48342 | 31501 | -34.8% | As we can see, the patchset removes the majority of the overhead when there is no actual accounting (a task belongs to the root memory cgroup) and almost halves the accounting overhead otherwise. The main idea is to get rid of unnecessary memcg to objcg conversions and switch to a scope-based protection of objcgs, which eliminates extra operations with objcg reference counters under a rcu read lock. More details are provided in individual commit descriptions. This patch (of 5): Manually inline memcg_kmem_bypass() and active_memcg() to speed up get_obj_cgroup_from_current() by avoiding duplicate in_task() checks and active_memcg() readings. Also add a likely() macro to __get_obj_cgroup_from_memcg(): obj_cgroup_tryget() should succeed at almost all times except a very unlikely race with the memcg deletion path. Link: https://lkml.kernel.org/r/20231019225346.1822282-1-roman.gushchin@linux.dev Link: https://lkml.kernel.org/r/20231019225346.1822282-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>