path: root/mm
Age  Commit message  Author
2024-05-05  mm/ksm: convert chain series funcs and replace get_ksm_page  [Alex Shi (tencent)]
In the KSM stable tree all pages are single, so let's convert the chain series funcs, as well as stable_tree_insert/stable_tree_search, to use folios. And replace get_ksm_page() with ksm_get_folio() since it is no longer needed. This saves a few compound_head calls. Link: https://lkml.kernel.org/r/20240411061713.1847574-9-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/ksm: use folio in write_protect_page  [Alex Shi (tencent)]
A compound page is checked and skipped before write_protect_page() is called, so use a folio to save a few compound_head checks. Link: https://lkml.kernel.org/r/20240411061713.1847574-8-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/ksm: use ksm_get_folio in scan_get_next_rmap_item  [Alex Shi (tencent)]
Save a compound_head call. Link: https://lkml.kernel.org/r/20240411061713.1847574-7-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/ksm: use folio in stable_node_dup  [Alex Shi (tencent)]
Use ksm_get_folio() and save 2 compound_head calls. Link: https://lkml.kernel.org/r/20240411061713.1847574-6-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/ksm: use folio in remove_stable_node  [Alex Shi (tencent)]
Pages in the stable tree are all single normal pages, so use ksm_get_folio() and folio_set_stable_node(); this also saves 3 calls to compound_head(). Link: https://lkml.kernel.org/r/20240411061713.1847574-5-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/ksm: add folio_set_stable_node  [Alex Shi (tencent)]
Turn set_page_stable_node() into a wrapper around folio_set_stable_node(), and then use the latter to replace the former. We will merge them together after all places are converted to folios. Link: https://lkml.kernel.org/r/20240411061713.1847574-4-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
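The wrapper pattern described above looks roughly like the following sketch; the ksm_stable_node struct name and the PAGE_MAPPING_KSM encoding in folio->mapping are assumptions drawn from how KSM already tags pages, not the literal patch:

    static inline void folio_set_stable_node(struct folio *folio,
                                             struct ksm_stable_node *stable_node)
    {
            /* KSM keeps the stable-node pointer encoded in folio->mapping */
            folio->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
    }

    /* old helper becomes a thin wrapper until every caller takes a folio */
    static inline void set_page_stable_node(struct page *page,
                                            struct ksm_stable_node *stable_node)
    {
            folio_set_stable_node(page_folio(page), stable_node);
    }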
2024-05-05  mm/ksm: use folio in remove_rmap_item_from_tree  [Alex Shi (tencent)]
To save 2 compound_head calls. Link: https://lkml.kernel.org/r/20240411061713.1847574-3-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/ksm: add ksm_get_folio  [Alex Shi (tencent)]
Patch series "transfer page to folio in KSM". This is the first part of page to folio transfer on KSM. Since only single page could be stored in KSM, we could safely transfer stable tree pages to folios. This patchset could reduce ksm.o 57kbytes from 2541776 bytes on latest akpm/mm-stable branch with CONFIG_DEBUG_VM enabled. It pass the KSM testing in LTP and kernel selftest. Thanks for Matthew Wilcox and David Hildenbrand's suggestions and comments! This patch (of 10): The ksm only contains single pages, so we could add a new func ksm_get_folio for get_ksm_page to use folio instead of pages to save a couple of compound_head calls. After all caller replaced, get_ksm_page will be removed. Link: https://lkml.kernel.org/r/20240411061713.1847574-1-alexs@kernel.org Link: https://lkml.kernel.org/r/20240411061713.1847574-2-alexs@kernel.org Signed-off-by: Alex Shi (tencent) <alexs@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio()  [David Hildenbrand]
Let's simplify and only print the page mapcount: we already print the large folio mapcount and the entire folio mapcount for large folios separately; that should be sufficient to figure out what's happening. While at it, print the page mapcount also if it had an underflow, filtering out only typed pages. Link: https://lkml.kernel.org/r/20240409192301.907377-18-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/migrate_device: use folio_mapcount() in migrate_vma_check_page()  [David Hildenbrand]
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's convert migrate_vma_check_page() to work on a folio internally so we can remove the page_mapcount() usage. Note that we reject any large folios. There is a lot more folio conversion to be had, but that has to wait for another day. No functional change intended. Link: https://lkml.kernel.org/r/20240409192301.907377-15-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/filemap: use folio_mapcount() in filemap_unaccount_folio()  [David Hildenbrand]
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's use folio_mapcount() instead of page_mapcount() in filemap_unaccount_folio(). No functional change intended, because we're only dealing with small folios. Link: https://lkml.kernel.org/r/20240409192301.907377-14-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/migrate: use folio_likely_mapped_shared() in add_page_for_migration()  [David Hildenbrand]
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. In add_page_for_migration(), we actually want to check if the folio is mapped shared, to reject such folios. So let's use folio_likely_mapped_shared() instead. For small folios, fully mapped THP, and hugetlb folios, there is no change. For partially mapped, shared THP, we should now do a better job at rejecting such folios. Link: https://lkml.kernel.org/r/20240409192301.907377-12-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()  [David Hildenbrand]
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. For tracing purposes, we use page_mapcount() in __alloc_contig_migrate_range(). Adding that mapcount to total_mapped sounds strange: total_migrated and total_reclaimed would count each page only once, not multiple times. But then, isolate_migratepages_range() adds each folio only once to the list. So for large folios, we would query the mapcount of the first page of the folio, which doesn't make too much sense for large folios. Let's simply use folio_mapped() * folio_nr_pages(), which makes more sense as nr_migratepages is also incremented by the number of pages in the folio in case of successful migration. Link: https://lkml.kernel.org/r/20240409192301.907377-11-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
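A sketch of the accounting change described above; variable names follow the description rather than the literal diff:

    /* before: mapcount of the folio's first page, counted once per folio */
    total_mapped += page_mapcount(page);

    /* after: count every page of a mapped folio, matching how
     * nr_migratepages is advanced by folio_nr_pages() on success */
    if (folio_mapped(folio))
            total_mapped += folio_nr_pages(folio);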
2024-05-05  mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()  [David Hildenbrand]
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. We can only unmap full folios; page_mapped(), which we check here, is translated to folio_mapped() -- based on folio_mapcount(). So let's print the folio mapcount instead. Link: https://lkml.kernel.org/r/20240409192301.907377-10-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check  [David Hildenbrand]
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's similarly check for folio_mapcount() underflows instead of page_mapcount() underflows like we do in zap_present_folio_ptes() now. Instead of the VM_BUG_ON(), we should actually be doing something like print_bad_pte(). For now, let's keep it simple and use WARN_ON_ONCE(), performing that check independently of DEBUG_VM. Link: https://lkml.kernel.org/r/20240409192301.907377-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
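The check change described above looks roughly like this; the exact form of the old assertion is an assumption:

    /* before, only with CONFIG_DEBUG_VM: per-page mapcount underflow check */
    VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);

    /* after: folio-wide underflow check, active independently of DEBUG_VM */
    WARN_ON_ONCE(folio_mapcount(folio) < 0);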
2024-05-05  mm/memory: use folio_mapcount() in zap_present_folio_ptes()  [David Hildenbrand]
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. In zap_present_folio_ptes(), let's simply check the folio mapcount(). If there is some issue, it will underflow at some point either way when unmapping. As indicated already in commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP"), we already documented "If we ever have a cheap folio_mapcount(), we might just want to check for underflows there.". There is no change for small folios. For large folios, we'll now catch more underflows when batch-unmapping, because instead of only testing the mapcount of the first subpage, we'll test if the folio mapcount underflows. Link: https://lkml.kernel.org/r/20240409192301.907377-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm: track mapcount of large folios in single value  [David Hildenbrand]
Let's track the mapcount of large folios in a single value. The mapcount of a large folio currently corresponds to the sum of the entire mapcount and all page mapcounts. This sum is what we actually want to know in folio_mapcount() and it is also sufficient for implementing folio_mapped(). With PTE-mapped THP becoming more important and more widely used, we want to avoid looping over all pages of a folio just to obtain the mapcount of large folios. The comment "In the common case, avoid the loop when no pages mapped by PTE" in folio_total_mapcount() no longer holds for mTHP that are always mapped by PTE. Further, we are planning on using folio_mapcount() more frequently, and might even want to remove page mapcounts for large folios in some kernel configs. Therefore, allow for reading the mapcount of large folios efficiently and atomically without looping over any pages. Maintain the mapcount also for hugetlb pages for simplicity. Use the new mapcount to implement folio_mapcount() and folio_mapped(). Make page_mapped() simply call folio_mapped(). We can now get rid of folio_large_is_mapped(). _nr_pages_mapped is now only used in rmap code and for debugging purposes. Keep folio_nr_pages_mapped() around, but document that its use should be limited to rmap internals and debugging purposes. This change implies one additional atomic add/sub whenever mapping/unmapping (parts of) a large folio. As we now batch RMAP operations for PTE-mapped THP during fork(), during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust the large mapcount for a PTE batch only once, the added overhead in the common case is small. Only when unmapping individual pages of a large folio (e.g., during COW), the overhead might be bigger in comparison, but it's essentially one additional atomic operation. Note that before the new mapcount would overflow, already our refcount would overflow: each mapping requires a folio reference. Extend the documentation of folio_mapcount(). Link: https://lkml.kernel.org/r/20240409192301.907377-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
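With the new counter, the readers described above reduce to something like the sketch below; the _large_mapcount field name is an assumption, treat the code as illustrative rather than the exact implementation:

    static inline int folio_mapcount(const struct folio *folio)
    {
            if (likely(!folio_test_large(folio)))
                    return atomic_read(&folio->_mapcount) + 1;
            /* single atomic counter for the whole large folio: no page loop */
            return atomic_read(&folio->_large_mapcount) + 1;
    }

    static inline bool folio_mapped(const struct folio *folio)
    {
            return folio_mapcount(folio) >= 1;
    }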
2024-05-05  mm/rmap: add fast-path for small folios when adding/removing/duplicating  [David Hildenbrand]
Let's add a fast-path for small folios to all relevant rmap functions. Note that only RMAP_LEVEL_PTE applies. This is a preparation for tracking the mapcount of large folios in a single value. Link: https://lkml.kernel.org/r/20240409192301.907377-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Chris Zankel <chris@zankel.net> Cc: Hugh Dickins <hughd@google.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Richard Chang <richardycc@google.com> Cc: Rich Felker <dalias@libc.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm: follow_pte() improvements  [David Hildenbrand]
follow_pte() is now our main function to lookup PTEs in VM_PFNMAP/VM_IO VMAs. Let's perform some more sanity checks to make this exported function harder to abuse. Further, extend the doc a bit, it still focuses on the KVM use case with MMU notifiers. Drop the KVM+follow_pfn() comment, follow_pfn() is no more, and we have other users nowadays. Also extend the doc regarding refcounted pages and the interaction with MMU notifiers. KVM is one example that uses MMU notifiers and can deal with refcounted pages properly. VFIO is one example that doesn't use MMU notifiers, and to prevent use-after-free, rejects refcounted pages: pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)). Protection changes are less of a concern for users like VFIO: the behavior is similar to longterm-pinning a page, and getting the PTE protection changed afterwards. The primary concern with refcounted pages is use-after-free, which callers should be aware of. Link: https://lkml.kernel.org/r/20240410155527.474777-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm: pass VMA instead of MM to follow_pte()  [David Hildenbrand]
... and centralize the VM_IO/VM_PFNMAP sanity check in there. We'll now also perform these sanity checks for direct follow_pte() invocations. For generic_access_phys(), we might now check multiple times: nothing to worry about, really. Link: https://lkml.kernel.org/r/20240410155527.474777-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> [KVM] Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Fei Li <fei1.li@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Yonghua Huang <yonghua.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
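Callers now hand in the VMA and the sanity check lives in one place; a sketch, with the error value assumed and the PTE walk elided:

    int follow_pte(struct vm_area_struct *vma, unsigned long address,
                   pte_t **ptepp, spinlock_t **ptlp)
    {
            struct mm_struct *mm = vma->vm_mm;

            /* centralized VM_IO/VM_PFNMAP sanity check, now for every caller */
            if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
                    return -EINVAL;

            /* ... the existing PTE walk on mm continues unchanged ... */
    }

    /* call sites change from follow_pte(vma->vm_mm, ...) to: */
    ret = follow_pte(vma, addr, &ptep, &ptl);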
2024-05-05  mm,swap: add document about RCU read lock and swapoff interaction  [Huang Ying]
While reviewing a patch to fix the race condition between free_swap_and_cache() and swapoff() [1], it was found that the documentation about how to prevent racing with swapoff isn't clear enough, especially that the RCU read lock can prevent swapoff from freeing data structures. So, add the documentation as comments. [1] https://lore.kernel.org/linux-mm/c8fe62d0-78b8-527a-5bef-ee663ccdc37a@huawei.com/ Link: https://lkml.kernel.org/r/20240407065450.498821-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/mmap: make accountable_mapping return bool  [Hao Ge]
accountable_mapping() can return bool, so change it. Link: https://lkml.kernel.org/r/20240407063843.804274-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/mmap: make vma_wants_writenotify return bool  [Hao Ge]
vma_wants_writenotify() should return bool, so change it. Link: https://lkml.kernel.org/r/20240407062653.803142-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
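The change is just the return type; roughly, with the parameter list reproduced from memory:

    /* before */
    int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
    /* after: the function only ever answers yes or no */
    bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);

accountable_mapping() in the previous entry gets the same treatment.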
2024-05-05  memory tier: create CPUless memory tiers after obtaining HMAT info  [Ho-Ren (Jack) Chuang]
The current implementation treats emulated memory devices, such as CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory (E820_TYPE_RAM). However, these emulated devices have different characteristics than traditional DRAM, making it important to distinguish them. Thus, we modify the tiered memory initialization process to introduce a delay specifically for CPUless NUMA nodes. This delay ensures that the memory tier initialization for these nodes is deferred until HMAT information is obtained during the boot process. Finally, demotion tables are recalculated at the end. * late_initcall(memory_tier_late_init); Some device drivers may have initialized memory tiers between `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing online memory nodes and configuring memory tiers. They should be excluded in the late init. * Handle cases where there is no HMAT when creating memory tiers There is a scenario where a CPUless node does not provide HMAT information. If no HMAT is specified, it falls back to using the default DRAM tier. * Introduce another new lock `default_dram_perf_lock` for adist calculation In the current implementation, iterating through CPUlist nodes requires holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up trying to acquire the same lock, leading to a potential deadlock. Therefore, we propose introducing a standalone `default_dram_perf_lock` to protect `default_dram_perf_*`. This approach not only avoids deadlock but also prevents holding a large lock simultaneously. * Upgrade `set_node_memory_tier` to support additional cases, including default DRAM, late CPUless, and hot-plugged initializations. To cover hot-plugged memory nodes, `mt_calc_adistance()` and `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to handle cases where memtype is not initialized and where HMAT information is available. * Introduce `default_memory_types` for those memory types that are not initialized by device drivers. Because late initialized memory and default DRAM memory need to be managed, a default memory type is created for storing all memory types that are not initialized by device drivers and as a fallback. Link: https://lkml.kernel.org/r/20240405000707.2670063-3-horenchuang@bytedance.com Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gregory Price <gourry.memverge@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawie.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types  [Ho-Ren (Jack) Chuang]
Patch series "Improved Memory Tier Creation for CPUless NUMA Nodes", v11. When a memory device, such as CXL1.1 type3 memory, is emulated as normal memory (E820_TYPE_RAM), the memory device is indistinguishable from normal DRAM in terms of memory tiering with the current implementation. The current memory tiering assigns all detected normal memory nodes to the same DRAM tier. This results in normal memory devices with different attributions being unable to be assigned to the correct memory tier, leading to the inability to migrate pages between different types of memory. https://lore.kernel.org/linux-mm/PH0PR08MB7955E9F08CCB64F23963B5C3A860A@PH0PR08MB7955.namprd08.prod.outlook.com/T/ This patchset automatically resolves the issues. It delays the initialization of memory tiers for CPUless NUMA nodes until they obtain HMAT information and after all devices are initialized at boot time, eliminating the need for user intervention. If no HMAT is specified, it falls back to using `default_dram_type`. Example usecase: We have CXL memory on the host, and we create VMs with a new system memory device backed by host CXL memory. We inject CXL memory performance attributes through QEMU, and the guest now sees memory nodes with performance attributes in HMAT. With this change, we enable the guest kernel to construct the correct memory tiering for the memory nodes. This patch (of 2): Since different memory devices require finding, allocating, and putting memory types, these common steps are abstracted in this patch, enhancing the scalability and conciseness of the code. Link: https://lkml.kernel.org/r/20240405000707.2670063-1-horenchuang@bytedance.com Link: https://lkml.kernel.org/r/20240405000707.2670063-2-horenchuang@bytedance.com Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawie.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gregory Price <gourry.memverge@gmail.com> Cc: Hao Xiang <hao.xiang@bytedance.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm,page_owner: don't remove __GFP_NOLOCKDEP in add_stack_record_to_list  [Christoph Hellwig]
Otherwise we'll generate false lockdep positives. Link: https://lkml.kernel.org/r/20240429082828.1615986-1-hch@lst.de Fixes: 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-05  mm/vmalloc: fix return value of vb_alloc if size is 0  [Hailong.Liu]
vm_map_ram() uses IS_ERR() to validate the return value of vb_alloc(). If vm_map_ram(page, 0, 0) is executed, vb_alloc(0, GFP_KERNEL) would return NULL. In such a case, IS_ERR() cannot catch the NULL return value, which eventually leads to a kernel panic in vmap_pages_range_noflush(). To resolve this issue, return ERR_PTR(-EINVAL) if the size is 0. Link: https://lkml.kernel.org/r/20240426024149.21176-1-hailong.liu@oppo.com Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
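The fix amounts to an early bail-out at the top of vb_alloc(); a sketch, with the rest of the function elided:

    static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
    {
            /* a zero-size request used to fall through and return NULL,
             * which IS_ERR() in vm_map_ram() does not treat as an error */
            if (unlikely(!size))
                    return ERR_PTR(-EINVAL);

            /* ... existing per-CPU vmap block allocation ... */
    }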
2024-05-05  mm: use memalloc_nofs_save() in page_cache_ra_order()  [Kefeng Wang]
See commit f2c817bed58d ("mm: use memalloc_nofs_save in readahead path"); ensure that page_cache_ra_order() does not attempt to reclaim file-backed pages either, or it leads to a deadlock. This issue was found when testing ext4 large folios. INFO: task DataXceiver for:7494 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:DataXceiver for state:D stack:0 pid:7494 ppid:1 flags:0x00000200 Call trace: __switch_to+0x14c/0x240 __schedule+0x82c/0xdd0 schedule+0x58/0xf0 io_schedule+0x24/0xa0 __folio_lock+0x130/0x300 migrate_pages_batch+0x378/0x918 migrate_pages+0x350/0x700 compact_zone+0x63c/0xb38 compact_zone_order+0xc0/0x118 try_to_compact_pages+0xb0/0x280 __alloc_pages_direct_compact+0x98/0x248 __alloc_pages+0x510/0x1110 alloc_pages+0x9c/0x130 folio_alloc+0x20/0x78 filemap_alloc_folio+0x8c/0x1b0 page_cache_ra_order+0x174/0x308 ondemand_readahead+0x1c8/0x2b8 page_cache_async_ra+0x68/0xb8 filemap_readahead.isra.0+0x64/0xa8 filemap_get_pages+0x3fc/0x5b0 filemap_splice_read+0xf4/0x280 ext4_file_splice_read+0x2c/0x48 [ext4] vfs_splice_read.part.0+0xa8/0x118 splice_direct_to_actor+0xbc/0x288 do_splice_direct+0x9c/0x108 do_sendfile+0x328/0x468 __arm64_sys_sendfile64+0x8c/0x148 invoke_syscall+0x4c/0x118 el0_svc_common.constprop.0+0xc8/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x4c/0x1f8 el0t_64_sync_handler+0xc0/0xc8 el0t_64_sync+0x188/0x190 Link: https://lkml.kernel.org/r/20240426112938.124740-1-wangkefeng.wang@huawei.com Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
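The fix brackets the folio allocation inside page_cache_ra_order() with a NOFS scope; the sketch below shows only the placement, not the full function:

    unsigned int nofs;

    /* forbid FS reclaim while allocating large readahead folios, so the
     * allocation cannot recurse into the filesystem and deadlock on a
     * folio the caller already holds locked */
    nofs = memalloc_nofs_save();
    /* ... filemap_alloc_folio() + read_pages() ... */
    memalloc_nofs_restore(nofs);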
2024-05-05  mm: page_owner: fix wrong information in dump_page_owner  [Maninder Singh]
With commit ea4b5b33bf8a ("mm,page_owner: update metadata for tail pages"), the new API __update_page_owner_handle was introduced, and its arguments were passed in the wrong order from __set_page_owner; thus page_owner gives wrong data. [ 15.982420] page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL), pid 80, tgid -1210279584 (insmod), ts 80, free_ts 0 Fix this. Correct output: [ 14.556482] page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL), pid 80, tgid 80 (insmod), ts 14552004992, free_ts 0 Link: https://lkml.kernel.org/r/20240424111838.3782931-1-hariom1.p@samsung.com Fixes: ea4b5b33bf8a ("mm,page_owner: update metadata for tail pages") Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Hariom Panthi <hariom1.p@samsung.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Rohit Thapliyal <r.thapliyal@samsung.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-02  swapon(2): open swap with O_EXCL  [Al Viro]
... eliminating the need to reopen block devices so they could be exclusively held. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02  swapon(2)/swapoff(2): don't bother with block size  [Al Viro]
once upon a time that used to matter; these days we do swap IO for swap devices at the level that doesn't give a damn about block size, buffer_head or anything of that sort - just attach the page to bio, set the location and size (the latter to PAGE_SIZE) and feed into queue. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02  mm/slub: remove the check for NULL kmalloc_caches  [Hyunmin Lee]
If the same size kmalloc cache already exists, it should not be created again. So there is a check for NULL kmalloc_caches before calling the kmalloc cache creation function. However, new_kmalloc_cache() itself checks for NULL kmalloc_caches before cache creation. Therefore, the NULL check is not necessary in this function. Signed-off-by: Hyunmin Lee <hyunminlr@gmail.com> Co-developed-by: Jeungwoo Yoo <casionwoo@gmail.com> Signed-off-by: Jeungwoo Yoo <casionwoo@gmail.com> Co-developed-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
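The resulting simplification at the call site looks roughly like this; the loop context is elided and the exact diff may differ:

    /* before: caller guarded every creation */
    if (!kmalloc_caches[type][i])
            new_kmalloc_cache(i, type, flags);

    /* after: new_kmalloc_cache() already returns early when the cache
     * exists, so the caller can invoke it unconditionally */
    new_kmalloc_cache(i, type, flags);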
2024-05-02  mm/slub: create kmalloc 96 and 192 caches regardless cache size order  [Hyunmin Lee]
For SLAB, the kmalloc caches needed to be created in ascending size order. However, the constraint is not necessary anymore because SLAB has been removed and SLUB doesn't need to comply with it. Thus, the kmalloc 96 and 192 caches can simply be created after the other kmalloc size caches, instead of checking on every iteration whether it is their turn to be created. Also, this change prevents engineers from being confused by the removed constraint. Signed-off-by: Hyunmin Lee <hyunminlr@gmail.com> Co-developed-by: Jeungwoo Yoo <casionwoo@gmail.com> Signed-off-by: Jeungwoo Yoo <casionwoo@gmail.com> Co-developed-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-05-02  mm/slub: mark racy access on slab->freelist  [linke li]
In deactivate_slab(), slab->freelist can be changed concurrently. Mark data race on slab->freelist as benign using READ_ONCE. This patch is aimed at reducing the number of benign races reported by KCSAN in order to focus future debugging effort on harmful races. Signed-off-by: linke li <lilinke99@qq.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
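The annotation is a one-liner; the surrounding deactivate_slab() logic is unchanged:

    /* slab->freelist may be updated concurrently; the race is benign, so
     * annotate the read to keep KCSAN from reporting it */
    freelist = READ_ONCE(slab->freelist);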
2024-05-01  mm: Export writeback_iter()  [David Howells]
Export writeback_iter() so that it can be used by netfslib as a module. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Christoph Hellwig <hch@lst.de> cc: linux-mm@kvack.org
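A sketch of the export and of the intended module-side loop; whether the export is the _GPL variant, and the write helper name, are assumptions:

    struct folio *writeback_iter(struct address_space *mapping,
                    struct writeback_control *wbc, struct folio *folio, int *error);
    EXPORT_SYMBOL_GPL(writeback_iter);

    /* typical use in a module's ->writepages() implementation */
    struct folio *folio = NULL;
    int error = 0;

    while ((folio = writeback_iter(mapping, wbc, folio, &error)))
            error = my_write_folio(folio, wbc);     /* hypothetical helper */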
2024-05-01  mm: Provide a means of invalidation without using launder_folio  [David Howells]
Implement a replacement for launder_folio. The key feature of invalidate_inode_pages2() is that it locks each folio individually, unmaps it to prevent mmap'd accesses interfering and calls the ->launder_folio() address_space op to flush it. This has problems: firstly, each folio is written individually as one or more small writes; secondly, adjacent folios cannot be added so easily into the laundry; thirdly, it's yet another op to implement. Instead, use the invalidate lock to cause anyone wanting to add a folio to the inode to wait, then unmap all the folios if we have mmaps, then, conditionally, use ->writepages() to flush any dirty data back and then discard all pages. The invalidate lock prevents ->read_iter(), ->write_iter() and faulting through mmap all from adding pages for the duration. This is then used from netfslib to handle the flushing in unbuffered and direct writes. Signed-off-by: David Howells <dhowells@redhat.com> cc: Matthew Wilcox <willy@infradead.org> cc: Miklos Szeredi <miklos@szeredi.hu> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Christoph Hellwig <hch@lst.de> cc: Andrew Morton <akpm@linux-foundation.org> cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Christian Brauner <brauner@kernel.org> cc: Jeff Layton <jlayton@kernel.org> cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: devel@lists.orangefs.org
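The replacement sequence described above is roughly the following; each call here is an existing kernel API, but the exact helper the patch itself adds may differ in detail:

    filemap_invalidate_lock(inode->i_mapping);      /* block new folios/faults */
    if (mapping_mapped(inode->i_mapping))
            unmap_mapping_pages(inode->i_mapping, 0, ULONG_MAX, false);
    filemap_write_and_wait(inode->i_mapping);       /* flush via ->writepages() */
    truncate_inode_pages(inode->i_mapping, 0);      /* discard everything */
    filemap_invalidate_unlock(inode->i_mapping);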
2024-05-01  mm/slub: avoid zeroing outside-object freepointer for single free  [Nicolas Bouchinet]
Commit 284f17ac13fe ("mm/slub: handle bulk and single object freeing separately") splits single and bulk object freeing in two functions slab_free() and slab_free_bulk() which leads slab_free() to call slab_free_hook() directly instead of slab_free_freelist_hook(). If `init_on_free` is set, slab_free_hook() zeroes the object. Afterward, if `slub_debug=F` and `CONFIG_SLAB_FREELIST_HARDENED` are set, the do_slab_free() slowpath executes freelist consistency checks and try to decode a zeroed freepointer which leads to a "Freepointer corrupt" detection in check_object(). During bulk free, slab_free_freelist_hook() isn't affected as it always sets it objects freepointer using set_freepointer() to maintain its reconstructed freelist after `init_on_free`. For single free, object's freepointer thus needs to be avoided when stored outside the object if `init_on_free` is set. The freepointer left as is, check_object() may later detect an invalid pointer value due to objects overflow. To reproduce, set `slub_debug=FU init_on_free=1 log_level=7` on the command line of a kernel build with `CONFIG_SLAB_FREELIST_HARDENED=y`. dmesg sample log: [ 10.708715] ============================================================================= [ 10.710323] BUG kmalloc-rnd-05-32 (Tainted: G B T ): Freepointer corrupt [ 10.712695] ----------------------------------------------------------------------------- [ 10.712695] [ 10.712695] Slab 0xffffd8bdc400d580 objects=32 used=4 fp=0xffff9d9a80356f80 flags=0x200000000000a00(workingset|slab|node=0|zone=2) [ 10.716698] Object 0xffff9d9a80356600 @offset=1536 fp=0x7ee4f480ce0ecd7c [ 10.716698] [ 10.716698] Bytes b4 ffff9d9a803565f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 10.720703] Object ffff9d9a80356600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 10.720703] Object ffff9d9a80356610: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 10.724696] Padding ffff9d9a8035666c: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 10.724696] Padding ffff9d9a8035667c: 00 00 00 00 .... [ 10.724696] FIX kmalloc-rnd-05-32: Object at 0xffff9d9a80356600 not freed Fixes: 284f17ac13fe ("mm/slub: handle bulk and single object freeing separately") Cc: <stable@vger.kernel.org> Co-developed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-04-29  mm: Remove the PG_fscache alias for PG_private_2  [David Howells]
Remove the PG_fscache alias for PG_private_2 and use the latter directly. Use of this flag for marking pages undergoing writing to the cache should be considered deprecated and the folios should be marked dirty instead and the write done in ->writepages(). Note that PG_private_2 itself should be considered deprecated and up for future removal by the MM folks too. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> cc: Shyam Prasad N <sprasad@microsoft.com> cc: Tom Talpey <tom@talpey.com> cc: Bharath SM <bharathsm@microsoft.com> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Anna Schumaker <anna@kernel.org> cc: netfs@lists.linux.dev cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org
2024-04-26  Merge branch 'memory-observability' into x86/amd  [Joerg Roedel]
2024-04-25  mm: kmsan: implement kmsan_memmove()  [Alexander Potapenko]
Provide a hook that can be used by custom memcpy implementations to tell KMSAN that the metadata needs to be copied. Without that, false positive reports are possible in the cases where KMSAN fails to intercept memory initialization. Link: https://lore.kernel.org/all/3b7dbd88-0861-4638-b2d2-911c97a4cadf@I-love.SAKURA.ne.jp/ Link: https://lkml.kernel.org/r/20240320101851.2589698-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
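Typical use from a custom copy routine; __arch_copy_nocheck() is a hypothetical stand-in for whatever uninstrumented copy the caller actually uses:

    #include <linux/kmsan.h>

    static void copy_with_metadata(void *dst, const void *src, size_t n)
    {
            __arch_copy_nocheck(dst, src, n);       /* hypothetical raw copy */
            /* KMSAN did not see the copy above, so move the shadow/origin
             * metadata by hand to avoid false "uninit value" reports */
            kmsan_memmove(dst, src, n);
    }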
2024-04-25  mm: convert free_zone_device_page to free_zone_device_folio  [Matthew Wilcox (Oracle)]
Both callers already have a folio; pass it in and save a few calls to compound_head(). Link: https://lkml.kernel.org/r/20240405153228.2563754-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: combine __folio_put_small, __folio_put_large and __folio_put  [Matthew Wilcox (Oracle)]
It's now obvious that __folio_put_small() and __folio_put_large() do almost exactly the same thing. Inline them both into __folio_put(). Link: https://lkml.kernel.org/r/20240405153228.2563754-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: inline destroy_large_folio() into __folio_put_large()  [Matthew Wilcox (Oracle)]
destroy_large_folio() has only one caller, move its contents there. Link: https://lkml.kernel.org/r/20240405153228.2563754-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: combine free_the_page() and free_unref_page()  [Matthew Wilcox (Oracle)]
The pcp_allowed_order() check in free_the_page() was only being skipped by __folio_put_small() which is about to be rearranged. Link: https://lkml.kernel.org/r/20240405153228.2563754-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: free non-hugetlb large folios in a batch  [Matthew Wilcox (Oracle)]
Patch series "Clean up __folio_put()". With all the changes over the last few years, __folio_put_small and __folio_put_large have become almost identical to each other ... except you can't tell because they're spread over two files. Rearrange it all so that you can tell, and then inline them both into __folio_put(). This patch (of 5): free_unref_folios() can now handle non-hugetlb large folios, so keep normal large folios in the batch. hugetlb folios still need to be handled specially. [peterx@redhat.com: fix panic] Link: https://lkml.kernel.org/r/ZikjPB0Dt5HA8-uL@x1n Link: https://lkml.kernel.org/r/20240405153228.2563754-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240405153228.2563754-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  mm: convert pagecache_isize_extended to use a folio  [Matthew Wilcox (Oracle)]
Remove four hidden calls to compound_head(). Also exit early if the filesystem block size is >= PAGE_SIZE instead of just equal to PAGE_SIZE. Link: https://lkml.kernel.org/r/20240405180038.2618624-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
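The early-exit change looks roughly like this, with the remainder of the function (which now works on a folio looked up once) elided:

    void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
    {
            unsigned int bsize = i_blocksize(inode);

            /* was an "== PAGE_SIZE" check: blocks at least a page wide have
             * no partial block within the page that needs dirtying */
            if (bsize >= PAGE_SIZE)
                    return;

            /* ... look up the straddling folio once and mark it dirty ... */
    }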
2024-04-25  mm/hugetlb: pass correct order_per_bit to cma_declare_contiguous_nid  [Frank van der Linden]
The hugetlb_cma code passes 0 in the order_per_bit argument to cma_declare_contiguous_nid (the alignment, computed using the page order, is correctly passed in). This causes a bit in the cma allocation bitmap to always represent a 4k page, making the bitmaps potentially very large, and slower. It would create bitmaps that would be pretty big. E.g. for a 4k page size on x86, hugetlb_cma=64G would mean a bitmap size of (64G / 4k) / 8 == 2M. With HUGETLB_PAGE_ORDER as order_per_bit, as intended, this would be (64G / 2M) / 8 == 4k. So, that's quite a difference. Also, this restricted the hugetlb_cma area to ((PAGE_SIZE << MAX_PAGE_ORDER) * 8) * PAGE_SIZE (e.g. 128G on x86), since bitmap_alloc uses normal page allocation, and is thus restricted by MAX_PAGE_ORDER. Specifying anything above that would fail the CMA initialization. So, correctly pass in the order instead. Link: https://lkml.kernel.org/r/20240404162515.527802-2-fvdl@google.com Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma") Signed-off-by: Frank van der Linden <fvdl@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
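The fix is the order_per_bit argument in hugetlb_cma's reservation call; the surrounding arguments here are paraphrased, not the literal diff:

    /* before: one bitmap bit per 4k page */
    cma_declare_contiguous_nid(0, size, 0,
                    PAGE_SIZE << HUGETLB_PAGE_ORDER,        /* alignment */
                    0,                                      /* order_per_bit */
                    false, name, &hugetlb_cma[nid], nid);

    /* after: one bitmap bit per huge page, e.g. 2M instead of 4k on x86 */
    cma_declare_contiguous_nid(0, size, 0,
                    PAGE_SIZE << HUGETLB_PAGE_ORDER,
                    HUGETLB_PAGE_ORDER,
                    false, name, &hugetlb_cma[nid], nid);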
2024-04-25  mm/cma: drop incorrect alignment check in cma_init_reserved_mem  [Frank van der Linden]
cma_init_reserved_mem uses IS_ALIGNED to check if the size represented by one bit in the cma allocation bitmask is aligned with CMA_MIN_ALIGNMENT_BYTES (pageblock size). However, this is too strict, as this will fail if order_per_bit > pageblock_order, which is a valid configuration. We could check IS_ALIGNED both ways, but since both numbers are powers of two, no check is needed at all. Link: https://lkml.kernel.org/r/20240404162515.527802-1-fvdl@google.com Fixes: de9e14eebf33 ("drivers: dma-contiguous: add initialization from device tree") Signed-off-by: Frank van der Linden <fvdl@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  hugetlb: convert hugetlb_wp() to use struct vm_fault  [Vishal Moola (Oracle)]
hugetlb_wp() can use the struct vm_fault passed in from hugetlb_fault(). This alleviates the stack by consolidating 5 variables into a single struct. [vishal.moola@gmail.com: simplify hugetlb_wp() arguments] Link: https://lkml.kernel.org/r/ZhQtoFNZBNwBCeXn@fedora Link: https://lkml.kernel.org/r/20240401202651.31440-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25  hugetlb: convert hugetlb_no_page() to use struct vm_fault  [Vishal Moola (Oracle)]
hugetlb_no_page() can use the struct vm_fault passed in from hugetlb_fault(). This alleviates the stack by consolidating 7 variables into a single struct. [vishal.moola@gmail.com: simplify hugetlb_no_page() arguments] Link: https://lkml.kernel.org/r/ZhQtN8y5zud8iI1u@fedora Link: https://lkml.kernel.org/r/20240401202651.31440-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
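The conversion replaces a long scalar argument list with the vm_fault that hugetlb_fault() already has; the signatures below are illustrative, not the exact diff:

    /* before */
    static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
                    struct vm_area_struct *vma, struct address_space *mapping,
                    pgoff_t idx, unsigned long address, pte_t *ptep,
                    pte_t old_pte, unsigned int flags);

    /* after: the fault state is reachable as vmf->vma, vmf->address,
     * vmf->pgoff, vmf->orig_pte, vmf->flags, ... */
    static vm_fault_t hugetlb_no_page(struct address_space *mapping,
                    struct vm_fault *vmf);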