path: root/mm
Age  Commit message  Author
2022-07-03  mm/damon/reclaim: add 'damon_reclaim_' prefix to 'enabled_store()'  SeongJae Park
This commit adds 'damon_reclaim_' prefix to 'enabled_store()', so that we can distinguish it easily from the stack trace using 'faddr2line.sh' like tools. Link: https://lkml.kernel.org/r/20220606182310.48781-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03  mm/damon/reclaim: make 'enabled' checking timer simpler  SeongJae Park
DAMON_RECLAIM's 'enabled' parameter store callback ('enabled_store()') schedules the parameter check timer ('damon_reclaim_timer') if the parameter is set to 'Y'. Then, the timer schedules itself to check whether the user has set the parameter to 'N'. It's unnecessarily complex. This commit makes it simpler by making the parameter store callback schedule the timer regardless of the parameter value and disabling the timer's self scheduling. Link: https://lkml.kernel.org/r/20220606182310.48781-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03  mm/damon/sysfs: deduplicate inputs applying  SeongJae Park
DAMON sysfs interface's DAMON context building and its online parameter update have duplicated code. This commit removes the duplicate. Link: https://lkml.kernel.org/r/20220606182310.48781-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03  mm/damon/reclaim: deduplicate 'commit_inputs' handling  SeongJae Park
DAMON_RECLAIM's handling of 'commit_inputs' parameter is duplicated in 'after_aggregation()' and 'after_wmarks_check()' callbacks. This commit deduplicates the code for better maintenance. Link: https://lkml.kernel.org/r/20220606182310.48781-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03  mm/damon/{dbgfs,sysfs}: move target_has_pid() from dbgfs to damon.h  SeongJae Park
The function for knowing whether a given monitoring context's targets will have a pid or not is defined and used in dbgfs only. However, the logic is also needed for sysfs. This commit moves the code to damon.h and makes both dbgfs and sysfs use it. Link: https://lkml.kernel.org/r/20220606182310.48781-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03  mm/migration: fix potential pte_unmap on an unmapped pte  Miaohe Lin
__migration_entry_wait and migration_entry_wait_on_locked assume pte is always mapped from caller. But this is not the case when it's called from migration_entry_wait_huge and follow_huge_pmd. Add a hugetlbfs variant that calls hugetlb_migration_entry_wait(ptep == NULL) to fix this issue. Link: https://lkml.kernel.org/r/20220530113016.16663-5-linmiaohe@huawei.com Fixes: 30dad30922cc ("mm: migration: add migrate_entry_wait_huge()") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Howells <dhowells@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03  mm/migration: return errno when isolate_huge_page failed  Miaohe Lin
We might fail to isolate a huge page because, e.g., the page is under migration, which has cleared HPageMigratable. We should return an errno in this case rather than always returning 1, which could confuse the user, i.e. the caller might think all of the memory was migrated while the hugetlb page was left behind. We make the prototype of isolate_huge_page consistent with isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page to isolate_hugetlb as suggested by Muchun to improve readability. Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com Fixes: e8db67eb0ded ("mm: migrate: move_pages() supports thp migration") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Huang Ying <ying.huang@intel.com> Reported-by: kernel test robot <lkp@intel.com> (build error) Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03  mm/migration: remove unneeded lock page and PageMovable check  Miaohe Lin
When a non-lru movable page was freed from under us, __ClearPageMovable must have been done. So we can remove the unneeded page lock and PageMovable check here. Also, free_pages_prepare() will clear PG_isolated for us, so we can further remove ClearPageIsolated as suggested by David. Link: https://lkml.kernel.org/r/20220530113016.16663-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Howells <dhowells@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03  mm/page_vma_mapped.c: check possible huge PMD map with transhuge_vma_suitable()  Yang Shi
IIUC page_vma_mapped_walk() checks if the vma is possibly huge-PMD mapped with transparent_hugepage_active() and "pvmw->nr_pages >= HPAGE_PMD_NR". Actually pvmw->nr_pages is returned by compound_nr() or folio_nr_pages(), so the page should be a THP as long as "pvmw->nr_pages >= HPAGE_PMD_NR". And it is guaranteed that a THP is allocated for a valid VMA in the first place. But it may not be PMD mapped if the VMA is a file VMA and it is not properly aligned. transhuge_vma_suitable() is used to do such a check, so use it to replace transparent_hugepage_active(), which is too heavy and overkill. Link: https://lkml.kernel.org/r/20220513191705.457775-1-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
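A minimal sketch of the replacement check described in the entry above (hedged; the surrounding page_vma_mapped_walk() logic and the exact call site are omitted):

    /* The folio spans at least a PMD's worth of pages, and the VMA's
     * range and alignment can actually hold a PMD mapping at this
     * address -- only then is a huge-PMD walk worth attempting. */
    if (pvmw->nr_pages >= HPAGE_PMD_NR &&
        transhuge_vma_suitable(vma, pvmw->address)) {
            /* walk at PMD level */
    }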
2022-07-03  mm: split huge PUD on wp_huge_pud fallback  Gowans, James
Currently the implementation will split the PUD when a fallback is taken inside the create_huge_pud function. This isn't where it should be done: the splitting should be done in wp_huge_pud, just like it's done for PMDs. Reason being that if a fallback is taken during create, there is no PUD yet so there is nothing to split, whereas if a fallback is taken when encountering a write protection fault there is something to split. It looks like this was the original intention with the commit where the splitting was introduced, but somehow it got moved to the wrong place between v1 and v2 of the patch series. Rebase mistake perhaps. Link: https://lkml.kernel.org/r/6f48d622eb8bce1ae5dd75327b0b73894a2ec407.camel@amazon.com Fixes: 327e9fd48972 ("mm: Split huge pages on write-notify or COW") Signed-off-by: James Gowans <jgowans@amazon.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03  mm/rmap: fix dereferencing invalid subpage pointer in try_to_migrate_one()  David Hildenbrand
The subpage we calculate is an invalid pointer for device private pages, because device private pages are mapped via non-present device private entries, not ordinary present PTEs. Let's just not compute broken pointers and fixup later. Move the proper assignment of the correct subpage to the beginning of the function and assert that we really only have a single page in our folio. This currently results in a BUG when tying to compute anon_exclusive, because: [ 528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0 [ 528.739585] #PF: supervisor read access in kernel mode [ 528.745324] #PF: error_code(0x0000) - not-present page [ 528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0 [ 528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex #257 [ 528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021 [ 528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000 [ 528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29 c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74 0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02 [ 528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202 [ 528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0 [ 528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8 [ 528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000 [ 528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540 [ 528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff [ 528.850865] FS: 00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000 [ 528.859891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0 [ 528.874275] PKRU: 55555554 [ 528.877286] Call Trace: [ 528.880016] <TASK> [ 528.882356] ? lock_is_held_type+0xdf/0x130 [ 528.887033] rmap_walk_anon+0x167/0x410 [ 528.891316] try_to_migrate+0x90/0xd0 [ 528.895405] ? try_to_unmap_one+0xe10/0xe10 [ 528.900074] ? anon_vma_ctor+0x50/0x50 [ 528.904260] ? put_anon_vma+0x10/0x10 [ 528.908347] ? invalid_mkclean_vma+0x20/0x20 [ 528.913114] migrate_vma_setup+0x5f4/0x750 [ 528.917691] dmirror_devmem_fault+0x8c/0x250 [test_hmm] [ 528.923532] do_swap_page+0xac0/0xe50 [ 528.927623] ? __lock_acquire+0x4b2/0x1ac0 [ 528.932199] __handle_mm_fault+0x949/0x1440 [ 528.936876] handle_mm_fault+0x13f/0x3e0 [ 528.941256] do_user_addr_fault+0x215/0x740 [ 528.945928] exc_page_fault+0x75/0x280 [ 528.950115] asm_exc_page_fault+0x27/0x30 [ 528.954593] RIP: 0033:0x40366b ... Link: https://lkml.kernel.org/r/20220623205332.319257-1-david@redhat.com Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03  mm: sparsemem: fix missing higher order allocation splitting  Muchun Song
Higher order allocations for vmemmap pages from the buddy allocator must be able to be treated as independent small pages, as they can be freed individually by the caller. There is no problem for higher order vmemmap pages allocated at boot time, since each individual small page will be initialized at boot time. However, it will be an issue for the memory hotplug case, since those higher order vmemmap pages are allocated from the buddy allocator without initializing each individual small page's refcount. The system will panic in put_page_testzero() when CONFIG_DEBUG_VM is enabled if the vmemmap page is freed. Link: https://lkml.kernel.org/r/20220620023019.94257-1-songmuchun@bytedance.com Fixes: d8d55f5616cf ("mm: sparsemem: use page table lock to protect kernel pmd operations") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
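A hedged sketch of the kind of fix this describes (simplified, not the literal patch): when a high-order block from the buddy allocator backs vmemmap pages that may later be freed one small page at a time, split it so every constituent page carries its own reference count.

    struct page *page = alloc_pages_node(nid, gfp_mask, order);

    if (page && order)
            /* Give each small page its own refcount; otherwise a later
             * put_page() on a constituent page trips put_page_testzero()
             * under CONFIG_DEBUG_VM. */
            split_page(page, order);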
2022-07-03  mm/damon: use set_huge_pte_at() to make huge pte old  Baolin Wang
huge_ptep_set_access_flags() cannot make the huge pte old according to the discussion [1]. That means we will always monitor the young state of the hugetlb page even though we have stopped accessing it, and as a result DAMON will get inaccurate access statistics. So change to use set_huge_pte_at() to make the huge pte old and fix this issue. [1] https://lore.kernel.org/all/Yqy97gXI4Nqb7dYo@arm.com/ Link: https://lkml.kernel.org/r/1655692482-28797-1-git-send-email-baolin.wang@linux.alibaba.com Fixes: 49f4203aae06 ("mm/damon: add access checking for hugetlb pages") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
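Roughly, the change swaps the access-flags helper for an explicit write of an old pte. A hedged sketch of the pattern (simplified; the TLB flushing the real code needs is omitted):

    pte_t entry = huge_ptep_get(ptep);

    if (pte_young(entry))
            /* huge_ptep_set_access_flags() refuses to clear the young bit,
             * so write the aged pte back directly. */
            set_huge_pte_at(mm, addr, ptep, pte_mkold(entry));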
2022-07-03  mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages  Axel Rasmussen
When fallocate() is used on a shmem file, the pages we allocate can end up with !PageUptodate. Since UFFDIO_CONTINUE tries to find the existing page the user wants to map with SGP_READ, we would fail to find such a page, since shmem_getpage_gfp returns with a "NULL" pagep for SGP_READ if it discovers !PageUptodate. As a result, UFFDIO_CONTINUE returns -EFAULT, as it would do if the page wasn't found in the page cache at all. This isn't the intended behavior. UFFDIO_CONTINUE is just trying to find if a page exists, and doesn't care whether it still needs to be cleared or not. So, instead of SGP_READ, pass in SGP_NOALLOC. This is the same, except for one critical difference: in the !PageUptodate case, SGP_NOALLOC will clear the page and then return it. With this change, UFFDIO_CONTINUE works properly (succeeds) on a shmem file which has been fallocated, but otherwise not modified. Link: https://lkml.kernel.org/r/20220610173812.1768919-1-axelrasmussen@google.com Fixes: 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem") Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-01  usercopy: use unsigned long instead of uintptr_t  Jason A. Donenfeld
A recent commit factored out a series of annoying (unsigned long) casts into a single variable declaration, but made the pointer type a `uintptr_t` rather than the usual `unsigned long`. This patch changes it to be the integer type more typically used by the kernel to represent addresses. Fixes: 35fb9ae4aa2e ("usercopy: Cast pointer to an integer once") Cc: Matthew Wilcox <willy@infradead.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Joe Perches <joe@perches.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220616143617.449094-1-Jason@zx2c4.com
2022-06-30  memblock: avoid some repetition when adding a new range  Jinyu Tang
The worst case is that the new memory range overlaps all existing regions, which requires type->cnt + 1 empty struct memblock_region slots in the type->regions array. So if type->cnt + 1 + type->cnt is less than type->max, we can insert regions directly rather than calculating the needed amount before the insertion. And because of the merge operation at the end of the function, type->cnt will increase slowly in many cases. This change avoids unnecessary repeated traversal of memblock ranges in many cases when adding a new memory range. Signed-off-by: Jinyu Tang <tjytimi@163.com> [rppt: massaged comment and changelog text] Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
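A hedged sketch of the fast path this describes (condition taken from the text above; the slow path that counts the required slots first is unchanged):

    /* Worst case the new range overlaps every existing region and needs
     * type->cnt + 1 free slots; if cnt + (cnt + 1) still fits below max,
     * insert immediately instead of doing the counting pass first. */
    if (type->cnt * 2 + 1 < type->max)
            insert = true;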
2022-06-29  filemap: Use filemap_read_folio() in do_read_cache_folio()  Matthew Wilcox (Oracle)
By passing ->read_folio to filemap_read_folio(), we can use filemap_read_folio() in do_read_cache_folio(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-29  filemap: Handle AOP_TRUNCATED_PAGE in do_read_cache_folio()  Matthew Wilcox (Oracle)
If the call to filler() returns AOP_TRUNCATED_PAGE, we need to retry the page cache lookup. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
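A hedged sketch of the retry (control flow only; allocation, locking and uptodate handling in the real do_read_cache_folio() are omitted, and the filler signature shown is an assumption):

    repeat:
            folio = filemap_get_folio(mapping, index);
            /* ... allocate and add a folio if none was found ... */
            err = filler(file, folio);
            if (err == AOP_TRUNCATED_PAGE) {
                    /* The folio was truncated while we held it: drop our
                     * reference and redo the page cache lookup. */
                    folio_put(folio);
                    goto repeat;
            }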
2022-06-29  filemap: Move 'filler' case to the end of do_read_cache_folio()  Matthew Wilcox (Oracle)
No functionality change intended; this simply moves code around to disentangle the function a little. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-29  filemap: Remove find_get_pages_range() and associated functions  Matthew Wilcox (Oracle)
All callers of find_get_pages_range(), pagevec_lookup_range() and pagevec_lookup() have now been removed. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-06-29  shmem: Convert shmem_unlock_mapping() to use filemap_get_folios()  Matthew Wilcox (Oracle)
This is a straightforward conversion. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-06-29  vmscan: Add check_move_unevictable_folios()  Matthew Wilcox (Oracle)
Change the guts of check_move_unevictable_pages() over to use folios as the new check_move_unevictable_folios(), and keep check_move_unevictable_pages() as a wrapper around it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-06-29  filemap: Add filemap_get_folios()  Matthew Wilcox (Oracle)
This is the equivalent of find_get_pages() but fills a folio_batch instead of an array of pages. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
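A hedged usage sketch of the new helper, assuming the shape described by this series (it fills a folio_batch and advances *start past the folios it returned):

    static void scan_range(struct address_space *mapping, pgoff_t start, pgoff_t end)
    {
            struct folio_batch fbatch;
            pgoff_t index = start;
            unsigned int i;

            folio_batch_init(&fbatch);
            while (filemap_get_folios(mapping, &index, end, &fbatch)) {
                    for (i = 0; i < folio_batch_count(&fbatch); i++) {
                            struct folio *folio = fbatch.folios[i];

                            /* operate on each folio in the range */
                            (void)folio;
                    }
                    folio_batch_release(&fbatch);
                    cond_resched();
            }
    }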
2022-06-29  filemap: Remove add_to_page_cache() and add_to_page_cache_locked()  Matthew Wilcox (Oracle)
These functions have no more users, so delete them. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
2022-06-29  hugetlb: Convert huge_add_to_page_cache() to use a folio  Matthew Wilcox (Oracle)
Remove the last caller of add_to_page_cache() Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
2022-06-29  mm: Remove __delete_from_page_cache()  Matthew Wilcox (Oracle)
This wrapper is no longer used. Remove it and all references to it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-29  mm: Account dirty folios properly during splits  Matthew Wilcox (Oracle)
If the last folio in a file is split as a result of truncation, we simply clear the dirty bits for the pages we're discarding. That causes NR_FILE_DIRTY (among other counters) to be thrown off and eventually Linux will hang in balance_dirty_pages_ratelimited() Reported-by: Dave Chinner <dchinner@redhat.com> Tested-by: Dave Chinner <dchinner@redhat.com> Tested-by: Darrick J. Wong <djwong@kernel.org> Fixes: d68eccad3706 ("mm/filemap: Allow large folios to be added to the page cache") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-28  arch/*/: remove CONFIG_VIRT_TO_BUS  Arnd Bergmann
All architecture-independent users of virt_to_bus() and bus_to_virt() have been fixed to use the dma mapping interfaces or have been removed now. This means the definitions on most architectures, and the CONFIG_VIRT_TO_BUS symbol, are now obsolete and can be removed. The only exceptions to this are a few network and scsi drivers for m68k Amiga and VME machines and ppc32 Macintosh. These drivers work correctly with the old interfaces and are probably not worth changing. On alpha and parisc, virt_to_bus() was still used in asm/floppy.h. alpha can use isa_virt_to_bus() like x86 does, and parisc can just open-code the virt_to_phys() here, as this is architecture specific code. I tried updating the bus-virt-phys-mapping.rst documentation, which started as an email from Linus to explain some details of the Linux-2.0 driver interfaces. The bits about virt_to_bus() were declared obsolete back in 2000, and the rest is not all that relevant any more, so in the end I just decided to remove the file completely. Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-06-27  docs: rename Documentation/vm to Documentation/mm  Mike Rapoport
so it will be consistent with code mm directory and with Documentation/admin-guide/mm and won't be confused with virtual machines. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Jonathan Corbet <corbet@lwn.net> Acked-by: Wu XiangCheng <bobwxc@email.cn>
2022-06-27  Merge branch 'master' into mm-stable  akpm
2022-06-27  mm: ioremap: Add ioremap/iounmap_allowed()  Kefeng Wang
Add special hooks for the architecture to verify addr, size or prot when doing ioremap() or iounmap(), which will make the generic ioremap more useful. ioremap_allowed() returns a bool: true means continue to remap, false means skip the remap and return directly. iounmap_allowed() returns a bool: true means continue to vunmap, false means skip the vunmap and return directly. Meanwhile, only vunmap the address when it is in the vmalloc area, as the generic ioremap only returns vmalloc addresses. Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Baoquan He <bhe@redhat.com> Link: https://lore.kernel.org/r/20220607125027.44946-5-wangkefeng.wang@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
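A hedged sketch of what an architecture-side override could look like; the signatures follow the description above, and the page_is_ram() policy is only an illustrative example, not what any particular architecture does (the generic fallbacks simply return true):

    /* Hypothetical arch override: refuse to ioremap() normal system RAM. */
    static inline bool ioremap_allowed(phys_addr_t phys_addr, size_t size,
                                       unsigned long prot)
    {
            return !page_is_ram(PHYS_PFN(phys_addr));
    }

    static inline bool iounmap_allowed(void *addr)
    {
            return true;    /* nothing extra to check on this arch */
    }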
2022-06-27  mm: ioremap: Setup phys_addr of struct vm_struct  Kefeng Wang
Show physical address of each ioremap in /proc/vmallocinfo. Acked-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Link: https://lore.kernel.org/r/20220607125027.44946-4-wangkefeng.wang@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-27  mm: ioremap: Use more sensible name in ioremap_prot()  Kefeng Wang
Use the more meaningful and sensible name phys_addr instead of addr in ioremap_prot(). Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220610092255.32445-1-wangkefeng.wang@huawei.com Signed-off-by: Will Deacon <will@kernel.org>
2022-06-26  Merge tag 'mm-hotfixes-stable-2022-06-26' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "Minor things, mainly - mailmap updates, MAINTAINERS updates, etc. Fixes for this merge window: - fix for a damon boot hang, from SeongJae - fix for a kfence warning splat, from Jason Donenfeld - fix for zero-pfn pinning, from Alex Williamson - fix for fallocate hole punch clearing, from Mike Kravetz Fixes for previous releases: - fix for a performance regression, from Marcelo - fix for a hwpoisining BUG from zhenwei pi" * tag 'mm-hotfixes-stable-2022-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: add entry for Christian Marangi mm/memory-failure: disable unpoison once hw error happens hugetlbfs: zero partial pages during fallocate hole punch mm: memcontrol: reference to tools/cgroup/memcg_slabinfo.py mm: re-allow pinning of zero pfns mm/kfence: select random number before taking raw lock MAINTAINERS: add maillist information for LoongArch MAINTAINERS: update MM tree references MAINTAINERS: update Abel Vesa's email MAINTAINERS: add MEMORY HOT(UN)PLUG section and add David as reviewer MAINTAINERS: add Miaohe Lin as a memory-failure reviewer mailmap: add alias for jarkko@profian.com mm/damon/reclaim: schedule 'damon_reclaim_timer' only after 'system_wq' is initialized kthread: make it clear that kthread_create_on_node() might be terminated by any fatal signal mm: lru_cache_disable: use synchronize_rcu_expedited mm/page_isolation.c: fix one kernel-doc comment
2022-06-23  filemap: Fix serialization adding transparent huge pages to page cache  Alistair Popple
Commit 793917d997df ("mm/readahead: Add large folio readahead") introduced support for using large folios for filebacked pages if the filesystem supports it. page_cache_ra_order() was introduced to allocate and add these large folios to the page cache. However adding pages to the page cache should be serialized against truncation and hole punching by taking invalidate_lock. Not doing so can lead to data races resulting in stale data getting added to the page cache and marked up-to-date. See commit 730633f0b7f9 ("mm: Protect operations adding pages to page cache with invalidate_lock") for more details. This issue was found by inspection but a testcase revealed it was possible to observe in practice on XFS. Fix this by taking invalidate_lock in page_cache_ra_order(), to mirror what is done for the non-thp case in page_cache_ra_unbounded(). Signed-off-by: Alistair Popple <apopple@nvidia.com> Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
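The shape of the fix is the usual invalidate_lock bracket around the page cache insertion; a hedged sketch (shared mode, mirroring what page_cache_ra_unbounded() already does):

    filemap_invalidate_lock_shared(mapping);
    /* allocate the large folios and add them to the page cache;
     * truncation and hole punching can no longer race with the insertion */
    filemap_invalidate_unlock_shared(mapping);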
2022-06-23  mm: Clear page->private when splitting or migrating a page  Matthew Wilcox (Oracle)
In our efforts to remove uses of PG_private, we have found folios with the private flag clear and folio->private not-NULL. That is the root cause behind 642d51fb0775 ("ceph: check folio PG_private bit instead of folio->private"). It can also affect a few other filesystems that haven't yet reported a problem. compaction_alloc() can return a page with uninitialised page->private, and rather than checking all the callers of migrate_pages(), just zero page->private after calling get_new_page(). Similarly, the tail pages from split_huge_page() may also have an uninitialised page->private. Reported-by: Xiubo Li <xiubli@redhat.com> Tested-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
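A hedged sketch of the kind of zeroing described (placement approximate; get_new_page is the migration allocation callback named in the text above):

    newpage = get_new_page(page, private);
    if (newpage)
            /* the callback (e.g. compaction_alloc()) may hand back a page
             * with stale page->private; never let it leak to the new user */
            set_page_private(newpage, 0);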
2022-06-20  filemap: Handle sibling entries in filemap_get_read_batch()  Matthew Wilcox (Oracle)
If a read races with an invalidation followed by another read, it is possible for a folio to be replaced with a higher-order folio. If that happens, we'll see a sibling entry for the new folio in the next iteration of the loop. This manifests as a NULL pointer dereference while holding the RCU read lock. Handle this by simply returning. The next call will find the new folio and handle it correctly. The other ways of handling this rare race are more complex and it's just not worth it. Reported-by: Dave Chinner <david@fromorbit.com> Reported-by: Brian Foster <bfoster@redhat.com> Debugged-by: Brian Foster <bfoster@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Fixes: cbd59c48ae2b ("mm/filemap: use head pages in generic_file_buffered_read") Cc: stable@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-20  filemap: Correct the conditions for marking a folio as accessed  Matthew Wilcox (Oracle)
We had an off-by-one error which meant that we never marked the first page in a read as accessed. This was visible as a slowdown when re-reading a file as pages were being evicted from cache too soon. In reviewing this code, we noticed a second bug where a multi-page folio would be marked as accessed multiple times when doing reads that were less than the size of the folio. Abstract the comparison of whether two file positions are in the same folio into a new function, fixing both of these bugs. Reported-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
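A hedged sketch of the abstraction described above (helper name assumed): two file positions fall in the same folio exactly when they share the folio-sized block index.

    static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio)
    {
            return (pos1 >> folio_shift(folio)) == (pos2 >> folio_shift(folio));
    }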
2022-06-20  Merge tag 'slab-for-5.19-fixup' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fixes from Vlastimil Babka: - A slub fix for PREEMPT_RT locking semantics from Sebastian. - A slub fix for state corruption due to a possible race scenario from Jann. * tag 'slab-for-5.19-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slub: add missing TID updates on slab deactivation mm/slub: Move the stackdepot related allocation out of IRQ-off section.
2022-06-17  Merge tag 'fs_for_v5.19-rc3' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull writeback and ext2 fixes from Jan Kara: "A fix for writeback bug which prevented machines with kdevtmpfs from booting and also one small ext2 bugfix in IO error handling" * tag 'fs_for_v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: init: Initialize noop_backing_dev_info early ext2: fix fs corruption when trying to remove a non-empty directory with IO error
2022-06-16  mm/kmemleak: prevent soft lockup in first object iteration loop of ↵  Waiman Long
kmemleak_scan() The first RCU-based object iteration loop has to modify the object count, so we cannot skip taking the object lock. One way to avoid soft lockup is to insert occasional cond_resched() calls into the loop. This cannot be done while holding the RCU read lock, which protects objects from being freed. However, taking a reference to the object will prevent it from being freed. We can then do a cond_resched() call after every 64k objects safely. Link: https://lkml.kernel.org/r/20220614220359.59282-4-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
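A hedged sketch of the pattern (get_object()/put_object() are kmemleak-internal helpers that pin an object so it cannot be freed while we are outside the RCU read-side critical section; counter and placement simplified):

    if (!(++loop_cnt & 0xffff) && get_object(object)) {
            /* object is pinned: safe to leave the RCU section and yield */
            rcu_read_unlock();
            cond_resched();
            rcu_read_lock();
            put_object(object);
    }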
2022-06-16  mm/kmemleak: skip unlikely objects in kmemleak_scan() without taking lock  Waiman Long
There are 3 RCU-based object iteration loops in kmemleak_scan(). Because of the need to take the RCU read lock, we can't insert cond_resched() into the loop like other parts of the function. As there can be millions of objects to be scanned, it takes a while to iterate all of them. The kmemleak functionality is usually enabled in a debug kernel which is much slower than a non-debug kernel. With a sufficient number of kmemleak objects, the time to iterate them all may exceed 22s, causing a soft lockup. watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kmemleak:625] In this particular bug report, the soft lockup happened in the 2nd iteration loop. In the 2nd and 3rd loops, most of the objects are checked and then skipped under the object lock. Only a select few are modified. Those objects certainly need lock protection. However, the lock/unlock operation is slow, especially with interrupt disabling and enabling included. We can actually do some basic checks like color_white() without taking the lock and skip the object accordingly. Of course, this kind of check is racy and may miss objects that are being modified concurrently. The cost of missed objects, however, is just that they will be discovered in the next scan instead. The advantage of doing so is that iteration can be done much faster, especially with LOCKDEP enabled in a debug kernel. With a debug kernel running on a 2-socket 96-thread x86-64 system (HZ=1000), the 2nd and 3rd iteration loop speedup with this patch on the first kmemleak_scan() call after bootup is shown in the table below.

              Before patch                After patch
  Loop #  # of objects  Elapsed time  # of objects  Elapsed time
  ------  ------------  ------------  ------------  ------------
    2       2,599,850      2.392s       2,596,364      0.266s
    3       2,600,176      2.171s       2,597,061      0.260s

This patch reduces loop iteration times by about 88%. This will greatly reduce the chance of a soft lockup happening in the 2nd or 3rd iteration loops. Even though the first loop runs a little bit faster, it can still be problematic if many kmemleak objects are there. As the object count has to be modified in every object, we cannot avoid taking the object lock. So another way to prevent soft lockup will be needed. Link: https://lkml.kernel.org/r/20220614220359.59282-3-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
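A hedged sketch of the lockless screening in the 2nd/3rd loops (color_white() is a kmemleak-internal predicate; the unlocked test is racy on purpose, and an object it misses is simply picked up by the next scan):

    if (!color_white(object))
            continue;               /* skip without taking object->lock */
    raw_spin_lock_irq(&object->lock);
    /* re-check under the lock before actually modifying the object */
    if (color_white(object) && (object->flags & OBJECT_ALLOCATED))
            /* update the object */;
    raw_spin_unlock_irq(&object->lock);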
2022-06-16  mm/kmemleak: use _irq lock/unlock variants in kmemleak_scan/_clear()  Waiman Long
Patch series "mm/kmemleak: Avoid soft lockup in kmemleak_scan()", v2. There are 3 RCU-based object iteration loops in kmemleak_scan(). Because of the need to take RCU read lock, we can't insert cond_resched() into the loop like other parts of the function. As there can be millions of objects to be scanned, it takes a while to iterate all of them. The kmemleak functionality is usually enabled in a debug kernel which is much slower than a non-debug kernel. With sufficient number of kmemleak objects, the time to iterate them all may exceed 22s causing soft lockup. watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kmemleak:625] This patch series make changes to the 3 object iteration loops in kmemleak_scan() to prevent them from causing soft lockup. This patch (of 3): kmemleak_scan() is called only from the kmemleak scan thread or from write to the kmemleak debugfs file. Both are in task context and so we can directly use the simpler _irq() lock/unlock calls instead of the more complex _irqsave/_irqrestore variants. Similarly, kmemleak_clear() is called only from write to the kmemleak debugfs file. The same change can be applied. Link: https://lkml.kernel.org/r/20220614220359.59282-1-longman@redhat.com Link: https://lkml.kernel.org/r/20220614220359.59282-2-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16  mm/sparse-vmemmap.c: remove unwanted initialization in ↵  Gautam Menghani
vmemmap_populate_compound_pages() Remove unnecessary initialization for the variable 'next'. This fixes the clang scan warning: Value stored to 'next' during its initialization is never read [deadcode.DeadStores] Link: https://lkml.kernel.org/r/20220612182320.160651-1-gautammenghani201@gmail.com Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16  mm: kmem: make mem_cgroup_from_obj() vmalloc()-safe  Roman Gushchin
Currently mem_cgroup_from_obj() is not working properly with objects allocated using vmalloc(). It creates problems in some cases, when it's called for static objects belonging to modules or generally allocated using vmalloc(). This patch makes mem_cgroup_from_obj() safe to be called on objects allocated using vmalloc(). It also introduces mem_cgroup_from_slab_obj(), which is a faster version to use in places when we know the object is either a slab object or a generic slab page (e.g. when adding an object to a lru list). Link: https://lkml.kernel.org/r/20220610180310.1725111-1-roman.gushchin@linux.dev Suggested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Acked-by: Shakeel Butt <shakeelb@google.com> Tested-by: Vasily Averin <vvs@openvz.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Qian Cai <quic_qiancai@quicinc.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
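A hedged sketch of the distinction that makes the lookup vmalloc()-safe (the real helpers then read the memcg from the slab or page data):

    struct page *page;

    /* vmalloc()ed addresses cannot go through virt_to_page(); resolve
     * them explicitly before looking up the owning memcg */
    if (unlikely(is_vmalloc_addr(p)))
            page = vmalloc_to_page(p);
    else
            page = virt_to_head_page(p);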
2022-06-16  mm/memremap: fix memunmap_pages() race with get_dev_pagemap()  Miaohe Lin
Think about the below scene:

  CPU1                                           CPU2
  memunmap_pages
    percpu_ref_exit
      __percpu_ref_exit
        free_percpu(percpu_count);
        /* percpu_count is freed here! */
                                                 get_dev_pagemap
                                                   xa_load(&pgmap_array, PHYS_PFN(phys))
                                                   /* pgmap still in the pgmap_array */
                                                   percpu_ref_tryget_live(&pgmap->ref)
                                                     if __ref_is_percpu
                                                       /* __PERCPU_REF_ATOMIC_DEAD not set yet */
                                                       this_cpu_inc(*percpu_count)
                                                       /* access freed percpu_count here! */
        ref->percpu_count_ptr = __PERCPU_REF_ATOMIC_DEAD;
        /* too late... */
    pageunmap_range

To fix the issue, do percpu_ref_exit() after pgmap_array is emptied. So we won't do percpu_ref_tryget_live() against a percpu_ref that is being freed. Link: https://lkml.kernel.org/r/20220609121305.2508-1-linmiaohe@huawei.com Fixes: b7b3c01b1915 ("mm/memremap_pages: support multiple ranges per invocation") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
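A hedged sketch of the fixed ordering only (not the literal function body): unpublish the pgmap from the xarray first so get_dev_pagemap() can no longer find it, and only then tear the percpu_ref down.

    /* 1. remove the pgmap from pgmap_array: new lookups now fail */
    xa_store_range(&pgmap_array, PHYS_PFN(range->start),
                   PHYS_PFN(range->end), NULL, GFP_KERNEL);
    /* 2. only now is it safe to free the percpu counters */
    percpu_ref_exit(&pgmap->ref);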
2022-06-16  mm: kmemleak: check physical address when scanning  Patrick Wang
Check the physical address of objects against the lowmem boundaries at scan time instead of in kmemleak_*_phys(). Link: https://lkml.kernel.org/r/20220611035551.1823303-5-patrick.wang.shcn@gmail.com Fixes: 23c2d497de21 ("mm: kmemleak: take a full lowmem check in kmemleak_*_phys()") Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Yee Lee <yee.lee@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16  mm: kmemleak: add rbtree and store physical address for objects allocated ↵  Patrick Wang
with PA Add object_phys_tree_root to store the objects allocated with a physical address. Distinguish it from object_tree_root by the OBJECT_PHYS flag or a function argument. The physical address is stored directly in those objects. Link: https://lkml.kernel.org/r/20220611035551.1823303-4-patrick.wang.shcn@gmail.com Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Yee Lee <yee.lee@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16  mm: kmemleak: add OBJECT_PHYS flag for objects allocated with physical address  Patrick Wang
Add an OBJECT_PHYS flag for objects. This flag is used to identify the objects allocated with a physical address. The create_object_phys() function is added as well to set that flag and is used by kmemleak_alloc_phys(). Link: https://lkml.kernel.org/r/20220611035551.1823303-3-patrick.wang.shcn@gmail.com Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Yee Lee <yee.lee@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16  mm: kmemleak: remove kmemleak_not_leak_phys() and the min_count argument to ↵  Patrick Wang
kmemleak_alloc_phys() Patch series "mm: kmemleak: store objects allocated with physical address separately and check when scan", v4. The kmemleak_*_phys() interface uses "min_low_pfn" and "max_low_pfn" to check address. But on some architectures, kmemleak_*_phys() is called before those two variables initialized. The following steps will be taken: 1) Add OBJECT_PHYS flag and rbtree for the objects allocated with physical address 2) Store physical address in objects if allocated with OBJECT_PHYS 3) Check the boundary when scan instead of in kmemleak_*_phys() This patch set will solve: https://lore.kernel.org/r/20220527032504.30341-1-yee.lee@mediatek.com https://lore.kernel.org/r/9dd08bb5-f39e-53d8-f88d-bec598a08c93@gmail.com v3: https://lore.kernel.org/r/20220609124950.1694394-1-patrick.wang.shcn@gmail.com v2: https://lore.kernel.org/r/20220603035415.1243913-1-patrick.wang.shcn@gmail.com v1: https://lore.kernel.org/r/20220531150823.1004101-1-patrick.wang.shcn@gmail.com This patch (of 4): Remove the unused kmemleak_not_leak_phys() function. And remove the min_count argument to kmemleak_alloc_phys() function, assume it's 0. Link: https://lkml.kernel.org/r/20220611035551.1823303-1-patrick.wang.shcn@gmail.com Link: https://lkml.kernel.org/r/20220611035551.1823303-2-patrick.wang.shcn@gmail.com Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Yee Lee <yee.lee@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>