path: root/mm
Age        Commit message        Author
2024-10-28    tmpfs: Expose filesystem features via sysfs    André Almeida
Expose filesystem features through sysfs, so userspace can query whether tmpfs supports casefold. This follows the same setup as defined by ext4 and f2fs to expose casefold support to userspace. Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-8-f443d5814194@igalia.com Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28    tmpfs: Add flag FS_CASEFOLD_FL support for tmpfs dirs    André Almeida
Enable setting the flag FS_CASEFOLD_FL for tmpfs directories, when tmpfs is mounted with casefold support. A special check is needed for this flag, since it can't be set for non-empty directories. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-7-f443d5814194@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
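For context, a hedged sketch of how userspace can set this flag (not part of the patch; this is the generic FS_IOC_SETFLAGS path that chattr +F uses, and the helper name is hypothetical):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    /* hypothetical helper for illustration only */
    int make_dir_casefolded(const char *path)
    {
            int attr, ret = -1;
            int fd = open(path, O_RDONLY | O_DIRECTORY);

            if (fd < 0)
                    return -1;
            if (ioctl(fd, FS_IOC_GETFLAGS, &attr) == 0) {
                    attr |= FS_CASEFOLD_FL;   /* rejected unless the directory is empty */
                    ret = ioctl(fd, FS_IOC_SETFLAGS, &attr);
            }
            close(fd);
            return ret;
    }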
2024-10-28    tmpfs: Add casefold lookup support    André Almeida
Enable casefold lookup in tmpfs, based on the encoding defined by userspace. That means that instead of comparing a file name byte by byte, it is compared against a case-insensitive equivalent of the Unicode string.

* Dcache handling

There is a special need when dealing with case-insensitive dentries. First of all, we currently invalidate every negative casefold dentry. That happens because the VFS currently has no proper support for them, given that it could incorrectly reuse a previous filename for a new file that has a casefold match. For instance, this could happen:

$ mkdir DIR
$ rm -r DIR
$ mkdir dir
$ ls DIR/

and this would be perceived as an inconsistency from the userspace point of view, because even though we match files in a case-insensitive manner, we still honor whatever the initial filename was.

Along with that, tmpfs stores only the first equivalent-name dentry used in the dcache, preventing duplicate dentries in the dcache. The d_compare() version for casefold files uses a normalized string, so the filename under lookup is compared against a normalized string for the existing file, achieving a casefolded lookup.

* Enabling casefold via mount options

Most filesystems have their data stored on disk, so the casefold option needs to be enabled when building a filesystem on a device (via mkfs). However, as tmpfs is a RAM-backed filesystem, there is no on-disk information and thus no mkfs to store the casefold setting. For tmpfs, create casefold mount options instead. Userspace can then enable casefold support for a mount point using:

$ mount -t tmpfs -o casefold=utf8-12.1.0 fs_name mount_dir/

Userspace must specify which Unicode standard it is aiming for; the available options depend on what the kernel Unicode subsystem supports. And for strict encoding:

$ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name mount_dir/

Strict encoding means that tmpfs will refuse to create invalid UTF-8 sequences. When this option is not enabled, any invalid sequence is treated as an opaque byte sequence, ignoring the encoding, and thus cannot be looked up in a case-insensitive way.

* Check for casefold dirs on simple_lookup()

On simple_lookup(), do not create dentries for casefold directories. Currently, the VFS does not support case-insensitive negative dentries, which can create inconsistencies in the filesystem. Prevent such dentries from being created in the first place.

Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-6-f443d5814194@igalia.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28    memblock: uniformly initialize all reserved pages to MIGRATE_MOVABLE    Hua Su
Currently, when CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set, the reserved pages are initialized to MIGRATE_MOVABLE by default in memmap_init. Reserved memory mainly stores the metadata of struct pages. When HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=Y and hugepages are allocated, HVO will remap the vmemmap virtual address range to the page which vmemmap_reuse is mapped to, and the pages previously mapping the range will be freed to the buddy system.

Before this patch: when CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set, the freed memory was placed on the Movable list; when CONFIG_DEFERRED_STRUCT_PAGE_INIT=Y, the freed memory was placed on the Unmovable list. After this patch, the freed memory is placed on the Movable list regardless of whether CONFIG_DEFERRED_STRUCT_PAGE_INIT is set.

E.g., tested on a virtual machine (1000GB) with an Intel(R) Xeon(R) Platinum 8358P CPU. After VM start:

echo 500000 > /proc/sys/vm/nr_hugepages
cat /proc/meminfo | grep -i huge
HugePages_Total:   500000
HugePages_Free:    500000
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
Hugetlb:         1024000000 kB

cat /proc/pagetypeinfo, before:
Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 …
Node    0, zone   Normal, type    Unmovable     51      2      1     28     53     35     35     43     40     69   3852
Node    0, zone   Normal, type      Movable   6485   4610    666    202    200    185    208     87     54      2    240
Node    0, zone   Normal, type  Reclaimable      2      2      1     23     13      1      2      1      0      1      0
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Unmovable ≈ 15GB

cat /proc/pagetypeinfo, after:
Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 …
Node    0, zone   Normal, type    Unmovable      0      1      1      0      0      0      0      1      1      1      0
Node    0, zone   Normal, type      Movable   1563   4107   1119    189    256    368    286    132    109      4   3841
Node    0, zone   Normal, type  Reclaimable      2      2      1     23     13      1      2      1      0      1      0
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Signed-off-by: Hua Su <suhua.tanke@gmail.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Link: https://lore.kernel.org/r/20241021051151.4664-1-suhua.tanke@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2024-10-24    Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf    Alexei Starovoitov
Cross-merge bpf fixes after downstream PR.

No conflicts.

Adjacent changes in:
  include/linux/bpf.h
  include/uapi/linux/bpf.h
  kernel/bpf/btf.c
  kernel/bpf/helpers.c
  kernel/bpf/syscall.c
  kernel/bpf/verifier.c
  kernel/trace/bpf_trace.c
  mm/slab_common.c
  tools/include/uapi/linux/bpf.h
  tools/testing/selftests/bpf/Makefile

Link: https://lore.kernel.org/all/20241024215724.60017-1-daniel@iogearbox.net/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-22    mm/page-writeback.c: Fix comment of wb_domain_writeout_add()    Tang Yizhou
__bdi_writeout_inc() has undergone multiple renamings, but the comment within the function body has not been updated accordingly. Update it to reflect the latest name, wb_domain_writeout_add(). Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Link: https://lore.kernel.org/r/20241009151728.300477-3-yizhou.tang@shopee.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22    mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL    Tang Yizhou
The name of the BANDWIDTH_INTERVAL macro is misleading, as it is not only used in the bandwidth update functions wb_update_bandwidth() and __wb_update_bandwidth(), but also in the dirty limit update function domain_update_dirty_limit(). Currently, we haven't found an ideal name, so update the comment only. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Link: https://lore.kernel.org/r/20241009151728.300477-2-yizhou.tang@shopee.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-21    LoongArch: Set initial pte entry with PAGE_GLOBAL for kernel space    Bibo Mao
There are two pages in one TLB entry on LoongArch systems. For kernel space, both PTE entries (buddies) must have the PAGE_GLOBAL bit set, otherwise the hardware treats the entry as a non-global TLB entry, and there will be potential problems if a TLB entry for kernel space is not global -- such as failing to flush kernel TLB entries with local_flush_tlb_kernel_range(), which is supposed to flush only TLB entries with the global bit set.

Kernel address space areas include the percpu, vmalloc, vmemmap, fixmap and kasan areas. For these areas, both of the two consecutive page table entries should have the PAGE_GLOBAL bit set. So in set_pte() and pte_clear(), the PTE buddy entry is checked and set in addition to the entry itself. However, setting both PTE entries is not an atomic operation, and this causes problems with the test_vmalloc test case.

So the function kernel_pte_init() is added to initialize a PTE table when it is created for kernel address space, with the default initial PTE value being PAGE_GLOBAL rather than zero. Then only the entry itself needs updating in set_pte() and pte_clear(); nothing needs to be done with the PTE buddy entry.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
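A minimal sketch of what such an initializer could look like (illustrative only; the actual LoongArch implementation in the patch may differ, e.g. by unrolling the loop):

    void kernel_pte_init(void *addr)
    {
            unsigned long *p = (unsigned long *)addr;
            int i;

            /* Pre-fill every slot of the new kernel PTE table with _PAGE_GLOBAL
             * instead of zero, so set_pte()/pte_clear() never have to touch the
             * buddy entry. */
            for (i = 0; i < PTRS_PER_PTE; i++)
                    p[i] = _PAGE_GLOBAL;
    }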
2024-10-18    mm: fix follow_pfnmap API lockdep assert    Linus Torvalds
The lockdep asserts for the new follow_pfnmap() API "know" that a pfnmap always has a vma->vm_file, since that's the only way to create such a mapping. And that's actually true for all the normal cases. But not for the mmap failure case, where the incomplete mapping is torn down and we have cleared vma->vm_file because the failure occurred before the file was linked to the vma. So this codepath does actually need to check for vm_file being NULL. Reported-by: Jann Horn <jannh@google.com> Fixes: 6da8e9634bb7 ("mm: new follow_pfnmap API") Cc: Peter Xu <peterx@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
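A hedged sketch of the kind of check being described (modeled on mm/memory.c, but treat names and details as approximate rather than the exact hunk):

    static inline void pfnmap_lockdep_assert(struct vm_area_struct *vma)
    {
    #ifdef CONFIG_LOCKDEP
            struct address_space *mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;

            /* Only assert on the file rmap lock when the vma still has a vm_file;
             * a torn-down mapping from a failed mmap() may have vm_file == NULL. */
            if (mapping)
                    lockdep_assert(lockdep_is_held(&mapping->i_mmap_rwsem) ||
                                   lockdep_is_held(&vma->vm_mm->mmap_lock));
            else
                    lockdep_assert(lockdep_is_held(&vma->vm_mm->mmap_lock));
    #endif
    }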
2024-10-18    mm: Use str_on_off() helper function in report_meminit()    Thorsten Blum
Remove hard-coded strings by using the helper function str_on_off(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://lore.kernel.org/r/20241018103150.96824-2-thorsten.blum@linux.dev Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
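For reference, a hedged sketch of what such a conversion looks like (simplified; the real report_meminit() in mm/mm_init.c reports more fields than shown here):

    #include <linux/string_choices.h>

    static void report_meminit(void)
    {
            /* str_on_off() maps a bool to the literal strings "on"/"off",
             * replacing open-coded `cond ? "on" : "off"` ternaries. */
            pr_info("mem auto-init: clearing freed pages: %s\n",
                    str_on_off(want_init_on_free()));
    }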
2024-10-17    mm/mglru: only clear kswapd_failures if reclaimable    Wei Xu
lru_gen_shrink_node() unconditionally clears kswapd_failures, which can prevent kswapd from sleeping and cause 100% kswapd cpu usage even when kswapd repeatedly fails to make progress in reclaim. Only clear kswapd_failures in lru_gen_shrink_node() if reclaim makes some progress, similar to shrink_node(). I happened to run into this problem in one of my tests recently. It requires a combination of several conditions: the allocator needs to allocate the right amount of pages such that it can wake up kswapd without itself being OOM killed; there is no memory for kswapd to reclaim (my test disables swap and cleans the page cache first); and no other process frees enough memory at the same time. Link: https://lkml.kernel.org/r/20241014221211.832591-1-weixugc@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Wei Xu <weixugc@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens <heftig@archlinux.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17    mm/swapfile: skip HugeTLB pages for unuse_vma    Liu Shixin
I got a bad pud error and lost a 1GB HugeTLB page when calling swapoff. The problem can be reproduced by the following steps:

1. Allocate an anonymous 1GB HugeTLB page and some other anonymous memory.
2. Swap out the above anonymous memory.
3. Run swapoff and we will get a bad pud error in the kernel log:

   mm/pgtable-generic.c:42: bad pud 00000000743d215d(84000001400000e7)

With ftrace we can tell that pud_clear_bad() is called by pud_none_or_clear_bad() in unuse_pud_range(). As a result the HugeTLB page will never be freed, because we lost it from the page table. Fix this by skipping HugeTLB pages in unuse_vma().

Link: https://lkml.kernel.org/r/20241015014521.570237-1-liushixin2@huawei.com
Fixes: 0fe6e20b9c4c ("hugetlb, rmap: add reverse mapping for hugepage")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
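A hedged sketch of the shape such a fix can take in mm/swapfile.c (illustrative, not the exact hunk from the patch):

    static int unuse_mm(struct mm_struct *mm, unsigned int type)
    {
            struct vm_area_struct *vma;
            int ret = 0;
            VMA_ITERATOR(vmi, mm, 0);

            mmap_read_lock(mm);
            for_each_vma(vmi, vma) {
                    /* hugetlb entries are not handled by the pte/pmd/pud walk in
                     * unuse_vma(), so skip hugetlb VMAs instead of corrupting them */
                    if (vma->anon_vma && !is_vm_hugetlb_page(vma)) {
                            ret = unuse_vma(vma, type);
                            if (ret)
                                    break;
                    }
                    cond_resched();
            }
            mmap_read_unlock(mm);
            return ret;
    }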
2024-10-17    mm: swap: prevent possible data-race in __try_to_reclaim_swap    Jeongjun Park
A report [1] was uploaded from syzbot. In the previous commit 862590ac3708 ("mm: swap: allow cache reclaim to skip slot cache"), the __try_to_reclaim_swap() function reads offset and folio->entry from folio without folio_lock protection. In the currently reported KCSAN log, it is assumed that the actual data-race will not occur because the calltrace that does WRITE already obtains the folio_lock and then writes. However, the existing __try_to_reclaim_swap() function was already implemented to perform reads under folio_lock protection [1], and there is a risk of a data-race occurring through a function other than the one shown in the KCSAN log. Therefore, I think it is appropriate to change read operations for folio to be performed under folio_lock.

[1]
==================================================================
BUG: KCSAN: data-race in __delete_from_swap_cache / __try_to_reclaim_swap

write to 0xffffea0004c90328 of 8 bytes by task 5186 on cpu 0:
 __delete_from_swap_cache+0x1f0/0x290 mm/swap_state.c:163
 delete_from_swap_cache+0x72/0xe0 mm/swap_state.c:243
 folio_free_swap+0x1d8/0x1f0 mm/swapfile.c:1850
 free_swap_cache mm/swap_state.c:293 [inline]
 free_pages_and_swap_cache+0x1fc/0x410 mm/swap_state.c:325
 __tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline]
 tlb_batch_pages_flush mm/mmu_gather.c:149 [inline]
 tlb_flush_mmu_free mm/mmu_gather.c:366 [inline]
 tlb_flush_mmu+0x2cf/0x440 mm/mmu_gather.c:373
 zap_pte_range mm/memory.c:1700 [inline]
 zap_pmd_range mm/memory.c:1739 [inline]
 zap_pud_range mm/memory.c:1768 [inline]
 zap_p4d_range mm/memory.c:1789 [inline]
 unmap_page_range+0x1f3c/0x22d0 mm/memory.c:1810
 unmap_single_vma+0x142/0x1d0 mm/memory.c:1856
 unmap_vmas+0x18d/0x2b0 mm/memory.c:1900
 exit_mmap+0x18a/0x690 mm/mmap.c:1864
 __mmput+0x28/0x1b0 kernel/fork.c:1347
 mmput+0x4c/0x60 kernel/fork.c:1369
 exit_mm+0xe4/0x190 kernel/exit.c:571
 do_exit+0x55e/0x17f0 kernel/exit.c:926
 do_group_exit+0x102/0x150 kernel/exit.c:1088
 get_signal+0xf2a/0x1070 kernel/signal.c:2917
 arch_do_signal_or_restart+0x95/0x4b0 arch/x86/kernel/signal.c:337
 exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
 exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
 __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
 syscall_exit_to_user_mode+0x59/0x130 kernel/entry/common.c:218
 do_syscall_64+0xd6/0x1c0 arch/x86/entry/common.c:89
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffffea0004c90328 of 8 bytes by task 5189 on cpu 1:
 __try_to_reclaim_swap+0x9d/0x510 mm/swapfile.c:198
 free_swap_and_cache_nr+0x45d/0x8a0 mm/swapfile.c:1915
 zap_pte_range mm/memory.c:1656 [inline]
 zap_pmd_range mm/memory.c:1739 [inline]
 zap_pud_range mm/memory.c:1768 [inline]
 zap_p4d_range mm/memory.c:1789 [inline]
 unmap_page_range+0xcf8/0x22d0 mm/memory.c:1810
 unmap_single_vma+0x142/0x1d0 mm/memory.c:1856
 unmap_vmas+0x18d/0x2b0 mm/memory.c:1900
 exit_mmap+0x18a/0x690 mm/mmap.c:1864
 __mmput+0x28/0x1b0 kernel/fork.c:1347
 mmput+0x4c/0x60 kernel/fork.c:1369
 exit_mm+0xe4/0x190 kernel/exit.c:571
 do_exit+0x55e/0x17f0 kernel/exit.c:926
 __do_sys_exit kernel/exit.c:1055 [inline]
 __se_sys_exit kernel/exit.c:1053 [inline]
 __x64_sys_exit+0x1f/0x20 kernel/exit.c:1053
 x64_sys_call+0x2d46/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:61
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x0000000000000242 -> 0x0000000000000000

Link: https://lkml.kernel.org/r/20241007070623.23340-1-aha310510@gmail.com
Reported-by: syzbot+fa43f1b63e3aa6f66329@syzkaller.appspotmail.com
Fixes: 862590ac3708 ("mm: swap: allow cache reclaim to skip slot cache")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17    mm: khugepaged: fix the incorrect statistics when collapsing large file folios    Baolin Wang
Khugepaged already supports collapsing file large folios (including shmem mTHP) by commit 7de856ffd007 ("mm: khugepaged: support shmem mTHP collapse"), and the control parameters in khugepaged: 'khugepaged_max_ptes_swap' and 'khugepaged_max_ptes_none', still compare based on PTE granularity to determine whether a file collapse is needed. However, the statistics for 'present' and 'swap' in hpage_collapse_scan_file() do not take into account the large folios, which may lead to incorrect judgments regarding the khugepaged_max_ptes_swap/none parameters, resulting in unnecessary file collapses. To fix this issue, take into account the large folios' statistics for 'present' and 'swap' variables in the hpage_collapse_scan_file(). Link: https://lkml.kernel.org/r/c76305d96d12d030a1a346b50503d148364246d2.1728901391.git.baolin.wang@linux.alibaba.com Fixes: 7de856ffd007 ("mm: khugepaged: support shmem mTHP collapse") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17    mm: don't install PMD mappings when THPs are disabled by the hw/process/vma    David Hildenbrand
We (or rather, readahead logic :) ) might be allocating a THP in the pagecache and then try mapping it into a process that explicitly disabled THP: we might end up installing PMD mappings. This is a problem for s390x KVM, which explicitly remaps all PMD-mapped THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before starting the VM. For example, starting a VM backed by a file on a filesystem with large folio support makes the VM crash when it tries accessing such a mapping using KVM. Is it also a problem when the HW disabled THP using TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case without X86_FEATURE_PSE. In the future, we might be able to do better on s390x and only disallow PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED really wants. For now, fix it by essentially performing the same check as would be done in __thp_vma_allowable_orders() or in shmem code, where this works as expected, and disallow PMD mappings, making us fallback to PTE mappings. Link: https://lkml.kernel.org/r/20241011102445.934409-3-david@redhat.com Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Leo Fu <bfu@redhat.com> Tested-by: Thomas Huth <thuth@redhat.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17    mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw()    Kefeng Wang
Patch series "mm: don't install PMD mappings when THPs are disabled by the hw/process/vma". During testing, it was found that we can get PMD mappings in processes where THP (and more precisely, PMD mappings) are supposed to be disabled. While it works as expected for anon+shmem, the pagecache is the problematic bit. For s390 KVM this currently means that a VM backed by a file located on filesystem with large folio support can crash when KVM tries accessing the problematic page, because the readahead logic might decide to use a PMD-sized THP and faulting it into the page tables will install a PMD mapping, something that s390 KVM cannot tolerate. This might also be a problem with HW that does not support PMD mappings, but I did not try reproducing it. Fix it by respecting the ways to disable THPs when deciding whether we can install a PMD mapping. khugepaged should already be taking care of not collapsing if THPs are effectively disabled for the hw/process/vma. This patch (of 2): Add vma_thp_disabled() and thp_disabled_by_hw() helpers to be shared by shmem_allowable_huge_orders() and __thp_vma_allowable_orders(). [david@redhat.com: rename to vma_thp_disabled(), split out thp_disabled_by_hw() ] Link: https://lkml.kernel.org/r/20241011102445.934409-2-david@redhat.com Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Leo Fu <bfu@redhat.com> Tested-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Boqiao Fu <bfu@redhat.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17    mm: khugepaged: fix the arguments order in khugepaged_collapse_file trace point    Yang Shi
The "addr" and "is_shmem" arguments have a different order in TP_PROTO and TP_ARGS. This resulted in incorrect trace output:

text-hugepage-644429 [276] 392092.878683: mm_khugepaged_collapse_file: mm=0xffff20025d52c440, hpage_pfn=0x200678c00, index=512, addr=1, is_shmem=0, filename=text-hugepage, nr=512, result=failed

The value of "addr" is wrong because it was treated as a bool value, the type of is_shmem. Fix the order in TP_PROTO to keep "addr" before "is_shmem", since the original patch review suggested this order to achieve the best packing. And use "lx" for "addr" instead of "ld" in TP_printk, because an address is typically shown in hex. After the fix, the trace result looks correct:

text-hugepage-7291 [004] 128.627251: mm_khugepaged_collapse_file: mm=0xffff0001328f9500, hpage_pfn=0x20016ea00, index=512, addr=0x400000, is_shmem=0, filename=text-hugepage, nr=512, result=failed

Link: https://lkml.kernel.org/r/20241012011702.1084846-1-yang@os.amperecomputing.com
Fixes: 4c9473e87e75 ("mm/khugepaged: add tracepoint to collapse_file()")
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Cc: Gautam Menghani <gautammenghani201@gmail.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org> [6.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17    mm/damon/tests/sysfs-kunit.h: fix memory leak in damon_sysfs_test_add_targets()    Jinjie Ruan
The sysfs_target->regions allocated in damon_sysfs_regions_alloc() is not freed in damon_sysfs_test_add_targets(), which causes the following memory leak; free it to fix the leak.

unreferenced object 0xffffff80c2a8db80 (size 96):
  comm "kunit_try_catch", pid 187, jiffies 4294894363
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 0):
    [<0000000001e3714d>] kmemleak_alloc+0x34/0x40
    [<000000008e6835c1>] __kmalloc_cache_noprof+0x26c/0x2f4
    [<000000001286d9f8>] damon_sysfs_test_add_targets+0x1cc/0x738
    [<0000000032ef8f77>] kunit_try_run_case+0x13c/0x3ac
    [<00000000f3edea23>] kunit_generic_run_threadfn_adapter+0x80/0xec
    [<00000000adf936cf>] kthread+0x2e8/0x374
    [<0000000041bb1628>] ret_from_fork+0x10/0x20

Link: https://lkml.kernel.org/r/20241010125323.3127187-1-ruanjinjie@huawei.com
Fixes: b8ee5575f763 ("mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()")
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17    mm: remove unused stub for can_swapin_thp()    Andy Shevchenko
When can_swapin_thp() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: mm/memory.c:4184:20: error: unused function 'can_swapin_thp' [-Werror,-Wunused-function] Fix this by removing the unused stub. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Link: https://lkml.kernel.org/r/20241008191329.2332346-1-andriy.shevchenko@linux.intel.com Fixes: 242d12c98174 ("mm: support large folios swap-in for sync io devices") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Barry Song <baohua@kernel.org> Cc: Bill Wendling <morbo@google.com> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17    mm/mremap: fix move_normal_pmd/retract_page_tables race    Jann Horn
In mremap(), move_page_tables() looks at the type of the PMD entry and the specified address range to figure out by which method the next chunk of page table entries should be moved. At that point, the mmap_lock is held in write mode, but no rmap locks are held yet. For PMD entries that point to page tables and are fully covered by the source address range, move_pgt_entry(NORMAL_PMD, ...) is called, which first takes rmap locks, then does move_normal_pmd(). move_normal_pmd() takes the necessary page table locks at source and destination, then moves an entire page table from the source to the destination.

The problem is: The rmap locks, which protect against concurrent page table removal by retract_page_tables() in the THP code, are only taken after the PMD entry has been read and it has been decided how to move it. So we can race as follows (with two processes that have mappings of the same tmpfs file that is stored on a tmpfs mount with huge=advise); note that process A accesses page tables through the MM while process B does it through the file rmap:

process A                      process B
=========                      =========
mremap
  mremap_to
    move_vma
      move_page_tables
        get_old_pmd
        alloc_new_pmd
                      *** PREEMPT ***
                               madvise(MADV_COLLAPSE)
                                 do_madvise
                                   madvise_walk_vmas
                                     madvise_vma_behavior
                                       madvise_collapse
                                         hpage_collapse_scan_file
                                           collapse_file
                                             retract_page_tables
                                               i_mmap_lock_read(mapping)
                                               pmdp_collapse_flush
                                               i_mmap_unlock_read(mapping)
        move_pgt_entry(NORMAL_PMD, ...)
          take_rmap_locks
          move_normal_pmd
          drop_rmap_locks

When this happens, move_normal_pmd() can end up creating bogus PMD entries in the line `pmd_populate(mm, new_pmd, pmd_pgtable(pmd))`. The effect depends on arch-specific and machine-specific details; on x86, you can end up with physical page 0 mapped as a page table, which is likely exploitable for user->kernel privilege escalation.

Fix the race by letting process B recheck that the PMD still points to a page table after the rmap locks have been taken. Otherwise, we bail and let the caller fall back to the PTE-level copying path, which will then bail immediately at the pmd_none() check.

Bug reachability: Reaching this bug requires that you can create shmem/file THP mappings - anonymous THP uses different code that doesn't zap stuff under rmap locks. File THP is gated on an experimental config flag (CONFIG_READ_ONLY_THP_FOR_FS), so on normal distro kernels you need shmem THP to hit this bug. As far as I know, getting shmem THP normally requires that you can mount your own tmpfs with the right mount flags, which would require creating your own user+mount namespace; though I don't know if some distros maybe enable shmem THP by default or something like that.

Bug impact: This issue can likely be used for user->kernel privilege escalation when it is reachable.

Link: https://lkml.kernel.org/r/20241007-move_normal_pmd-vs-collapse-fix-2-v1-1-5ead9631f2ea@google.com
Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Closes: https://project-zero.issues.chromium.org/371047675
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17    mm/mmap: correct error handling in mmap_region()    Lorenzo Stoakes
Commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree in mmap_region()") changed how error handling is performed in mmap_region(). The error value defaults to -ENOMEM, but then gets reassigned immediately to the result of vms_gather_munmap_vmas() if we are performing a MAP_FIXED mapping over existing VMAs (and thus unmapping them). This overwrites the error value, potentially clearing it. After this, we invoke may_expand_vm() and possibly vm_area_alloc(), and check to see if they failed. If they do so, then we perform error-handling logic, but importantly, we do NOT update the error code. This means that, if vms_gather_munmap_vmas() succeeds, but one of these calls does not, the function will return indicating no error, but rather an address value of zero, which is entirely incorrect. Correct this and avoid future confusion by strictly setting error on each and every occasion we jump to the error handling logic, and set the error code immediately prior to doing so. This way we can see at a glance that the error code is always correct. Many thanks to Vegard Nossum who spotted this issue in discussion around this problem. Link: https://lkml.kernel.org/r/20241002073932.13482-1-lorenzo.stoakes@oracle.com Fixes: f8d112a4e657 ("mm/mmap: avoid zeroing vma tree in mmap_region()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Vegard Nossum <vegard.nossum@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
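A hedged illustration of the pattern the fix adopts (simplified, with hypothetical label names; not the actual mmap_region() hunk): assign the error code immediately before every jump to the error path, so it can never be left stale or accidentally cleared.

            if (!may_expand_vm(mm, vm_flags, pglen)) {
                    error = -ENOMEM;        /* set right before the jump */
                    goto error_path;        /* hypothetical label */
            }

            vma = vm_area_alloc(mm);
            if (!vma) {
                    error = -ENOMEM;        /* set right before the jump */
                    goto error_path;        /* hypothetical label */
            }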
2024-10-16    mm/bpf: Add bpf_get_kmem_cache() kfunc    Namhyung Kim
bpf_get_kmem_cache() is used to get slab cache information from a virtual address, like virt_to_cache(). If the address is a pointer to a slab object, it returns a valid kmem_cache pointer, otherwise NULL is returned. It doesn't grab a reference count of the kmem_cache, so the caller is responsible for managing the access. The returned pointer is marked as PTR_UNTRUSTED. The intended use case for now is to symbolize locks in slab objects from the lock contention tracepoints. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*) Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241010232505.1339892-3-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
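A hypothetical BPF-side usage sketch (not from the patch; assumes a vmlinux.h build and the kfunc prototype shown below, which should be treated as an assumption):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    extern struct kmem_cache *bpf_get_kmem_cache(u64 addr) __ksym __weak;

    int slab_lock_hits;     /* bumped when a contended lock lives in a slab object */

    SEC("tp_btf/contention_begin")
    int BPF_PROG(on_contention_begin, void *lock, unsigned int flags)
    {
            struct kmem_cache *s = bpf_get_kmem_cache((u64)lock);

            /* s is PTR_UNTRUSTED (or NULL); here it is only used as a yes/no signal */
            if (s)
                    __sync_fetch_and_add(&slab_lock_hits, 1);
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";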
2024-10-16    mm/damon/core: Use generic upper bound recommendation for usleep_range()    Anna-Maria Behnsen
The upper bound for usleep_range_idle() was taken from outdated documentation. As the recommendation for the upper bound of usleep_range() depends on the HZ configuration, it is not possible to hard code it. Use the define "USLEEP_RANGE_UPPER_BOUND" instead. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-8-dc8b907cb62f@linutronix.de
2024-10-16    timers: Rename usleep_idle_range() to usleep_range_idle()    Anna-Maria Behnsen
usleep_idle_range() is a variant of usleep_range(). Both use usleep_range_state() as a base. To be able to find all the related functions in one go, rename usleep_idle_range() to usleep_range_idle(). No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-4-dc8b907cb62f@linutronix.de
2024-10-15    rust: treewide: switch to the kernel `Vec` type    Danilo Krummrich
Now that we got the kernel `Vec` in place, convert all existing `Vec` users to make use of it. Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Benno Lossin <benno.lossin@proton.me> Reviewed-by: Gary Guo <gary@garyguo.net> Signed-off-by: Danilo Krummrich <dakr@kernel.org> Link: https://lore.kernel.org/r/20241004154149.93856-20-dakr@kernel.org [ Converted `kasan_test_rust.rs` too, as discussed. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2024-10-10    Merge patch series "timekeeping/fs: multigrain timestamp redux"    Christian Brauner
Jeff Layton <jlayton@kernel.org> says:

The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g. backup applications).

If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value.

This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees.

To remedy this, keep a global monotonic atomic64_t value that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid.

Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems).

* patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org:
  tmpfs: add support for multigrain timestamps
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  Documentation: add a new file documenting multigrain timestamps
  fs: add percpu counters for significant multigrain timestamp events
  fs: tracepoints around multigrain timestamp events
  fs: handle delegated timestamps in setattr_copy_mgtime
  fs: have setattr_copy handle multigrain timestamps appropriately
  fs: add infrastructure for multigrain timestamps

Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10    tmpfs: add support for multigrain timestamps    Jeff Layton
Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. tmpfs only requires the FS_MGTIME flag. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-12-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
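A hedged sketch of what opting in looks like for tmpfs (mm/shmem.c declares more fields than shown here; FS_MGTIME is the relevant addition):

    static struct file_system_type shmem_fs_type = {
            .owner          = THIS_MODULE,
            .name           = "tmpfs",
            .init_fs_context = shmem_init_fs_context,
            .kill_sb        = kill_litter_super,
            /* FS_MGTIME opts this filesystem into multigrain timestamps */
            .fs_flags       = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME,
    };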
2024-10-09    mm: zswap: delete comments for "value" member of 'struct zswap_entry'    Kanchana P Sridhar
Made a minor edit in the comments for 'struct zswap_entry' to delete the description of the 'value' member that was deleted in commit 20a5532ffa53 ("mm: remove code to handle same filled pages"). Link: https://lkml.kernel.org/r/20241002194213.30041-1-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Fixes: 20a5532ffa53 ("mm: remove code to handle same filled pages") Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Usama Arif <usamaarif642@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09    secretmem: disable memfd_secret() if arch cannot set direct map    Patrick Roy
Return -ENOSYS from the memfd_secret() syscall if !can_set_direct_map(). This is the case for example on some arm64 configurations, where marking 4k PTEs in the direct map not present can only be done if the direct map is set up at 4k granularity in the first place (as ARM's break-before-make semantics do not easily allow breaking apart large/gigantic pages). More precisely, on arm64 systems with !can_set_direct_map(), set_direct_map_invalid_noflush() is a no-op, however it returns success (0) instead of an error. This means that memfd_secret will seemingly "work" (e.g. syscall succeeds, you can mmap the fd and fault in pages), but it does not actually achieve its goal of removing its memory from the direct map. Note that with this patch, memfd_secret() will start erroring on systems where can_set_direct_map() returns false (arm64 with CONFIG_RODATA_FULL_DEFAULT_ENABLED=n, CONFIG_DEBUG_PAGEALLOC=n and CONFIG_KFENCE=n), but that still seems better than the current silent failure. Since CONFIG_RODATA_FULL_DEFAULT_ENABLED defaults to 'y', most arm64 systems actually have a working memfd_secret() and aren't affected. From going through the iterations of the original memfd_secret patch series, it seems that disabling the syscall in these scenarios was the intended behavior [1] (preferred over having set_direct_map_invalid_noflush return an error as that would result in SIGBUSes at page-fault time), however the check for it got dropped between v16 [2] and v17 [3], when secretmem moved away from CMA allocations. [1]: https://lore.kernel.org/lkml/20201124164930.GK8537@kernel.org/ [2]: https://lore.kernel.org/lkml/20210121122723.3446-11-rppt@kernel.org/#t [3]: https://lore.kernel.org/lkml/20201125092208.12544-10-rppt@kernel.org/ Link: https://lkml.kernel.org/r/20241001080056.784735-1-roypat@amazon.co.uk Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Patrick Roy <roypat@amazon.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
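A hedged sketch of where such a check lives (mm/secretmem.c; the surrounding fd/file setup is elided and the `secretmem_enable` name should be treated as an assumption):

    SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
    {
            /* Without the ability to change the direct map, secretmem cannot
             * actually hide its pages, so refuse the syscall outright instead
             * of silently "succeeding". */
            if (!secretmem_enable || !can_set_direct_map())
                    return -ENOSYS;

            /* ... normal anon inode / fd creation continues here ... */
            return -ENOSYS; /* placeholder so the sketch stands alone */
    }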
2024-10-09    mm/huge_memory: check pmd_special() only after pmd_present()    David Hildenbrand
We should only check for pmd_special() after we made sure that we have a present PMD. For example, if we have a migration PMD, pmd_special() might indicate that we have a special PMD although we really don't. This fixes confusing migration entries as PFN mappings, and not doing what we are supposed to do in the "is_swap_pmd()" case further down in the function -- including messing up COW, page table handling and accounting. Link: https://lkml.kernel.org/r/20240926154234.2247217-1-david@redhat.com Fixes: bc02afbd4d73 ("mm/fork: accept huge pfnmap entries") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+bf2c35fa302ebe3c7471@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/66f15c8d.050a0220.c23dd.000f.GAE@google.com/ Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-07    rust: enable `clippy::undocumented_unsafe_blocks` lint    Miguel Ojeda
Checking that we are not missing any `// SAFETY` comments in our `unsafe` blocks is something we have wanted to do for a long time, as well as cleaning up the remaining cases that were not documented [1]. Back when Rust for Linux started, this was something that could have been done via a script, like Rust's `tidy`. Soon after, in Rust 1.58.0, Clippy implemented the `undocumented_unsafe_blocks` lint [2]. Even though the lint has a few false positives, e.g. in some cases where attributes appear between the comment and the `unsafe` block [3], there are workarounds and the lint seems quite usable already. Thus enable the lint now. We still have a few cases to clean up, so just allow those for the moment by writing a `TODO` comment -- some of those may be good candidates for new contributors. Link: https://github.com/Rust-for-Linux/linux/issues/351 [1] Link: https://rust-lang.github.io/rust-clippy/master/#/undocumented_unsafe_blocks [2] Link: https://github.com/rust-lang/rust-clippy/issues/13189 [3] Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Trevor Gross <tmgross@umich.edu> Tested-by: Gary Guo <gary@garyguo.net> Reviewed-by: Gary Guo <gary@garyguo.net> Link: https://lore.kernel.org/r/20240904204347.168520-5-ojeda@kernel.org Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2024-10-04    mm: Introduce ARCH_HAS_USER_SHADOW_STACK    Mark Brown
Since multiple architectures have support for shadow stacks and we need to select support for this feature in several places in the generic code provide a generic config option that the architectures can select. Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Deepak Gupta <debug@rivosinc.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Kees Cook <kees@kernel.org> Tested-by: Kees Cook <kees@kernel.org> Acked-by: Shuah Khan <skhan@linuxfoundation.org> Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org> Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-1-222b78d87eee@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04    migrate: Remove references to Private2    Matthew Wilcox (Oracle)
These comments are now stale; rewrite them. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241002040111.1023018-7-willy@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-04    fs: Move clearing of mappedtodisk to buffer.c    Matthew Wilcox (Oracle)
The mappedtodisk flag is only meaningful for buffer head based filesystems. It should not be cleared for other filesystems. This allows us to reuse the mappedtodisk flag to have other meanings in filesystems that do not use buffer heads. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241002040111.1023018-2-willy@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-02    mm, slab: suppress warnings in test_leak_destroy kunit test    Vlastimil Babka
The test_leak_destroy kunit test intends to test the detection of stray objects in kmem_cache_destroy(), which normally produces a warning. The other slab kunit tests suppress the warnings in the kunit test context, so suppress warnings and related printk output in this test as well. Automated test running environments then don't need to learn to filter the warnings. Also rename the test's kmem_cache, the name was wrongly copy-pasted from test_kfree_rcu. Fixes: 4e1c44b3db79 ("kunit, slub: add test_kfree_rcu() and test_leak_destroy()") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202408251723.42f3d902-oliver.sang@intel.com Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Closes: https://lore.kernel.org/all/CAB=+i9RHHbfSkmUuLshXGY_ifEZg9vCZi3fqr99+kmmnpDus7Q@mail.gmail.com/ Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/all/6fcb1252-7990-4f0d-8027-5e83f0fb9409@roeck-us.net/ Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-10-02    filemap: filemap_read() should check that the offset is positive or zero    Trond Myklebust
We do check that the read offset is less than the filesystem limit, however for good measure we should also check that it is positive or zero, and return EINVAL if that is not the case. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Link: https://lore.kernel.org/r/482ee0b8a30b62324adb9f7c551a99926f037393.1726257832.git.trond.myklebust@hammerspace.com Signed-off-by: Christian Brauner <brauner@kernel.org>
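A hedged sketch of what the added check might look like near the top of filemap_read() in mm/filemap.c (placement and ordering are approximate, not quoted from the patch):

            /* existing check against the filesystem limit */
            if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes))
                    return 0;
            /* new check: a negative read offset is simply invalid */
            if (unlikely(iocb->ki_pos < 0))
                    return -EINVAL;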
2024-10-01    mm, slab: fix use of SLAB_SUPPORTS_SYSFS in kmem_cache_release()    Nilay Shroff
The fix implemented in commit 4ec10268ed98 ("mm, slab: unlink slabinfo, sysfs and debugfs immediately") caused a subtle side effect due to which, while destroying a kmem cache, the code path would never get into the sysfs_slab_release() function, even though SLAB_SUPPORTS_SYSFS is defined and the slab state is FULL. Due to this side effect, we would never release the kobject defined for the kmem cache and would leak the associated memory. The issue here is with the use of the __is_defined() macro in kmem_cache_release(). The __is_defined() macro expands to __take_second_arg(arg1_or_junk 1, 0). If "arg1_or_junk" is defined to 1 then it expands to __take_second_arg(0, 1, 0) and returns 1. If "arg1_or_junk" is NOT defined to any value then it expands to __take_second_arg(... 1, 0) and returns 0. In this particular issue, SLAB_SUPPORTS_SYSFS is defined without any associated value, and that causes __is_defined(SLAB_SUPPORTS_SYSFS) to always evaluate to 0, hence it would never invoke sysfs_slab_release(). This patch fixes the issue by defining SLAB_SUPPORTS_SYSFS to 1. Fixes: 4ec10268ed98 ("mm, slab: unlink slabinfo, sysfs and debugfs immediately") Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs9YCCcfmdxN43-9H3HnTYQsRtTYw1Kzq-L468GfLKAENA@mail.gmail.com/ Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
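For reference, the macro machinery involved (copied in spirit from include/linux/kconfig.h), followed by two hypothetical defines that make the described behavior easy to see:

    #define __ARG_PLACEHOLDER_1 0,
    #define __take_second_arg(__ignored, val, ...) val
    #define __is_defined(x)                 ___is_defined(x)
    #define ___is_defined(val)              ____is_defined(__ARG_PLACEHOLDER_##val)
    #define ____is_defined(arg1_or_junk)    __take_second_arg(arg1_or_junk 1, 0)

    #define FOO_EMPTY          /* defined, but to nothing: __is_defined(FOO_EMPTY) is 0 */
    #define FOO_ONE 1          /* defined to 1:            __is_defined(FOO_ONE) is 1   */

This is why an empty "#define SLAB_SUPPORTS_SYSFS" defeats __is_defined(), while "#define SLAB_SUPPORTS_SYSFS 1" behaves as intended.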
2024-09-27    Merge tag 'mm-hotfixes-stable-2024-09-27-09-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm    Linus Torvalds
Pull misc fixes from Andrew Morton:
 "19 hotfixes. 13 are cc:stable. There's a focus on fixes for the memfd_pin_folios() work which was added into 6.11. Apart from that, the usual shower of singleton fixes"

* tag 'mm-hotfixes-stable-2024-09-27-09-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  ocfs2: fix uninit-value in ocfs2_get_block()
  zram: don't free statically defined names
  memory tiers: use default_dram_perf_ref_source in log message
  Revert "list: test: fix tests for list_cut_position()"
  kselftests: mm: fix wrong __NR_userfaultfd value
  compiler.h: specify correct attribute for .rodata..c_jump_table
  mm/damon/Kconfig: update DAMON doc URL
  mm: kfence: fix elapsed time for allocated/freed track
  ocfs2: fix deadlock in ocfs2_get_system_file_inode
  ocfs2: reserve space for inline xattr before attaching reflink tree
  mm: migrate: annotate data-race in migrate_folio_unmap()
  mm/hugetlb: simplify refs in memfd_alloc_folio
  mm/gup: fix memfd_pin_folios alloc race panic
  mm/gup: fix memfd_pin_folios hugetlb page allocation
  mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
  mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
  mm/filemap: fix filemap_get_folios_contig THP panic
  mm: make SPLIT_PTE_PTLOCKS depend on SMP
  tools: fix shared radix-tree build
2024-09-27    [tree-wide] finally take no_llseek out    Al Viro
no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek"). To quote that commit,

  At -rc1 we'll need do a mechanical removal of no_llseek -

  git grep -l -w no_llseek | grep -v porting.rst | while read i; do
      sed -i '/\<no_llseek\>/d' $i
  done

  would do it.

Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form

  .llseek = no_llseek,

so it's obviously safe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-26    memory tiers: use default_dram_perf_ref_source in log message    Huang Ying
Commit 3718c02dbd4c ("acpi, hmat: calculate abstract distance with HMAT") added a default_dram_perf_ref_source variable that was initialized but never used. This causes kmemleak to report the following memory leak:

unreferenced object 0xff11000225a47b60 (size 16):
  comm "swapper/0", pid 1, jiffies 4294761654
  hex dump (first 16 bytes):
    41 43 50 49 20 48 4d 41 54 00 c1 4b 7d b7 75 7c  ACPI HMAT..K}.u|
  backtrace (crc e6d0e7b2):
    [<ffffffff95d5afdb>] __kmalloc_node_track_caller_noprof+0x36b/0x440
    [<ffffffff95c276d6>] kstrdup+0x36/0x60
    [<ffffffff95dfabfa>] mt_set_default_dram_perf+0x23a/0x2c0
    [<ffffffff9ad64733>] hmat_init+0x2b3/0x660
    [<ffffffff95203cec>] do_one_initcall+0x11c/0x5c0
    [<ffffffff9ac9cfc4>] do_initcalls+0x1b4/0x1f0
    [<ffffffff9ac9d52e>] kernel_init_freeable+0x4ae/0x520
    [<ffffffff97c789cc>] kernel_init+0x1c/0x150
    [<ffffffff952aecd1>] ret_from_fork+0x31/0x70
    [<ffffffff9520b18a>] ret_from_fork_asm+0x1a/0x30

This reminds us that we forgot to use the performance data source information. So, use the variable in the error log message to help identify the root cause of inconsistent performance numbers.

Link: https://lkml.kernel.org/r/87y13mvo0n.fsf@yhuang6-desk2.ccr.corp.intel.com
Fixes: 3718c02dbd4c ("acpi, hmat: calculate abstract distance with HMAT")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Waiman Long <longman@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26    mm/damon/Kconfig: update DAMON doc URL    Diederik de Haas
The old URL doesn't really work anymore and as the documentation has been integrated in the main kernel documentation site, change the URL to point to that. Link: https://lkml.kernel.org/r/20240924082331.11499-1-didi.debian@cknow.org Signed-off-by: Diederik de Haas <didi.debian@cknow.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26    mm: kfence: fix elapsed time for allocated/freed track    qiwu.chen
Fix elapsed time for the allocated/freed track introduced by commit 62e73fd85d7bf. Link: https://lkml.kernel.org/r/20240924085004.75401-1-qiwu.chen@transsion.com Fixes: 62e73fd85d7b ("mm: kfence: print the elapsed time for allocated/freed track") Signed-off-by: qiwu.chen <qiwu.chen@transsion.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26    mm: migrate: annotate data-race in migrate_folio_unmap()    Jeongjun Park
I found a report from syzbot [1]. This report shows that the value can be changed, but in reality, the value set by __folio_set_movable() cannot change while the folio refcount is held. Therefore, it is appropriate to add an annotation to make KCSAN ignore that data-race.

[1]
==================================================================
BUG: KCSAN: data-race in __filemap_remove_folio / migrate_pages_batch

write to 0xffffea0004b81dd8 of 8 bytes by task 6348 on cpu 0:
 page_cache_delete mm/filemap.c:153 [inline]
 __filemap_remove_folio+0x1ac/0x2c0 mm/filemap.c:233
 filemap_remove_folio+0x6b/0x1f0 mm/filemap.c:265
 truncate_inode_folio+0x42/0x50 mm/truncate.c:178
 shmem_undo_range+0x25b/0xa70 mm/shmem.c:1028
 shmem_truncate_range mm/shmem.c:1144 [inline]
 shmem_evict_inode+0x14d/0x530 mm/shmem.c:1272
 evict+0x2f0/0x580 fs/inode.c:731
 iput_final fs/inode.c:1883 [inline]
 iput+0x42a/0x5b0 fs/inode.c:1909
 dentry_unlink_inode+0x24f/0x260 fs/dcache.c:412
 __dentry_kill+0x18b/0x4c0 fs/dcache.c:615
 dput+0x5c/0xd0 fs/dcache.c:857
 __fput+0x3fb/0x6d0 fs/file_table.c:439
 ____fput+0x1c/0x30 fs/file_table.c:459
 task_work_run+0x13a/0x1a0 kernel/task_work.c:228
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
 exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
 __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
 syscall_exit_to_user_mode+0xbe/0x130 kernel/entry/common.c:218
 do_syscall_64+0xd6/0x1c0 arch/x86/entry/common.c:89
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffffea0004b81dd8 of 8 bytes by task 6342 on cpu 1:
 __folio_test_movable include/linux/page-flags.h:699 [inline]
 migrate_folio_unmap mm/migrate.c:1199 [inline]
 migrate_pages_batch+0x24c/0x1940 mm/migrate.c:1797
 migrate_pages_sync mm/migrate.c:1963 [inline]
 migrate_pages+0xff1/0x1820 mm/migrate.c:2072
 do_mbind mm/mempolicy.c:1390 [inline]
 kernel_mbind mm/mempolicy.c:1533 [inline]
 __do_sys_mbind mm/mempolicy.c:1607 [inline]
 __se_sys_mbind+0xf76/0x1160 mm/mempolicy.c:1603
 __x64_sys_mbind+0x78/0x90 mm/mempolicy.c:1603
 x64_sys_call+0x2b4d/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:238
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0xffff888127601078 -> 0x0000000000000000

Link: https://lkml.kernel.org/r/20240924130053.107490-1-aha310510@gmail.com
Fixes: 7e2a5e5ab217 ("mm: migrate: use __folio_test_movable()")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26    mm/hugetlb: simplify refs in memfd_alloc_folio    Steve Sistare
The folio_try_get in memfd_alloc_folio is not necessary. Delete it, and delete the matching folio_put in memfd_pin_folios. This also avoids leaking a ref if the memfd_alloc_folio call to hugetlb_add_to_page_cache fails. That error path is also broken in a second way -- when its folio_put causes the ref to become 0, it will implicitly call free_huge_folio, but then the path *explicitly* calls free_huge_folio. Delete the latter. This is a continuation of the fix "mm/hugetlb: fix memfd_pin_folios free_huge_pages leak" [steven.sistare@oracle.com: remove explicit call to free_huge_folio(), per Matthew] Link: https://lkml.kernel.org/r/Zti-7nPVMcGgpcbi@casper.infradead.org Link: https://lkml.kernel.org/r/1725481920-82506-1-git-send-email-steven.sistare@oracle.com Link: https://lkml.kernel.org/r/1725478868-61732-1-git-send-email-steven.sistare@oracle.com Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios") Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Suggested-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26mm/gup: fix memfd_pin_folios alloc race panicSteve Sistare
If memfd_pin_folios tries to create a hugetlb page, but someone else
already did, then folio gets the value -EEXIST here:

        folio = memfd_alloc_folio(memfd, start_idx);
        if (IS_ERR(folio)) {
                ret = PTR_ERR(folio);
                if (ret != -EEXIST)
                        goto err;

then on the next trip through the "while start_idx" loop we panic here:

        if (folio) {
                folio_put(folio);

To fix, set the folio to NULL on error.

Link: https://lkml.kernel.org/r/1725373521-451395-6-git-send-email-steven.sistare@oracle.com
Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
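The fix amounts to forgetting the error-valued pointer once the tolerated -EEXIST case has been handled; a sketch of the quoted snippet with that one change added (illustrative, not the literal patch) could look like this:

    folio = memfd_alloc_folio(memfd, start_idx);
    if (IS_ERR(folio)) {
            ret = PTR_ERR(folio);
            if (ret != -EEXIST)
                    goto err;
            /* Someone else created the page first; clear the error value
             * so the next pass of the loop does not folio_put() it. */
            folio = NULL;
    }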
2024-09-26mm/gup: fix memfd_pin_folios hugetlb page allocationSteve Sistare
When memfd_pin_folios -> memfd_alloc_folio creates a hugetlb page, the
index is wrong. The subsequent call to filemap_get_folios_contig thus
cannot find it, and fails, and memfd_pin_folios loops forever. To fix,
adjust the index for the huge_page_order.

memfd_alloc_folio also forgets to unlock the folio, so the next touch of
the page calls hugetlb_fault which blocks forever trying to take the
lock. Unlock it.

Link: https://lkml.kernel.org/r/1725373521-451395-5-git-send-email-steven.sistare@oracle.com
Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
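A sketch of the two adjustments described above, assuming 'h' is the hstate for the memfd's hugetlbfs mount and 'idx' is the base-page index (illustrative only, not the literal patch):

    /*
     * hugetlb page-cache entries are indexed in huge-page units, so scale
     * the base-page index down before inserting, and leave the freshly
     * created folio unlocked so a later fault does not block on it.
     */
    err = hugetlb_add_to_page_cache(folio, memfd->f_mapping,
                                    idx >> huge_page_order(h));
    if (!err)
            folio_unlock(folio);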
2024-09-26mm/hugetlb: fix memfd_pin_folios resv_huge_pages leakSteve Sistare
memfd_pin_folios followed by unpin_folios leaves resv_huge_pages elevated
if the pages were not already faulted in. During a normal page fault,
resv_huge_pages is consumed here:

    hugetlb_fault()
      alloc_hugetlb_folio()
        dequeue_hugetlb_folio_vma()
          dequeue_hugetlb_folio_nodemask()
            dequeue_hugetlb_folio_node_exact()
              free_huge_pages--
        resv_huge_pages--

During memfd_pin_folios, the page is created by calling
alloc_hugetlb_folio_nodemask instead of alloc_hugetlb_folio, and
resv_huge_pages is not modified:

    memfd_alloc_folio()
      alloc_hugetlb_folio_nodemask()
        dequeue_hugetlb_folio_nodemask()
          dequeue_hugetlb_folio_node_exact()
            free_huge_pages--

alloc_hugetlb_folio_nodemask has other callers that must not modify
resv_huge_pages. Therefore, to fix, define an alternate version of
alloc_hugetlb_folio_nodemask for this call site that adjusts
resv_huge_pages.

Link: https://lkml.kernel.org/r/1725373521-451395-4-git-send-email-steven.sistare@oracle.com
Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
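To make the accounting difference concrete, here is a hypothetical sketch of what such an alternate allocator could look like; the function name is invented for illustration, and the real helper may handle locking, availability checks, and fallback differently:

    /* Hypothetical helper: dequeue a free hugetlb folio much like
     * alloc_hugetlb_folio_nodemask() does, but also consume one
     * reservation so the bookkeeping matches a normal page fault. */
    static struct folio *memfd_dequeue_hugetlb_folio(struct hstate *h,
                    gfp_t gfp_mask, int nid, nodemask_t *nmask)
    {
            struct folio *folio;

            spin_lock_irq(&hugetlb_lock);
            folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, nid, nmask);
            if (folio)
                    h->resv_huge_pages--;   /* the step the generic path skips */
            spin_unlock_irq(&hugetlb_lock);
            return folio;
    }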
2024-09-26mm/hugetlb: fix memfd_pin_folios free_huge_pages leakSteve Sistare
memfd_pin_folios followed by unpin_folios fails to restore
free_huge_pages if the pages were not already faulted in, because the
folio refcount for pages created by memfd_alloc_folio never goes to 0.
memfd_pin_folios needs another folio_put to undo the folio_try_get below:

    memfd_alloc_folio()
      alloc_hugetlb_folio_nodemask()
        dequeue_hugetlb_folio_nodemask()
          dequeue_hugetlb_folio_node_exact()
            folio_ref_unfreeze(folio, 1);   ; adds 1 refcount
      folio_try_get()                       ; adds 1 refcount
      hugetlb_add_to_page_cache()           ; adds 512 refcount (on x86)

With the fix, after memfd_pin_folios + unpin_folios, the refcount for the
(unfaulted) page is 512, which is correct, as the refcount for a faulted
unpinned page is 513.

Link: https://lkml.kernel.org/r/1725373521-451395-3-git-send-email-steven.sistare@oracle.com
Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26mm/filemap: fix filemap_get_folios_contig THP panicSteve Sistare
Patch series "memfd-pin huge page fixes".

Fix multiple bugs that occur when using memfd_pin_folios with hugetlb
pages and THP. The hugetlb bugs only bite when the page is not yet
faulted in when memfd_pin_folios is called. The THP bug bites when the
starting offset passed to memfd_pin_folios is not huge page aligned. See
the commit messages for details.

This patch (of 5):

memfd_pin_folios on memory backed by THP panics if the requested start
offset is not huge page aligned:

    BUG: kernel NULL pointer dereference, address: 0000000000000036
    RIP: 0010:filemap_get_folios_contig+0xdf/0x290
    RSP: 0018:ffffc9002092fbe8 EFLAGS: 00010202
    RAX: 0000000000000002 RBX: 0000000000000002 RCX: 0000000000000002

The fault occurs here, because xas_load returns a folio with value 2:

    filemap_get_folios_contig()
        for (folio = xas_load(&xas); folio && xas.xa_index <= end;
                        folio = xas_next(&xas)) {
                ...
                if (!folio_try_get(folio))   <-- BOOM

"2" is an xarray sibling entry. We get it because memfd_pin_folios does
not round the indices passed to filemap_get_folios_contig to huge page
boundaries for THP, so we load from the middle of a huge page range and
see a sibling. (It does round for hugetlbfs, at the is_file_hugepages
test).

To fix, if the folio is a sibling, then return the next index as the
starting point for the next call to filemap_get_folios_contig.

Link: https://lkml.kernel.org/r/1725373521-451395-1-git-send-email-steven.sistare@oracle.com
Link: https://lkml.kernel.org/r/1725373521-451395-2-git-send-email-steven.sistare@oracle.com
Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios")
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
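The described fix boils down to recognizing the sibling entry inside the lookup loop and reporting the following index back to the caller. A sketch of that idea inside filemap_get_folios_contig() (placement and label name assumed, not the literal patch):

    if (xa_is_sibling(folio)) {
            /* We landed inside a multi-order (THP) range; this is not a
             * folio pointer, so stop and let the caller retry from the
             * next index. */
            *start = xas.xa_index + 1;
            goto update_start;
    }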
2024-09-26mm: make SPLIT_PTE_PTLOCKS depend on SMPGuenter Roeck
SPLIT_PTE_PTLOCKS depends on "NR_CPUS >= 4". Unfortunately, that
evaluates to true if there is no NR_CPUS configuration option. This
results in CONFIG_SPLIT_PTE_PTLOCKS=y for mac_defconfig. This in turn
causes the m68k "q800" and "virt" machines to crash in qemu if debugging
options are enabled.

Making CONFIG_SPLIT_PTE_PTLOCKS dependent on the existence of NR_CPUS
does not work since a dependency on the existence of a numeric Kconfig
entry always evaluates to false. Example:

    config HAVE_NO_NR_CPUS
        def_bool y
        depends on !NR_CPUS

After adding this to a Kconfig file, "make defconfig" includes:

    $ grep NR_CPUS .config
    CONFIG_NR_CPUS=64
    CONFIG_HAVE_NO_NR_CPUS=y

Defining NR_CPUS for m68k does not help either since many architectures
define NR_CPUS only for SMP configurations.

Make SPLIT_PTE_PTLOCKS depend on SMP instead to solve the problem.

Link: https://lkml.kernel.org/r/20240924154205.1491376-1-linux@roeck-us.net
Fixes: 394290cba966 ("mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PMD_PTLOCKS into Kconfig options")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
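For reference, the resulting dependency looks roughly like the simplified Kconfig fragment below; the real SPLIT_PTE_PTLOCKS entry in mm/Kconfig carries additional architecture-specific dependencies that are omitted here:

    config SPLIT_PTE_PTLOCKS
            def_bool y
            depends on SMP
            depends on NR_CPUS >= 4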