linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2025-04-01	mseal sysmap: kernel config and header change	Jeff Xu
	Patch series "mseal system mappings", v9. As discussed during mseal() upstream process [1], mseal() protects the VMAs of a given virtual memory range against modifications, such as the read/write (RW) and no-execute (NX) bits. For complete descriptions of memory sealing, please see mseal.rst [2]. The mseal() is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. The system mappings are readonly only, memory sealing can protect them from ever changing to writable or unmmap/remapped as different attributes. System mappings such as vdso, vvar, vvar_vclock, vectors (arm compat-mode), sigpage (arm compat-mode), are created by the kernel during program initialization, and could be sealed after creation. Unlike the aforementioned mappings, the uprobe mapping is not established during program startup. However, its lifetime is the same as the process's lifetime [3]. It could be sealed from creation. The vsyscall on x86-64 uses a special address (0xffffffffff600000), which is outside the mm managed range. This means mprotect, munmap, and mremap won't work on the vsyscall. Since sealing doesn't enhance the vsyscall's security, it is skipped in this patch. If we ever seal the vsyscall, it is probably only for decorative purpose, i.e. showing the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored. It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may alter the system mappings during restore operations. UML(User Mode Linux) and gVisor, rr are also known to change the vdso/vvar mappings. Consequently, this feature cannot be universally enabled across all systems. As such, CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default. To support mseal of system mappings, architectures must define CONFIG_ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS and update their special mappings calls to pass mseal flag. Additionally, architectures must confirm they do not unmap/remap system mappings during the process lifetime. The existence of this flag for an architecture implies that it does not require the remapping of thest system mappings during process lifetime, so sealing these mappings is safe from a kernel perspective. This version covers x86-64 and arm64 archiecture as minimum viable feature. While no specific CPU hardware features are required for enable this feature on an archiecture, memory sealing requires a 64-bit kernel. Other architectures can choose whether or not to adopt this feature. Currently, I'm not aware of any instances in the kernel code that actively munmap/mremap a system mapping without a request from userspace. The PPC does call munmap when _install_special_mapping fails for vdso; however, it's uncertain if this will ever fail for PPC - this needs to be investigated by PPC in the future [4]. The UML kernel can add this support when KUnit tests require it [5]. In this version, we've improved the handling of system mapping sealing from previous versions, instead of modifying the _install_special_mapping function itself, which would affect all architectures, we now call _install_special_mapping with a sealing flag only within the specific architecture that requires it. This targeted approach offers two key advantages: 1) It limits the code change's impact to the necessary architectures, and 2) It aligns with the software architecture by keeping the core memory management within the mm layer, while delegating the decision of sealing system mappings to the individual architecture, which is particularly relevant since 32-bit architectures never require sealing. Prior to this patch series, we explored sealing special mappings from userspace using glibc's dynamic linker. This approach revealed several issues: - The PT_LOAD header may report an incorrect length for vdso, (smaller than its actual size). The dynamic linker, which relies on PT_LOAD information to determine mapping size, would then split and partially seal the vdso mapping. Since each architecture has its own vdso/vvar code, fixing this in the kernel would require going through each archiecture. Our initial goal was to enable sealing readonly mappings, e.g. .text, across all architectures, sealing vdso from kernel since creation appears to be simpler than sealing vdso at glibc. - The [vvar] mapping header only contains address information, not length information. Similar issues might exist for other special mappings. - Mappings like uprobe are not covered by the dynamic linker, and there is no effective solution for them. This feature's security enhancements will benefit ChromeOS, Android, and other high security systems. Testing: This feature was tested on ChromeOS and Android for both x86-64 and ARM64. - Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly, i.e. "sl" shown in the smaps for those mappings, and mremap is blocked. - Passing various automation tests (e.g. pre-checkin) on ChromeOS and Android to ensure the sealing doesn't affect the functionality of Chromebook and Android phone. I also tested the feature on Ubuntu on x86-64: - With config disabled, vdso/vvar is not sealed, - with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK, normal operations such as browsing the web, open/edit doc are OK. Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1] Link: Documentation/userspace-api/mseal.rst [2] Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3] Link: https://lore.kernel.org/all/CABi2SkV6JJwJeviDLsq9N4ONvQ=EFANsiWkgiEOjyT9TQSt+HA@mail.gmail.com/ [4] Link: https://lore.kernel.org/all/202502251035.239B85A93@keescook/ [5] This patch (of 7): Provide infrastructure to mseal system mappings. Establish two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS, ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS) and VM_SEALED_SYSMAP macro for future patches. Link: https://lkml.kernel.org/r/20250305021711.3867874-1-jeffxu@google.com Link: https://lkml.kernel.org/r/20250305021711.3867874-2-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org> Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Berg <benjamin@sipsolutions.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Elliot Hughes <enh@google.com> Cc: Florian Faineli <f.fainelli@gmail.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <mike.rapoport@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Röttger <sroettger@google.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm: pgtable: remove tlb_remove_page_ptdesc()	Qi Zheng
	The tlb_remove_ptdesc()/tlb_remove_table() is specially designed for page table pages, and now all architectures have been converted to use it to remove page table pages. So let's remove tlb_remove_page_ptdesc(), it currently has no users and should not be used for page table pages. Link: https://lkml.kernel.org/r/3df04c8494339073b71be4acb2d92e108ecd1b60.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	x86: pgtable: convert to use tlb_remove_ptdesc()	Qi Zheng
	The x86 has already been converted to use struct ptdesc, so convert it to use tlb_remove_ptdesc() instead of tlb_remove_table(). Link: https://lkml.kernel.org/r/36ad56b7e06fa4b17fb23c4fc650e8e0d72bb3cd.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	riscv: pgtable: unconditionally use tlb_remove_ptdesc()	Qi Zheng
	To support fast gup, the commit 69be3fb111e7 ("riscv: enable MMU_GATHER_RCU_TABLE_FREE for SMP && MMU") did the following: 1) use tlb_remove_page_ptdesc() for those platforms which use IPI to perform TLB shootdown 2) use tlb_remove_ptdesc() for those platforms which use SBI to perform TLB shootdown The tlb_remove_page_ptdesc() is the wrapper of the tlb_remove_page(). By design, the tlb_remove_page() should be used to remove a normal page from a page table entry, and should not be used for page table pages. The tlb_remove_ptdesc() is the wrapper of the tlb_remove_table(), which is designed specifically for freeing page table pages. If the CONFIG_MMU_GATHER_TABLE_FREE is enabled, the tlb_remove_table() will use semi RCU to free page table pages, that is: - batch table freeing: asynchronous free by RCU - single table freeing: IPI + synchronous free If the CONFIG_MMU_GATHER_TABLE_FREE is disabled, the tlb_remove_table() will fall back to pagetable_dtor() + tlb_remove_page(). For case 1), since we need to perform TLB shootdown before freeing the page table page, the local_irq_save() in fast gup can block the freeing and protect the fast gup page walker. Therefore we can ensure safety by just using tlb_remove_page_ptdesc(). In addition, we can also the tlb_remove_ptdesc()/tlb_remove_table() to achieve it, and it doesn't matter whether CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected. And in theory, the performance of freeing pages asynchronously via RCU will not be lower than synchronous free. For case 2), since local_irq_save() only disable S-privilege IPI irq but not M-privilege's, which is used by the SBI implementation to perform TLB shootdown, so we must select CONFIG_MMU_GATHER_RCU_TABLE_FREE and use tlb_remove_ptdesc() to ensure safety. The riscv selects this config for SMP && MMU, the CONFIG_RISCV_SBI is dependent on MMU. Therefore, only the UP system may have the situation where CONFIG_MMU_GATHER_RCU_TABLE_FREE is disabled but CONFIG_RISCV_SBI is enabled. But there is no freeing vs fast gup race in the UP system. So, in summary, we can use tlb_remove_ptdesc() to support fast gup in all cases, and this interface is specifically designed for page table pages. So let's use it unconditionally. Link: https://lkml.kernel.org/r/9025595e895515515c95e48db54b29afa489c41d.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm: pgtable: convert some architectures to use tlb_remove_ptdesc()	Qi Zheng
	Now, the nine architectures of csky, hexagon, loongarch, m68k, mips, nios2, openrisc, sh and um do not select CONFIG_MMU_GATHER_RCU_TABLE_FREE, and just call pagetable_dtor() + tlb_remove_page_ptdesc() (the wrapper of tlb_remove_page()). This is the same as the implementation of tlb_remove_{ptdesc\|table}() under !CONFIG_MMU_GATHER_TABLE_FREE, so convert these architectures to use tlb_remove_ptdesc(). The ultimate goal is to make the architecture only use tlb_remove_ptdesc() or tlb_remove_table() for page table pages. [zhengqi.arch@bytedance.com: v2] Link: https://lkml.kernel.org/r/20250303072603.45423-1-zhengqi.arch@bytedance.com [akpm@linux-foundation.org: remove trailing semi in arch/loongarch/include/asm/pgalloc.h] Link: https://lkml.kernel.org/r/19db3e8673b67bad2f1df1ab37f1c89d99eacfea.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc*	Qi Zheng
	All callers of tlb_remove_ptdesc() pass it a pointer of struct ptdesc, so let's change the pt parameter from void * to struct ptdesc * to perform a type safety check. Link: https://lkml.kernel.org/r/60bb44299cf2d731df6592e446e7f694054d0dbe.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm: pgtable: make generic tlb_remove_table() use struct ptdesc	Qi Zheng
	Patch series "remove tlb_remove_page_ptdesc()", v2. As suggested by Peter Zijlstra below [1], this series aims to remove tlb_remove_page_ptdesc(). : Fundamentally tlb_remove_page() is about removing pages as from a PTE, : there should not be a page-table anywhere near here ever. : : Yes, some architectures use tlb_remove_page() for page-tables too, but : that is more or less an implementation detail that can be fixed. After this series, all architectures use tlb_remove_table() or tlb_remove_ptdesc() to remove the page table pages. In the future, once all architectures using tlb_remove_table() have also converted to using struct ptdesc (eg. powerpc), it may be possible to use only tlb_remove_ptdesc(). [1] https://lore.kernel.org/linux-mm/20250103111457.GC22934@noisy.programming.kicks-ass.net/ This patch (of 6): Now only arm will call tlb_remove_ptdesc()/tlb_remove_table() when CONFIG_MMU_GATHER_TABLE_FREE is disabled. In this case, the type of the table parameter is actually struct ptdesc * instead of struct page . Since struct ptdesc still overlaps with struct page and has not been separated from it, forcing the table parameter to struct page will not cause any problems at this time. But this is definitely incorrect and needs to be fixed. So just like the generic __tlb_remove_table(), let generic tlb_remove_table() use struct ptdesc by default when CONFIG_MMU_GATHER_TABLE_FREE is disabled. Link: https://lkml.kernel.org/r/cover.1740454179.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/5be8c3ab7bd68510bf0db4cf84010f4dfe372917.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	microblaze/mm: put mm_cmdline_setup() in .init.text section	Wei Yang
	As reported by lkp, there is a section mismatch of mm_cmdline_setup() and memblock. The reason is we don't specify the section of mm_cmdline_setup() and gcc put it into .text.unlikely. As mm_cmdline_setup() is only used in mmu_init(), which is in .init.text section, put mm_cmdline_setup() into it too. Link: https://lkml.kernel.org/r/20250328010136.13139-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503241259.kJV3U7Xj-lkp@intel.com/ Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range	Jinjiang Tu
	We triggered the below BUG: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 pfn:0x240402 head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x1ffffe0000000040(head\|node=1\|zone=3\|lastcpupid=0x1ffff) page_type: f4(hugetlb) page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1) ------------[ cut here ]------------ kernel BUG at ./include/linux/page-flags.h:310! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : const_folio_flags+0x3c/0x58 lr : const_folio_flags+0x3c/0x58 Call trace: const_folio_flags+0x3c/0x58 (P) do_migrate_range+0x164/0x720 offline_pages+0x63c/0x6fc memory_subsys_offline+0x190/0x1f4 device_offline+0xc0/0x13c state_store+0x90/0xd8 dev_attr_store+0x18/0x2c sysfs_kf_write+0x44/0x54 kernfs_fop_write_iter+0x120/0x1cc vfs_write+0x240/0x378 ksys_write+0x70/0x108 __arm64_sys_write+0x1c/0x28 invoke_syscall+0x48/0x10c el0_svc_common.constprop.0+0x40/0xe0 When allocating a hugetlb folio, between the folio is taken from buddy and prep_compound_page() is called, start_isolate_page_range() and do_migrate_range() is called. When do_migrate_range() scans the head page of the hugetlb folio, the compound_head field isn't set, so scans the tail page next. And at this time, the compound_head field of tail page is set, folio_test_large() is called by tail page, thus triggers VM_BUG_ON(). To fix it, get folio refcount before calling folio_test_large(). Link: https://lkml.kernel.org/r/20250324131750.1551884-1-tujinjiang@huawei.com Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration") Fixes: b62b51d2d159 ("mm: memory_hotplug: remove head variable in do_migrate_range()") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	MAINTAINERS: mm: add entry for secretmem	Mike Rapoport (Microsoft)
	Link: https://lkml.kernel.org/r/20250326215541.1809379-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Willaims <dan.j.williams@intel.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	MAINTAINERS: mm: add entry for numa memblocks and numa emulation	Mike Rapoport (Microsoft)
	Link: https://lkml.kernel.org/r/20250326215541.1809379-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Willaims <dan.j.williams@intel.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	MAINTAINERS: mm: add entry for execmem	Mike Rapoport (Microsoft)
	Link: https://lkml.kernel.org/r/20250326215541.1809379-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Willaims <dan.j.williams@intel.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	MAINTAINERS: fixup USERFAULTFD entry	Mike Rapoport (Microsoft)
	Patch series "MAINTAINERS: add my isub-entries to MM part." Following discussion at LSF/MM/BPF I'm adding execmem, secretmem and numa memblocks sub-entries for MEMORY MANAGEMENT in MAINTAINERS. This patch (of 4): Change title to "MEMORY MANAGEMENT - USERFAULTFD" and make it sub-topic in memory management and add missing include/linux/userfaultfd_k.h and mailing list Link: https://lkml.kernel.org/r/20250326215541.1809379-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20250326215541.1809379-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Willaims <dan.j.williams@intel.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	selftest/mm: va_high_addr_switch: add ppc64 support check	Li Wang
	Add PPC64 Radix MMU support to the va_high_addr_switch.sh by introducing check_supported_ppc64(). The function verifies: - 5-level paging (PGTABLE_LEVELS >= 5) enable in kernel config - Radix MMU (required for PPC64 5-level translation) - HugePages availability (needed for some tests) If any check fails, the test is skipped (ksft_skip). This ensures compatibility with Power9/Power10 systems running in Radix MMU mode. Avoid failures on 4-level paging system: # mmap(NULL, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(LOW_ADDR, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(HIGH_ADDR, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(HIGH_ADDR, MAP_HUGETLB) again: 0xffffffffffffffff - FAILED # mmap(HIGH_ADDR, MAP_FIXED \| MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(-1, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(-1, MAP_HUGETLB) again: 0xffffffffffffffff - FAILED # mmap(ADDR_SWITCH_HINT - PAGE_SIZE, 2HUGETLB_SIZE, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(ADDR_SWITCH_HINT , 2HUGETLB_SIZE, MAP_FIXED \| MAP_HUGETLB): 0xffffffffffffffff - FAILED Link: https://lkml.kernel.org/r/20250327114813.25980-1-liwang@redhat.com Signed-off-by: Li Wang <liwang@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	memblock: don't release high memory to page allocator when HIGHMEM is off	Mike Rapoport (Microsoft)
	Nathan Chancellor reports the following crash on a MIPS system with CONFIG_HIGHMEM=n: Linux version 6.14.0-rc6-00359-g6faea3422e3b (nathan@ax162) (mips-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP Fri Mar 21 08:12:02 MST 2025 earlycon: uart8250 at I/O port 0x3f8 (options '38400n8') printk: legacy bootconsole [uart8250] enabled Config serial console: console=ttyS0,38400n8r CPU0 revision is: 00019300 (MIPS 24Kc) FPU revision is: 00739300 MIPS: machine is mti,malta Software DMA cache coherency enabled Initial ramdisk at: 0x8fad0000 (5360128 bytes) OF: reserved mem: Reserved memory: No reserved-memory node in the DT Primary instruction cache 2kB, VIPT, 2-way, linesize 16 bytes. Primary data cache 2kB, 2-way, VIPT, no aliases, linesize 16 bytes Zone ranges: DMA [mem 0x0000000000000000-0x0000000000ffffff] Normal [mem 0x0000000001000000-0x000000001fffffff] Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000000000000-0x000000000fffffff] node 0: [mem 0x0000000090000000-0x000000009fffffff] Initmem setup node 0 [mem 0x0000000000000000-0x000000009fffffff] On node 0, zone Normal: 16384 pages in unavailable ranges random: crng init done percpu: Embedded 3 pages/cpu s18832 r8192 d22128 u49152 Kernel command line: rd_start=0xffffffff8fad0000 rd_size=5360128 console=ttyS0,38400n8r printk: log buffer data + meta data: 32768 + 102400 = 135168 bytes Dentry cache hash table entries: 65536 (order: 4, 262144 bytes, linear) Inode-cache hash table entries: 32768 (order: 3, 131072 bytes, linear) Writing ErrCtl register=00000000 Readback ErrCtl register=00000000 Built 1 zonelists, mobility grouping on. Total pages: 16384 mem auto-init: stack:all(zero), heap alloc:off, heap free:off Unhandled kernel unaligned access[#1]: CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc6-00359-g6faea3422e3b #1 Hardware name: mti,malta $ 0 : 00000000 00000001 81cb0880 00129027 $ 4 : 00000001 0000000a 00000002 00129026 $ 8 : ffffdfff 80101e00 00000002 00000000 $12 : 81c9c224 81c63e68 00000002 00000000 $16 : 805b1e00 00025800 81cb0880 00000002 $20 : 00000000 81c63e64 0000000a 81f10000 $24 : 81c63e64 81c63e60 $28 : 81c60000 81c63de0 00000001 81cc9d20 Hi : 00000000 Lo : 00000000 epc : 814a227c __free_pages_ok+0x144/0x3c0 ra : 81cc9d20 memblock_free_all+0x1d4/0x27c Status: 10000002 KERNEL EXL Cause : 00800410 (ExcCode 04) BadVA : 00129026 PrId : 00019300 (MIPS 24Kc) Modules linked in: Process swapper (pid: 0, threadinfo=(ptrval), task=(ptrval), tls=00000000) Stack : 81f10000 805a9e00 81c80000 00000000 00000002 814aa240 000003ff 00000400 00000000 81f10000 81c9c224 00003b1f 81c80000 81c63e60 81ca0000 81c63e64 81f10000 0000000a 0000001f 81cc9d20 81f10000 81cc96d8 00000000 81c80000 81c9c224 81c63e60 81c63e64 00000000 81f10000 00024000 00028000 00025c00 90000000 a0000000 00000002 00000017 00000000 00000000 81f10000 81f10000 ... Call Trace: [<814a227c>] __free_pages_ok+0x144/0x3c0 [<81cc9d20>] memblock_free_all+0x1d4/0x27c [<81cc6764>] mm_core_init+0x100/0x138 [<81cb4ba4>] start_kernel+0x4a0/0x6e4 Code: 1080ffd5 02003825 2467ffff <8ce30000> 7c630500 1060ffd4 00000000 8ce30000 7c630180 The crash happens because commit 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") too eagerly frees high memory to the page allocator even when HIGHMEM is disabled. Make sure that when CONFIG_HIGHMEM=n the high memory is not released to the page allocator. Link: https://lore.kernel.org/all/20250323190647.GA1009914@ax162 Link: https://lkml.kernel.org/r/20250325114928.1791109-3-rppt@kernel.org Reported-by: Nathan Chancellor <nathan@kernel.org> Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm/mm_init: init holes in the end of the memory map for FLATMEM	Mike Rapoport (Microsoft)
	Patch series "mm: fixes for fallouts from mem_init() cleanup". These are the fixes for fallouts from mem_init() cleanup reported by Nathan Chancellor and kbuild. The details are in the commit messages. This patch (of 2): Kernel test robot reports the following crash on 32-bit system with FLATMEM and DEBUG_VM_PGFLAGS enabled: [ 0.478822][ T0] kernel BUG at include/linux/page-flags.h:536! [ 0.479312][ T0] Oops: invalid opcode: 0000 [#1] PREEMPT SMP [ 0.479768][ T0] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc6-00357-g8268af309d07 #1 [ 0.480470][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 0.481260][ T0] EIP: reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.481683][ T0] Code: 5d c3 01 f1 89 c8 ba e1 38 f4 c3 e8 1e 37 8e fc 0f 0b b8 90 e2 62 c4 e8 e2 05 5e fc 01 f1 89 c8 ba be 85 f7 c3 e8 04 37 8e fc <0f> 0b b8 80 e2 62 c4 e8 c8 05 5e fc 55 89 e5 53 57 56 83 ec 10 89 [ 0.483177][ T0] EAX: 00000000 EBX: c425df50 ECX: 00000000 EDX: 00000000 [ 0.483712][ T0] ESI: 017ffc00 EDI: ffffffff EBP: c425df34 ESP: c425df2c [ 0.484248][ T0] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046 [ 0.484846][ T0] CR0: 80050033 CR2: 00000000 CR3: 04b48000 CR4: 00000090 [ 0.485376][ T0] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 0.485907][ T0] DR6: fffe0ff0 DR7: 00000400 [ 0.486253][ T0] Call Trace: [ 0.486494][ T0] ? __die_body (arch/x86/kernel/dumpstack.c:478) [ 0.486822][ T0] ? die (arch/x86/kernel/dumpstack.c:?) [ 0.487099][ T0] ? do_trap (arch/x86/kernel/traps.c:? arch/x86/kernel/traps.c:197) [ 0.487409][ T0] ? do_error_trap (arch/x86/kernel/traps.c:217) [ 0.487752][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.488153][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301) [ 0.488490][ T0] ? handle_invalid_op (arch/x86/kernel/traps.c:254) [ 0.488869][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.489271][ T0] ? exc_invalid_op (arch/x86/kernel/traps.c:316) [ 0.489619][ T0] ? handle_exception (arch/x86/entry/entry_32.S:1055) [ 0.489996][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301) [ 0.490332][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.490733][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301) [ 0.491068][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.491470][ T0] memmap_init_reserved_pages (mm/memblock.c:2203) [ 0.491887][ T0] free_low_memory_core_early (mm/memblock.c:?) [ 0.492302][ T0] memblock_free_all (mm/memblock.c:2272 include/linux/atomic/atomic-arch-fallback.h:546 include/linux/atomic/atomic-long.h:123 include/linux/atomic/atomic-instrumented.h:3261 include/linux/mm.h:67 mm/memblock.c:2273) [ 0.492659][ T0] mem_init (arch/x86/mm/init_32.c:735) [ 0.492952][ T0] mm_core_init (mm/mm_init.c:2730) [ 0.493271][ T0] start_kernel (init/main.c:958) [ 0.493604][ T0] i386_start_kernel (arch/x86/kernel/head32.c:79) [ 0.493969][ T0] startup_32_smp (arch/x86/kernel/head_32.S:292) The crash happens because after commit 8268af309d07 ("arch, mm: set max_mapnr when allocating memory map for FLATMEM") max_mapnr is rounded up to MAX_ORDER_NR_PAGES and the pages in the end of the memory map are passing pfn_valid() check in reserve_bootmem_region(). Make sure that that pages in the end of the memory map are initialized, just like the pages in the end of the last section for SPARSEMEM. Link: https://lkml.kernel.org/r/20250325114928.1791109-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20250325114928.1791109-2-rppt@kernel.org Fixes: 8268af309d07 ("arch, mm: set max_mapnr when allocating memory map for FLATMEM") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503241424.d16223ec-lkp@intel.com Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	MAINTAINERS: add peterx as userfaultfd reviewer	Peter Xu
	Add an entry for userfaultfd and make myself a reviewer of it, just in case it helps people manage the cc list. I named it MEMORY USERFAULTFD, could be a bad name, but then it can be together with the MEMORY* entries when everything is in alphabetic order, which is definitely a benefit. The line may not change much on how I'd work with userfaultfd; I think I'll do the same as before.. But maybe it still, more or less, adds some responsibility on top, indeed. [akpm@linux-foundation.org: add include/linux/userfaultfd_k.h, per Mike] [akpm@linux-foundation.org: fix misordering] Link: https://lkml.kernel.org/r/20250322002124.131736-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm/page_alloc: replace flag check with PageHWPoison() in check_new_page_bad()	Ye Liu
	This patch replaces the direct check for the __PG_HWPOISON flag with the PageHWPoison() macro, improving code readability and maintaining consistency with other parts of the memory management code. Link: https://lkml.kernel.org/r/20250320063346.489030-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm/damon/core: simplify control flow in damon_register_ops()	Taotao Chen
	The function logic is not complex, so using goto is unnecessary. Replace it with a straightforward if-else to simplify control flow and improve readability. Link: https://lkml.kernel.org/r/Z9vxcPCw8tDsjKw1@OneApple Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm/kasan: use SLAB_NO_MERGE flag instead of an empty constructor	Harry Yoo
	Use SLAB_NO_MERGE flag to prevent merging instead of providing an empty constructor. Using an empty constructor in this manner is an abuse of slab interface. The SLAB_NO_MERGE flag should be used with caution, but in this case, it is acceptable as the cache is intended solely for debugging purposes. No functional changes intended. Link: https://lkml.kernel.org/r/20250318015926.1629748-1-harry.yoo@oracle.com Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Alexander Potapenko <glider@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm: page_alloc: fix defrag_mode's retry & OOM path	Johannes Weiner
	Brendan points out that defrag_mode doesn't properly clear ALLOC_NOFRAGMENT on its last-ditch attempt to allocate. But looking closer, the problem is actually more severe: it doesn't actually check whether it's already retried, and keeps looping. This means the OOM path is never taken, and the thread can loop indefinitely. This is verified with an intentional OOM test on defrag_mode=1, which results in the machine hanging. After this patch, it triggers the OOM kill reliably and recovers. Clear ALLOC_NOFRAGMENT properly, and only retry once. Link: https://lkml.kernel.org/r/20250401041231.GA2117727@cmpxchg.org Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm/mremap: do not set vrm->vma NULL immediately prior to checking it	Lorenzo Stoakes
	This seems rather unwise. If we cannot merge, extend, then we need to recall the original VMA to see if we need to uncharge. If we do need to, do so. Link: https://lkml.kernel.org/r/b2fb6b9c-376d-4e9b-905e-26d847fd3865@lucifer.local Fixes: d5c8aec0542e ("mm/mremap: initial refactor of move_vma()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-=by: "Lai, Yi" <yi1.lai@linux.intel.com> Closes: https://lore.kernel.org/linux-mm/Z+lcvEIHMLiKVR1i@ly-workstation/ Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm: zswap: fix crypto_free_acomp() deadlock in zswap_cpu_comp_dead()	Yosry Ahmed
	Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock (through crypto_exit_scomp_ops_async()). On the other hand, crypto_alloc_acomp_node() holds the scomp_lock (through crypto_scomp_init_tfm()), and then allocates memory. If the allocation results in reclaim, we may attempt to hold the per-CPU acomp_ctx mutex. The above dependencies can cause an ABBA deadlock. For example in the following scenario: (1) Task A running on CPU #1: crypto_alloc_acomp_node() Holds scomp_lock Enters reclaim Reads per_cpu_ptr(pool->acomp_ctx, 1) (2) Task A is descheduled (3) CPU #1 goes offline zswap_cpu_comp_dead(CPU #1) Holds per_cpu_ptr(pool->acomp_ctx, 1)) Calls crypto_free_acomp() Waits for scomp_lock (4) Task A running on CPU #2: Waits for per_cpu_ptr(pool->acomp_ctx, 1) // Read on CPU #1 DEADLOCK Since there is no requirement to call crypto_free_acomp() with the per-CPU acomp_ctx mutex held in zswap_cpu_comp_dead(), move it after the mutex is unlocked. Also move the acomp_request_free() and kfree() calls for consistency and to avoid any potential sublte locking dependencies in the future. With this, only setting acomp_ctx fields to NULL occurs with the mutex held. This is similar to how zswap_cpu_comp_prepare() only initializes acomp_ctx fields with the mutex held, after performing all allocations before holding the mutex. Opportunistically, move the NULL check on acomp_ctx so that it takes place before the mutex dereference. Link: https://lkml.kernel.org/r/20250226185625.2672936-1-yosry.ahmed@linux.dev Fixes: 12dcb0ef5406 ("mm: zswap: properly synchronize freeing resources during CPU hotunplug") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reported-by: syzbot+1a517ccfcbc6a7ab0f82@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67bcea51.050a0220.bbfd1.0096.GAE@google.com/ Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Nhat Pham <nphamcs@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Murphy <lists@colorremedies.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm/hugetlb: move hugetlb_sysctl_init() to the __init section	Marc Herbert
	hugetlb_sysctl_init() is only invoked once by an __init function and is merely a wrapper around another __init function so there is not reason to keep it. Fixes the following warning when toning down some GCC inline options: WARNING: modpost: vmlinux: section mismatch in reference: hugetlb_sysctl_init+0x1b (section: .text) -> __register_sysctl_init (section: .init.text) Link: https://lkml.kernel.org/r/20250319060041.2737320-1-marc.herbert@linux.intel.com Signed-off-by: Marc Herbert <Marc.Herbert@linux.intel.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm: page_isolation: avoid calling folio_hstate() without hugetlb_lock	Liu Shixin
	I found a NULL pointer dereference as followed: BUG: kernel NULL pointer dereference, address: 0000000000000028 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 5 UID: 0 PID: 5964 Comm: sh Kdump: loaded Not tainted 6.13.0-dirty #20 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1. RIP: 0010:has_unmovable_pages+0x184/0x360 ... Call Trace: <TASK> set_migratetype_isolate+0xd1/0x180 start_isolate_page_range+0xd2/0x170 alloc_contig_range_noprof+0x101/0x660 alloc_contig_pages_noprof+0x238/0x290 alloc_gigantic_folio.isra.0+0xb6/0x1f0 only_alloc_fresh_hugetlb_folio.isra.0+0xf/0x60 alloc_pool_huge_folio+0x80/0xf0 set_max_huge_pages+0x211/0x490 __nr_hugepages_store_common+0x5f/0xe0 nr_hugepages_store+0x77/0x80 kernfs_fop_write_iter+0x118/0x200 vfs_write+0x23c/0x3f0 ksys_write+0x62/0xe0 do_syscall_64+0x5b/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e As has_unmovable_pages() call folio_hstate() without hugetlb_lock, there is a race to free the HugeTLB page between PageHuge() and folio_hstate(). There is no need to add hugetlb_lock here as the HugeTLB page can be freed in lot of places. So it's enough to unfold folio_hstate() and add a check to avoid NULL pointer dereference for hugepage_migration_supported(). Link: https://lkml.kernel.org/r/20250122061151.578768-1-liushixin2@huawei.com Fixes: 464c7ffbcb16 ("mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm/hugetlb_vmemmap: fix memory loads ordering	Yu Zhao
	Using x86_64 as an example, for a 32KB struct page[] area describing a 2MB hugeTLB, HVO reduces the area to 4KB by the following steps: 1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs; 2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped by PTE 0, and at the same time change the permission from r/w to r/o; 3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB to 4KB. However, the following race can happen due to improperly memory loads ordering: CPU 1 (HVO) CPU 2 (speculative PFN walker) page_ref_freeze() synchronize_rcu() rcu_read_lock() page_is_fake_head() is false vmemmap_remap_pte() XXX: struct page[] becomes r/o page_ref_unfreeze() page_ref_count() is not zero atomic_add_unless(&page->_refcount) XXX: try to modify r/o struct page[] Specifically, page_is_fake_head() must be ordered after page_ref_count() on CPU 2 so that it can only return true for this case, to avoid the later attempt to modify r/o struct page[]. This patch adds the missing memory barrier and makes the tests on page_is_fake_head() and page_ref_count() done in the proper order. Link: https://lkml.kernel.org/r/20250108074822.722696-1-yuzhao@google.com Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Will Deacon <will@kernel.org> Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/ Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Will Deacon <will@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	mm/userfaultfd: fix release hang over concurrent GUP	Peter Xu
	This patch should fix a possible userfaultfd release() hang during concurrent GUP. This problem was initially reported by Dimitris Siakavaras in July 2023 [1] in a firecracker use case. Firecracker has a separate process handling page faults remotely, and when the process releases the userfaultfd it can race with a concurrent GUP from KVM trying to fault in a guest page during the secondary MMU page fault process. A similar problem was reported recently again by Jinjiang Tu in March 2025 [2], even though the race happened this time with a mlockall() operation, which does GUP in a similar fashion. In 2017, commit 656710a60e36 ("userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS") was trying to fix this issue. AFAIU, that fixes well the fault paths but may not work yet for GUP. In GUP, the issue is NOPAGE will be almost treated the same as "page fault resolved" in faultin_page(), then the GUP will follow page again, seeing page missing, and it'll keep going into a live lock situation as reported. This change makes core mm return RETRY instead of NOPAGE for both the GUP and fault paths, proactively releasing the mmap read lock. This should guarantee the other release thread make progress on taking the write lock and avoid the live lock even for GUP. When at it, rearrange the comments to make sure it's uptodate. [1] https://lore.kernel.org/r/79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr [2] https://lore.kernel.org/r/20250307072133.3522652-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20250312145131.1143062-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Jinjiang Tu <tujinjiang@huawei.com> Cc: Dimitris Siakavaras <jimsiak@cslab.ece.ntua.gr> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01	Merge tag 'i2c-for-6.15-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: "i2c-core updates (collected by Wolfram): - remove last user and unexport i2c_of_match_device() - irq usage cleanup from Jiri i2c-host updates (collected by Andi): Refactoring and cleanups: - octeon, cadence, i801, pasemi, mlxbf, bcm-iproc: general refactorings - octeon: remove 10-bit address support Improvements: - amd-asf: improved error handling - designware: use guard(mutex) - amd-asf, designware: update naming to follow latest specs - cadence: fix cleanup path in probe - i801: use MMIO and I/O mapping helpers to access registers - pxa: handle error after clk_prepare_enable New features: - added i2c_10bit_addr__from_msg() and updated multiple drivers - omap: added multiplexer state handling - qcom-geni: update frequency configuration - qup: introduce DMA usage policy New hardware support: - exynos: add support for Samsung exynos7870 - k1: add support for spacemit k1 (new driver) - imx: add support for i.mx94 lpi2c - rk3x: add support for rk3562 - designware: add support for Renesas RZ/N1D Multiplexers: - ltc4306, reg: fix assignment in platform_driver structure at24 eeprom updates (collected by Bartosz): - add two new compatible entries to the DT binding document - drop of_match_ptr() and ACPI_PTR() macros" tag 'i2c-for-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (50 commits) dt-bindings: i2c: snps,designware-i2c: describe Renesas RZ/N1D variant irqdomain: i2c: Switch to irq_find_mapping() i2c: iproc: Refactor prototype and remove redundant error checks i2c: qcom-geni: Update i2c frequency table to match hardware guidance i2c: mlxbf: Use readl_poll_timeout_atomic() for polling i2c: pasemi: Add registers bits and switch to BIT() i2c: k1: Initialize variable before use i2c: spacemit: add support for SpacemiT K1 SoC dt-bindings: i2c: spacemit: add support for K1 SoC i2c: omap: Add support for setting mux dt-bindings: i2c: omap: Add mux-states property i2c: octeon: remove 10-bit addressing support i2c: octeon: fix return commenting i2c: i801: Use MMIO if available i2c: i801: Switch to iomapped register access i2c: i801: Improve too small kill wait time in i801_check_post i2c: i801: Move i801_wait_intr and i801_wait_byte_done in the code i2c: i801: Cosmetic improvements i2c: cadence: Move reset_control_assert after pm_runtime_set_suspended in probe error path i2c: cadence: Simplify using devm_clk_get_enabled() ...
2025-04-01	selftests: ublk: kublk: fix an error log line	Uday Shankar
	When doing io_uring operations using liburing, errno is not used to indicate errors, so the %m format specifier does not provide any relevant information for failed io_uring commands. Fix a log line emitted on get_params failure to translate the error code returned in the cqe->res field instead. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250401-ublk_selftests-v1-2-98129c9bc8bb@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-01	selftests: ublk: kublk: use ioctl-encoded opcodes	Uday Shankar
	There are a couple of places in the kublk selftests ublk server which use the legacy ublk opcodes. These operations fail (with -EOPNOTSUPP) on a kernel compiled without CONFIG_BLKDEV_UBLK_LEGACY_OPCODES set. We could easily require it to be set as a prerequisite for these selftests, but since new applications should not be using the legacy opcodes, use the ioctl-encoded opcodes everywhere in kublk. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250401-ublk_selftests-v1-1-98129c9bc8bb@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-01	x86/fred: Fix system hang during S4 resume with FRED enabled	Xin Li (Intel)
	Upon a wakeup from S4, the restore kernel starts and initializes the FRED MSRs as needed from its perspective. It then loads a hibernation image, including the image kernel, and attempts to load image pages directly into their original page frames used before hibernation unless those frames are currently in use. Once all pages are moved to their original locations, it jumps to a "trampoline" page in the image kernel. At this point, the image kernel takes control, but the FRED MSRs still contain values set by the restore kernel, which may differ from those set by the image kernel before hibernation. Therefore, the image kernel must ensure the FRED MSRs have the same values as before hibernation. Since these values depend only on the location of the kernel text and data, they can be recomputed from scratch. Reported-by: Xi Pardee <xi.pardee@intel.com> Reported-by: Todd Brandt <todd.e.brandt@intel.com> Tested-by: Todd Brandt <todd.e.brandt@intel.com> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250401075728.3626147-1-xin@zytor.com
2025-04-01	Documentation/EDAC: Fix warning document isn't included in any toctree	Shiju Jose
	Fix the build (htmldocs) warning: Documentation/edac/index.rst: WARNING: document isn't included in any toctree. Fixes: db99ea5f2c03 ("EDAC: Add support for EDAC device features control") Closes: https://lore.kernel.org/all/20250228185102.15842f8b@canb.auug.org.au/ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250401115823.573-1-shiju.jose@huawei.com
2025-04-01	io_uring/zcrx: return early from io_zcrx_recv_skb if readlen is 0	David Wei
	When readlen is set for a recvzc request, tcp_read_sock() will call io_zcrx_recv_skb() one final time with len == desc->count == 0. This is caused by the !desc->count check happening too late. The offset + 1 != skb->len happens earlier and causes the while loop to continue. Fix this in io_zcrx_recv_skb() instead of tcp_read_sock(). Return early if len is 0 i.e. the read is done. Fixes: 6699ec9a23f8 ("io_uring/zcrx: add a read limit to recvzc requests") Signed-off-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/20250401195355.1613813-1-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-01	Merge tag 'dmaengine-6.15-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine Pull dmaengine updates from Vinod Koul: "The dmaengine subsystem updates for this cycle consist of a new driver (Microchip) along with couple of yaml binding conversions, core api updates and bunch of driver updates etc. New HW support: - Microchip sama7d65 dma controller - Yaml conversion of atmel dma binding and Freescale Elo DMA Controller binding Core: - Remove device_prep_dma_imm_data() API as users are removed - Reduce scope of some less frequently used DMA request channel APIs with aim to cleanup these in future Updates: - Drop Fenghua Yu from idxd maintainers, as he changed jobs - AMD ptdma support for multiqueue and ae4dma deprecated PCI IDs removal" * tag 'dmaengine-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (29 commits) dmaengine: ptdma: Utilize the AE4DMA engine's multi-queue functionality dmaengine: ae4dma: Use the MSI count and its corresponding IRQ number dmaengine: ae4dma: Remove deprecated PCI IDs dmaengine: Remove device_prep_dma_imm_data from struct dma_device dmaengine: ti: edma: support sw triggered chans in of_edma_xlate() dmaengine: ti: k3-udma: Enable second resource range for BCDMA and PKTDMA dmaengine: fsl-edma: free irq correctly in remove path dmaengine: fsl-edma: cleanup chan after dma_async_device_unregister dt-bindings: dma: snps,dw-axi-dmac: Allow devices to be marked as noncoherent dmaengine: dmatest: Fix dmatest waiting less when interrupted dt-bindings: dma: Convert fsl,elo*-dma to YAML dt-bindings: dma: fsl-mxs-dma: Add compatible string for i.MX8 chips dmaengine: Fix typo in comment dmaengine: ti: k3-udma-glue: Drop skip_fdq argument from k3_udma_glue_reset_rx_chn dmaengine: bcm2835-dma: fix warning when CONFIG_PM=n dt-bindings: dma: fsl,edma: Add i.MX94 support dt-bindings: dma: atmel: add microchip,sama7d65-dma dmaengine: img-mdc: remove incorrect of_match_ptr annotation dmaengine: idxd: Delete unnecessary NULL check dmaengine: pxa: Enable compile test ...
2025-04-01	Merge tag 'phy-for-6.15' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy Pull phy updates from Vinod Koul: "A fairly moderate sized request for the generic phy subsystem with some new device and driver support along with driver updates with Samsung and Qualcomm ones being major ones. New HW Support: - Qualcomm X1P42100 PCIe Gen4x4, QCS615 qmp usbc, PCIe UNIPHY 28LP driver, SM8750 QMP UFS PHY - Rockchip rk3576 hdptx, rk3562 naneng-combo support - Samsung MIPI D-/C-PHY driver, ExynosAutov920 ufs phy driver Updates: - Samsung USB3 Type-C lane orientation detection and configuration for Google gs101 - Qualcomm support for dual lane PHY support for QCS8300 SoC" * tag 'phy-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy: (47 commits) phy: rockchip-naneng-combo: Support rk3562 dt-bindings: phy: rockchip: Add rk3562 naneng-combophy compatible phy: rockchip: Add Samsung MIPI D-/C-PHY driver dt-bindings: phy: Add Rockchip MIPI C-/D-PHY schema phy: qcom: uniphy-28lp: add COMMON_CLK dependency phy: rockchip: usbdp: Remove unnecessary bool conversion phy: rockchip: usbdp: Avoid call hpd_event_trigger in dp_phy_init phy: rockchip: usbdp: Only verify link rates/lanes/voltage when the corresponding set flags are set phy: qcom-qmp-pcie: add dual lane PHY support for QCS8300 dt-bindings: phy: qcom,sc8280xp-qmp-pcie-phy: Document the QCS8300 QMP PCIe PHY Gen4 x2 phy: qcom-qmp-ufs: Add PHY Configuration support for sm8750 dt-bindings: phy: qcom,sc8280xp-qmp-ufs-phy: document the SM8750 QMP UFS PHY phy: qcom: Introduce PCIe UNIPHY 28LP driver dt-bindings: phy: qcom,uniphy-pcie: Document PCIe uniphy phy: qcom: qmp-usbc: Add qmp configuration for QCS615 phy: freescale: imx8m-pcie: assert phy reset and perst in power off phy: freescale: imx8m-pcie: cleanup reset logic phy: core: Remove unused phy_pm_runtime_(allow\|forbid) dt-bindings: phy: document Allwinner A523 USB-2.0 PHY phy: phy-rockchip-samsung-hdptx: Add support for RK3576 ...
2025-04-01	Merge tag 'soundwire-6.15-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire Pull soundwire updates from Vinod Koul: - Support for SoundWire Bulk Register Access (BRA) protocol in core along with Intel driver support and ASoC bits required - AMD driver updates and support for ACP 7.0 and 7.1 platforms * tag 'soundwire-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire: (28 commits) soundwire: take in count the bandwidth of a prepared stream ASoC: rt711-sdca: add DP0 support soundwire: debugfs: add interface for BPT/BRA transfers ASoC: SOF: Intel: hda-sdw-bpt: add CHAIN_DMA support soundwire: intel_ace2x: add BPT send_async/wait callbacks soundwire: intel: add BPT context definition ASoC: SOF: Intel: hda-sdw-bpt: add helpers for SoundWire BPT DMA soundwire: intel_auxdevice: add indirection for BPT send_async/wait soundwire: cadence: add BTP/BRA helpers to format data soundwire: bus: add bpt_stream pointer soundwire: bus: add send_async/wait APIs for BPT protocol soundwire: stream: reuse existing code for BPT stream soundwire: stream: special-case the bus compute_params() routine soundwire: stream: extend sdw_alloc_stream() to take 'type' parameter soundwire: extend sdw_stream_type to BPT soundwire: cadence: add BTP support for DP0 Documentation: driver: add SoundWire BRA description soundwire: amd: change the log level for command response log soundwire: slave: fix an OF node reference leak in soundwire slave device soundwire: Use str_enable_disable-like helpers ...
2025-04-01	x86/platform/iosf_mbi: Remove unused ↵	Dr. David Alan Gilbert
	iosf_mbi_unregister_pmic_bus_access_notifier() The last use of iosf_mbi_unregister_pmic_bus_access_notifier() was removed in 2017 by: a5266db4d314 ("drm/i915: Acquire PUNIT->PMIC bus for intel_uncore_forcewake_reset()") Remove it. (Note that the '_unlocked' version is still used.) Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: intel-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Link: https://lore.kernel.org/r/20241225175010.91783-1-linux@treblig.org
2025-04-01	Merge tag 'char-misc-6.15-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char / misc / IIO driver updates from Greg KH: "Here is the big set of char, misc, iio, and other smaller driver subsystems for 6.15-rc1. Lots of stuff in here, including: - loads of IIO changes and driver updates - counter driver updates - w1 driver updates - faux conversions for some drivers that were abusing the platform bus interface - coresight driver updates - rust miscdevice binding updates based on real-world-use - other minor driver updates All of these have been in linux-next with no reported issues for quite a while" * tag 'char-misc-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (292 commits) samples: rust_misc_device: fix markup in top-level docs Coresight: Fix a NULL vs IS_ERR() bug in probe misc: lis3lv02d: convert to use faux_device tlclk: convert to use faux_device regulator: dummy: convert to use the faux device interface bus: mhi: host: Fix race between unprepare and queue_buf coresight: configfs: Constify struct config_item_type doc: iio: ad7380: describe offload support iio: ad7380: add support for SPI offload iio: light: Add check for array bounds in veml6075_read_int_time_ms iio: adc: ti-ads7924 Drop unnecessary function parameters staging: iio: ad9834: Use devm_regulator_get_enable() staging: iio: ad9832: Use devm_regulator_get_enable() iio: gyro: bmg160_spi: add of_match_table dt-bindings: iio: adc: Add i.MX94 and i.MX95 support iio: adc: ad7768-1: remove unnecessary locking Documentation: ABI: add wideband filter type to sysfs-bus-iio iio: adc: ad7768-1: set MOSI idle state to prevent accidental reset iio: adc: ad7768-1: Fix conversion result sign iio: adc: ad7124: Benefit of dev = indio_dev->dev.parent in ad7124_parse_channel_config() ...
2025-04-01	Merge tag 'driver-core-6.15-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updatesk from Greg KH: "Here is the big set of driver core updates for 6.15-rc1. Lots of stuff happened this development cycle, including: - kernfs scaling changes to make it even faster thanks to rcu - bin_attribute constify work in many subsystems - faux bus minor tweaks for the rust bindings - rust binding updates for driver core, pci, and platform busses, making more functionaliy available to rust drivers. These are all due to people actually trying to use the bindings that were in 6.14. - make Rafael and Danilo full co-maintainers of the driver core codebase - other minor fixes and updates" * tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (52 commits) rust: platform: require Send for Driver trait implementers rust: pci: require Send for Driver trait implementers rust: platform: impl Send + Sync for platform::Device rust: pci: impl Send + Sync for pci::Device rust: platform: fix unrestricted &mut platform::Device rust: pci: fix unrestricted &mut pci::Device rust: device: implement device context marker rust: pci: use to_result() in enable_device_mem() MAINTAINERS: driver core: mark Rafael and Danilo as co-maintainers rust/kernel/faux: mark Registration methods inline driver core: faux: only create the device if probe() succeeds rust/faux: Add missing parent argument to Registration::new() rust/faux: Drop #[repr(transparent)] from faux::Registration rust: io: fix devres test with new io accessor functions rust: io: rename `io::Io` accessors kernfs: Move dput() outside of the RCU section. efi: rci2: mark bin_attribute as __ro_after_init rapidio: constify 'struct bin_attribute' firmware: qemu_fw_cfg: constify 'struct bin_attribute' powerpc/perf/hv-24x7: Constify 'struct bin_attribute' ...
2025-04-01	Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - The series "powerpc/crash: use generic crashkernel reservation" from Sourabh Jain changes powerpc's kexec code to use more of the generic layers. - The series "get_maintainer: report subsystem status separately" from Vlastimil Babka makes some long-requested improvements to the get_maintainer output. - The series "ucount: Simplify refcounting with rcuref_t" from Sebastian Siewior cleans up and optimizing the refcounting in the ucount code. - The series "reboot: support runtime configuration of emergency hw_protection action" from Ahmad Fatoum improves the ability for a driver to perform an emergency system shutdown or reboot. - The series "Converge on using secs_to_jiffies() part two" from Easwar Hariharan performs further migrations from msecs_to_jiffies() to secs_to_jiffies(). - The series "lib/interval_tree: add some test cases and cleanup" from Wei Yang permits more userspace testing of kernel library code, adds some more tests and performs some cleanups. - The series "hung_task: Dump the blocking task stacktrace" from Masami Hiramatsu arranges for the hung_task detector to dump the stack of the blocking task and not just that of the blocked task. - The series "resource: Split and use DEFINE_RES() macros" from Andy Shevchenko provides some cleanups to the resource definition macros. - Plus the usual shower of singleton patches - please see the individual changelogs for details. tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) mailmap: consolidate email addresses of Alexander Sverdlin fs/procfs: fix the comment above proc_pid_wchan() relay: use kasprintf() instead of fixed buffer formatting resource: replace open coded variant of DEFINE_RES() resource: replace open coded variants of DEFINE_RES_*_NAMED() resource: replace open coded variant of DEFINE_RES_NAMED_DESC() resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED() samples: add hung_task detector mutex blocking sample hung_task: show the blocker task if the task is hung on mutex kexec_core: accept unaccepted kexec segments' destination addresses watchdog/perf: optimize bytes copied and remove manual NUL-termination lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap() lib/interval_tree: skip the check before go to the right subtree lib/interval_tree: add test case for span iteration lib/interval_tree: add test case for interval_tree_iter_xxx() helpers lib/rbtree: add random seed lib/rbtree: split tests lib/rbtree: enable userland test suite for rbtree related data structure checkpatch: describe --min-conf-desc-length scripts/gdb/symbols: determine KASLR offset on s390 ...
2025-04-01	cpufreq: Reference count policy in cpufreq_update_limits()	Rafael J. Wysocki
	Since acpi_processor_notify() can be called before registering a cpufreq driver or even in cases when a cpufreq driver is not registered at all, cpufreq_update_limits() needs to check if a cpufreq driver is present and prevent it from being unregistered. For this purpose, make it call cpufreq_cpu_get() to obtain a cpufreq policy pointer for the given CPU and reference count the corresponding policy object, if present. Fixes: 5a25e3f7cc53 ("cpufreq: intel_pstate: Driver-specific handling of _PPC updates") Closes: https://lore.kernel.org/linux-acpi/Z-ShAR59cTow0KcR@mail-itl Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Link: https://patch.msgid.link/1928789.tdWV9SEqCh@rjwysocki.net
2025-04-01	Merge tag 'mm-stable-2025-03-30-16-52' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "Enable strict percpu address space checks" from Uros Bizjak uses x86 named address space qualifiers to provide compile-time checking of percpu area accesses. This has caused a small amount of fallout - two or three issues were reported. In all cases the calling code was found to be incorrect. - The series "Some cleanup for memcg" from Chen Ridong implements some relatively monir cleanups for the memcontrol code. - The series "mm: fixes for device-exclusive entries (hmm)" from David Hildenbrand fixes a boatload of issues which David found then using device-exclusive PTE entries when THP is enabled. More work is needed, but this makes thins better - our own HMM selftests now succeed. - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed remove the z3fold and zbud implementations. They have been deprecated for half a year and nobody has complained. - The series "mm: further simplify VMA merge operation" from Lorenzo Stoakes implements numerous simplifications in this area. No runtime effects are anticipated. - The series "mm/madvise: remove redundant mmap_lock operations from process_madvise()" from SeongJae Park rationalizes the locking in the madvise() implementation. Performance gains of 20-25% were observed in one MADV_DONTNEED microbenchmark. - The series "Tiny cleanup and improvements about SWAP code" from Baoquan He contains a number of touchups to issues which Baoquan noticed when working on the swap code. - The series "mm: kmemleak: Usability improvements" from Catalin Marinas implements a couple of improvements to the kmemleak user-visible output. - The series "mm/damon/paddr: fix large folios access and schemes handling" from Usama Arif provides a couple of fixes for DAMON's handling of large folios. - The series "mm/damon/core: fix wrong and/or useless damos_walk() behaviors" from SeongJae Park fixes a few issues with the accuracy of kdamond's walking of DAMON regions. - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo Stoakes changes the interaction between framebuffer deferred-io and core MM. No functional changes are anticipated - this is preparatory work for the future removal of page structure fields. - The series "mm/damon: add support for hugepage_size DAMOS filter" from Usama Arif adds a DAMOS filter which permits the filtering by huge page sizes. - The series "mm: permit guard regions for file-backed/shmem mappings" from Lorenzo Stoakes extends the guard region feature from its present "anon mappings only" state. The feature now covers shmem and file-backed mappings. - The series "mm: batched unmap lazyfree large folios during reclamation" from Barry Song cleans up and speeds up the unmapping for pte-mapped large folios. - The series "reimplement per-vma lock as a refcount" from Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for pulling it out were largely bogus and that change made the code more messy. This patchset provides small (0-10%) improvements on one microbenchmark. - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and improves" from SeongJae Park does some maintenance work on the DAMON docs. - The series "hugetlb/CMA improvements for large systems" from Frank van der Linden addresses a pile of issues which have been observed when using CMA on large machines. - The series "mm/damon: introduce DAMOS filter type for unmapped pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the page's mapped/unmapped status. - The series "zsmalloc/zram: there be preemption" from Sergey Senozhatsky teaches zram to run its compression and decompression operations preemptibly. - The series "selftests/mm: Some cleanups from trying to run them" from Brendan Jackman fixes a pile of unrelated issues which Brendan encountered while runnimg our selftests. - The series "fs/proc/task_mmu: add guard region bit to pagemap" from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to determine whether a particular page is a guard page. - The series "mm, swap: remove swap slot cache" from Kairui Song removes the swap slot cache from the allocation path - it simply wasn't being effective. - The series "mm: cleanups for device-exclusive entries (hmm)" from David Hildenbrand implements a number of unrelated cleanups in this code. - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual implements a number of preparatoty cleanups to the GENERIC_PTDUMP Kconfig logic. - The series "mm/damon: auto-tune aggregation interval" from SeongJae Park implements a feedback-driven automatic tuning feature for DAMON's aggregation interval tuning. - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did this in preparation for implementing lazy mmu mode for arm64 to optimize vmalloc. - The series "mm/page_alloc: Some clarifications for migratetype fallback" from Brendan Jackman reworks some commentary to make the code easier to follow. - The series "page_counter cleanup and size reduction" from Shakeel Butt cleans up the page_counter code and fixes a size increase which we accidentally added late last year. - The series "Add a command line option that enables control of how many threads should be used to allocate huge pages" from Thomas Prescher does that. It allows the careful operator to significantly reduce boot time by tuning the parallalization of huge page initialization. - The series "Fix calculations in trace_balance_dirty_pages() for cgwb" from Tang Yizhou fixes the tracing output from the dirty page balancing code. - The series "mm/damon: make allow filters after reject filters useful and intuitive" from SeongJae Park improves the handling of allow and reject filters. Behaviour is made more consistent and the documention is updated accordingly. - The series "Switch zswap to object read/write APIs" from Yosry Ahmed updates zswap to the new object read/write APIs and thus permits the removal of some legacy code from zpool and zsmalloc. - The series "Some trivial cleanups for shmem" from Baolin Wang does as it claims. - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from Alistair Popple regularizes the weird ZONE_DEVICE page refcount handling in DAX, permittig the removal of a number of special-case checks. - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a preparatoty refactoring and cleanup of the mremap() code. - The series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in which we determine whether a large folio is known to be mapped exclusively into a single MM. - The series "mm/damon: add sysfs dirs for managing DAMOS filters based on handling layers" from SeongJae Park adds a couple of new sysfs directories to ease the management of DAMON/DAMOS filters. - The series "arch, mm: reduce code duplication in mem_init()" from Mike Rapoport consolidates many per-arch implementations of mem_init() into code generic code, where that is practical. - The series "mm/damon/sysfs: commit parameters online via damon_call()" from SeongJae Park continues the cleaning up of sysfs access to DAMON internal data. - The series "mm: page_ext: Introduce new iteration API" from Luiz Capitulino reworks the page_ext initialization to fix a boot-time crash which was observed with an unusual combination of compile and cmdline options. - The series "Buddy allocator like (or non-uniform) folio split" from Zi Yan reworks the code to split a folio into smaller folios. The main benefit is lessened memory consumption: fewer post-split folios are generated. - The series "Minimize xa_node allocation during xarry split" from Zi Yan reduces the number of xarray xa_nodes which are generated during an xarray split. - The series "drivers/base/memory: Two cleanups" from Gavin Shan performs some maintenance work on the drivers/base/memory code. - The series "Add tracepoints for lowmem reserves, watermarks and totalreserve_pages" from Martin Liu adds some more tracepoints to the page allocator code. - The series "mm/madvise: cleanup requests validations and classifications" from SeongJae Park cleans up some warts which SeongJae observed during his earlier madvise work. - The series "mm/hwpoison: Fix regressions in memory failure handling" from Shuai Xue addresses two quite serious regressions which Shuai has observed in the memory-failure implementation. - The series "mm: reliable huge page allocator" from Johannes Weiner makes huge page allocations cheaper and more reliable by reducing fragmentation. - The series "Minor memcg cleanups & prep for memdescs" from Matthew Wilcox is preparatory work for the future implementation of memdescs. - The series "track memory used by balloon drivers" from Nico Pache introduces a way to track memory used by our various balloon drivers. - The series "mm/damon: introduce DAMOS filter type for active pages" from Nhat Pham permits users to filter for active/inactive pages, separately for file and anon pages. - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia separates the proactive reclaim statistics from the direct reclaim statistics. - The series "mm/vmscan: don't try to reclaim hwpoison folio" from Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim code. * tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits) mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex() x86/mm: restore early initialization of high_memory for 32-bits mm/vmscan: don't try to reclaim hwpoison folio mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper cgroup: docs: add pswpin and pswpout items in cgroup v2 doc mm: vmscan: split proactive reclaim statistics from direct reclaim statistics selftests/mm: speed up split_huge_page_test selftests/mm: uffd-unit-tests support for hugepages > 2M docs/mm/damon/design: document active DAMOS filter type mm/damon: implement a new DAMOS filter type for active pages fs/dax: don't disassociate zero page entries MM documentation: add "Unaccepted" meminfo entry selftests/mm: add commentary about 9pfs bugs fork: use __vmalloc_node() for stack allocation docs/mm: Physical Memory: Populate the "Zones" section xen: balloon: update the NR_BALLOON_PAGES state hv_balloon: update the NR_BALLOON_PAGES state balloon_compaction: update the NR_BALLOON_PAGES state meminfo: add a per node counter for balloon drivers mm: remove references to folio in __memcg_kmem_uncharge_page() ...
2025-04-01	ASoC: imx-card: Add NULL check in imx_card_probe()	Henry Martin
	devm_kasprintf() returns NULL when memory allocation fails. Currently, imx_card_probe() does not check for this case, which results in a NULL pointer dereference. Add NULL check after devm_kasprintf() to prevent this issue. Fixes: aa736700f42f ("ASoC: imx-card: Add imx-card machine driver") Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20250401142510.29900-1-bsdhenrymartin@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-04-01	nvme-pci: skip nvme_write_sq_db on empty rqlist	Maurizio Lombardi
	nvme_submit_cmds() should check the rqlist before calling nvme_write_sq_db(); if the list is empty, it must return immediately. Fixes: beadf0088501 ("nvme-pci: reverse request order in nvme_queue_rqs") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01	nvme-multipath: change the NVME_MULTIPATH config option	John Meneghini
	Fix up the NVME_MULTIPATH config description so that it accurately describes what it does. Signed-off-by: John Meneghini <jmeneghi@redhat.com> Tested-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01	nvme: update the multipath warning in nvme_init_ns_head	John Meneghini
	The new NVME_MULTIPATH_PARAM config option requires updates to the warning message in nvme_init_ns_head(). Signed-off-by: John Meneghini <jmeneghi@redhat.com> Tested-by: John Meneghini <jmeneghi@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01	nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io()	Caleb Sander Mateos
	nvme_map_user_request() is called from both nvme_submit_user_cmd() and nvme_uring_cmd_io(). But the ioucmd branch is only applicable to nvme_uring_cmd_io(). Move it to nvme_uring_cmd_io() and just pass the resulting iov_iter to nvme_map_user_request(). For NVMe passthru operations with fixed buffers, the fixed buffer lookup happens in io_uring_cmd_import_fixed(). But nvme_uring_cmd_io() can return -EAGAIN first from nvme_alloc_user_request() if all tags in the tag set are in use. This ordering difference is observable when using UBLK_U_IO_{,UN}REGISTER_IO_BUF SQEs to modify the fixed buffer table. If the NVMe passthru operation is followed by UBLK_U_IO_UNREGISTER_IO_BUF to unregister the fixed buffer and the NVMe passthru goes async, the fixed buffer lookup will fail because it happens after the unregister. Userspace should not depend on the order in which io_uring issues SQEs submitted in parallel, but it may try submitting the SQEs together and fall back on a slow path if the fixed buffer lookup fails. To make the fast path more likely, do the import before nvme_alloc_user_request(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01	nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request()	Caleb Sander Mateos
	The callers of nvme_map_user_request() (nvme_submit_user_cmd() and nvme_uring_cmd_io()) allocate the request, so have them free it if nvme_map_user_request() fails. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01	nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer	Caleb Sander Mateos
	The vectorized io_uring NVMe passthru opcodes don't yet support fixed buffers. But since userspace can trigger this condition based on the io_uring SQE parameters, it shouldn't cause a kernel warning. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: 23fd22e55b76 ("nvme: wire up fixed buffer support for nvme passthrough") Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01	nvmet: pci-epf: Keep completion queues mapped	Damien Le Moal
	Instead of mapping and unmapping the completion queues memory to the host PCI address space whenever nvmet_pci_epf_cq_work() is called, map a completion queue to the host PCI address space when the completion queue is created with nvmet_pci_epf_create_cq() and unmap it when the completion queue is deleted with nvmet_pci_epf_delete_cq(). This removes the completion queue mapping/unmapping from nvmet_pci_epf_cq_work() and significantly increases performance. For a single job 4K random read QD=1 workload, the IOPS is increased from 23 KIOPS to 25 KIOPS. Some significant throughput increasde for high queue depth and large IOs workloads can also be seen. Since the functions nvmet_pci_epf_map_queue() and nvmet_pci_epf_unmap_queue() are called respectively only from nvmet_pci_epf_create_cq() and nvmet_pci_epf_delete_cq(), these functions are removed and open-coded in their respective call sites. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>