summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-05-12mm/gup: remove page_folio() in memfd_pin_folios()Vishal Moola (Oracle)
We can get the folio directly from the folio batch, so remove the unnecessary page_folio() call. Link: https://lkml.kernel.org/r/20250430010059.892632-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm/gup: remove unnecessary check in memfd_pin_folios()Vishal Moola (Oracle)
Patch series "mm/gup: Cleanup memfd_pin_folios()". A couple straightforward cleanups to memfd_pin_folios() found through code inspection. Saves 124 bytes of kernel text overall and makes the code more readable. This patch (of 2): Commit 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios") checks if filemap_get_folios_contig() returned duplicate folios to prevent multiple attempts at pinning the same folio. Commit 8ab1b1602396 ("mm: fix filemap_get_folios_contig returning batches of identical folios") ensures that filemap_get_folios_contig() returns a batch of distinct folios. We can remove the duplicate folio check to simplify the code and save 58 bytes of text. Link: https://lkml.kernel.org/r/20250430010059.892632-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20250430010059.892632-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Vivek Kasireddy <vivek.kasireddy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm, swap: remove no longer used swap mapping helperKairui Song
This helper existed to fix the circular header dependency issue but it is no longer used since commit 0d40cfe63a2f ("fs: remove folio_file_mapping()"), remove it. Link: https://lkml.kernel.org/r/20250430181052.55698-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joanne Koong <joannelkoong@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Qu Wenruo <wqu@suse.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: move folio_index to mm/swap.h and remove no longer needed helperKairui Song
There are no remaining users of folio_index() outside the mm subsystem. Move it to mm/swap.h to co-locate it with swap_cache_index(), eliminating a forward declaration, and a function call overhead. Also remove the helper that was used to fix circular header dependency issue. Link: https://lkml.kernel.org/r/20250430181052.55698-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joanne Koong <joannelkoong@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Qu Wenruo <wqu@suse.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12filemap: do not use folio_contains for swap cache foliosKairui Song
Currently, none of the folio_contains callers should encounter swap cache folios. For fs/ callers, swap cache folios are never part of their workflow. For filemap and truncate, folio_contains is only used for sanity checks to verify the folio index matches the expected lookup / invalidation target. The swap cache does not utilize filemap or truncate helpers in ways that would trigger these checks, as it mostly implements its own cache management. Shmem won't trigger these sanity checks either unless thing went wrong, as it would directly trigger a BUG because swap cache index are unrelated and almost never matches shmem index. Shmem have to handle mixed values of folios, shadows, and swap entries, so it has its own way of handling the mapping. While some filemap helpers works for swap cache space, the swap cache is different from the page cache in many ways. So this particular helper will unlikely to work in a helpful way for swap cache folios. So make it explicit here that folio_contains should not be used for swap cache folios. This helps to avoid misuse, make swap cache less exposed and remove the folio_index usage here. [akpm@linux-foundation.org: s/VM_WARN_ON_FOLIO/VM_WARN_ON_ONCE_FOLIO/, per Kairui] Link: https://lkml.kernel.org/r/20250430181052.55698-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Joanne Koong <joannelkoong@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Qu Wenruo <wqu@suse.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12f2fs: drop usage of folio_indexKairui Song
folio_index is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio->index instead. It can't be a swap cache folio here. Swap mapping may only call into fs through `swap_rw` but f2fs does not use that method for swap. Link: https://lkml.kernel.org/r/20250430181052.55698-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Chao Yu <chao@kernel.org> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joanne Koong <joannelkoong@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Qu Wenruo <wqu@suse.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12fuse: drop usage of folio_indexKairui Song
Patch series "mm, swap: clean up swap cache mapping helper", v3. This series removes usage of folio_index usage in fs/, and remove swap cache checking in folio_contains. Currently, the swap cache is already no longer directly exposed to fs, and swap cache will be more different from page cache. Clean up the helpers first to simplify the code and eliminate the helpers used for resolving circular header dependency issue between filemap and swap headers, and prepare for further changes. This patch (of 6): folio_index is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio->index instead. It can't be a swap cache folio here. Swap mapping may only call into fs through `swap_rw` but fuse does not use that method for SWAP. Link: https://lkml.kernel.org/r/20250430181052.55698-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250430181052.55698-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Joanne Koong <joannelkoong@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Qu Wenruo <wqu@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12xarray: fix kerneldoc for __xa_cmpxchgChristoph Hellwig
Fix the documentation for __xa_cmpxchg to actually describe the cmpxch-like semantics correctly, based on the version for xa_cmpxchg. Link: https://lkml.kernel.org/r/20250507051656.3900864-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12docs/mm/damon/design: fix spelling mistakeThushara.M.S
The word accuracy was misspelled as "accruracy". Signed-off-by: Thushara.M.S <thusharms@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12DAX: warn when kmem regions are truncated for memory block alignmentGregory Price
Device capacity intended for use as system ram should be aligned to the architecture-defined memory block size or that capacity will be silently truncated and capacity stranded. As hotplug dax memory becomes more prevelant, the memory block size alignment becomes more important for platform and device vendors to pay attention to - so this truncation should not be silent. This issue is particularly relevant for CXL Dynamic Capacity devices, whose capacity may arrive in spec-aligned but block-misaligned chunks. Link: https://lkml.kernel.org/r/20250410142831.217887-1-gourry@gourry.net Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Gregory Price <gourry@gourry.net> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: page-flags-layout.h: change the KASAN_TAG_WIDTH for HW_TAGSGuilherme Giacomo Simoes
KASAN_TAG_WIDTH is 8 bits for both (HW_TAGS and SW_TAGS), but for HW_TAGS the KASAN_TAG_WIDTH can be 4 bits bits because due to the design of the MTE the memory words for storing metadata only need 4 bits. Change the preprocessor define KASAN_TAG_WIDTH for check if SW_TAGS is define, so KASAN_TAG_WIDTH should be 8 bits, but if HW_TAGS is define, so KASAN_TAG_WIDTH should be 4 bits to save a few flags bits. Link: https://lkml.kernel.org/r/20250428201409.5482-1-trintaeoitogc@gmail.com Signed-off-by: Guilherme Giacomo Simoes <trintaeoitogc@gmail.com> Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: perform VMA allocation, freeing, duplication in mmLorenzo Stoakes
Right now these are performed in kernel/fork.c which is odd and a violation of separation of concerns, as well as preventing us from integrating this and related logic into userland VMA testing going forward. There is a fly in the ointment - nommu - mmap.c is not compiled if CONFIG_MMU not set, and neither is vma.c. To square the circle, let's add a new file - vma_init.c. This will be compiled for both CONFIG_MMU and nommu builds, and will also form part of the VMA userland testing. This allows us to de-duplicate code, while maintaining separation of concerns and the ability for us to userland test this logic. Update the VMA userland tests accordingly, additionally adding a detach_free_vma() helper function to correctly detach VMAs before freeing them in test code, as this change was triggering the assert for this. [akpm@linux-foundation.org: remove stray newline, per Liam] Link: https://lkml.kernel.org/r/f97b3a85a6da0196b28070df331b99e22b263be8.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: move dup_mmap() to mmLorenzo Stoakes
This is a key step in our being able to abstract and isolate VMA allocation and destruction logic. This function is the last one where vm_area_free() and vm_area_dup() are directly referenced outside of mmap, so having this in mm allows us to isolate these. We do the same for the nommu version which is substantially simpler. We place the declaration for dup_mmap() in mm/internal.h and have kernel/fork.c import this in order to prevent improper use of this functionality elsewhere in the kernel. While we're here, we remove the useless #ifdef CONFIG_MMU check around mmap_read_lock_maybe_expand() in mmap.c, mmap.c is compiled only if CONFIG_MMU is set. Link: https://lkml.kernel.org/r/e49aad3d00212f5539d9fa5769bfda4ce451db3e.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: abstract initial stack setup to mm subsystemLorenzo Stoakes
There are peculiarities within the kernel where what is very clearly mm code is performed elsewhere arbitrarily. This violates separation of concerns and makes it harder to refactor code to make changes to how fundamental initialisation and operation of mm logic is performed. One such case is the creation of the VMA containing the initial stack upon execve()'ing a new process. This is currently performed in __bprm_mm_init() in fs/exec.c. Abstract this operation to create_init_stack_vma(). This allows us to limit use of vma allocation and free code to fork and mm only. We previously did the same for the step at which we relocate the initial stack VMA downwards via relocate_vma_down(), now we move the initial VMA establishment too. Take the opportunity to also move insert_vm_struct() to mm/vma.c as it's no longer needed anywhere outside of mm. Link: https://lkml.kernel.org/r/118c950ef7a8dd19ab20a23a68c3603751acd30e.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: establish mm/vma_exec.c for shared exec/mm VMA functionalityLorenzo Stoakes
Patch series "move all VMA allocation, freeing and duplication logic to mm", v3. Currently VMA allocation, freeing and duplication exist in kernel/fork.c, which is a violation of separation of concerns, and leaves these functions exposed to the rest of the kernel when they are in fact internal implementation details. Resolve this by moving this logic to mm, and making it internal to vma.c, vma.h. This also allows us, in future, to provide userland testing around this functionality. We additionally abstract dup_mmap() to mm, being careful to ensure kernel/fork.c acceses this via the mm internal header so it is not exposed elsewhere in the kernel. As part of this change, also abstract initial stack allocation performed in __bprm_mm_init() out of fs code into mm via the create_init_stack_vma(), as this code uses vm_area_alloc() and vm_area_free(). In order to do so sensibly, we introduce a new mm/vma_exec.c file, which contains the code that is shared by mm and exec. This file is added to both memory mapping and exec sections in MAINTAINERS so both sets of maintainers can maintain oversight. As part of this change, we also move relocate_vma_down() to mm/vma_exec.c so all shared mm/exec functionality is kept in one place. We add code shared between nommu and mmu-enabled configurations in order to share VMA allocation, freeing and duplication code correctly while also keeping these functions available in userland VMA testing. This is achieved by adding a mm/vma_init.c file which is also compiled by the userland tests. This patch (of 4): There is functionality that overlaps the exec and memory mapping subsystems. While it properly belongs in mm, it is important that exec maintainers maintain oversight of this functionality correctly. We can establish both goals by adding a new mm/vma_exec.c file which contains these 'glue' functions, and have fs/exec.c import them. As a part of this change, to ensure that proper oversight is achieved, add the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. scripts/get_maintainer.pl can correctly handle files in multiple entries and this neatly handles the cross-over. [akpm@linux-foundation.org: fix comment typo] Link: https://lkml.kernel.org/r/80f0d0c6-0b68-47f9-ab78-0ab7f74677fc@lucifer.local Link: https://lkml.kernel.org/r/cover.1745853549.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/91f2cee8f17d65214a9d83abb7011aa15f1ea690.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: kmemleak: mark variables as __read_mostlyLuiz Capitulino
The variables kmemleak_enabled and kmemleak_free_enabled are read in the kmemleak alloc and free path respectively, but are only written to if/when kmemleak is disabled. Link: https://lkml.kernel.org/r/4016090e857e8c4c2ade4b20df312f7f38325c15.1746046744.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: kmemleak: drop wrong commentLuiz Capitulino
Newly created objects have object->count == 0, so the comment is incorrect. Just drop it. Link: https://lkml.kernel.org/r/3dfd09bc0e77bb626619184a808774ff07de275c.1746046744.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: kmemleak: drop kmemleak_warning variableLuiz Capitulino
These are a trivial mm/kmemleak.c cleanups. I found these while reading through the code. This patch (of 3): The kmemleak_warning variable is not used since commit c5665868183f ("mm: kmemleak: use the memory pool for early allocations"), drop it. Link: https://lkml.kernel.org/r/cover.1746046744.git.luizcap@redhat.com Link: https://lkml.kernel.org/r/97e23faa7b67099027a1094c9438da5f72e037af.1746046744.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12jfs: implement migrate_folio for jfs_metapage_aopsShivank Garg
Add the missing migrate_folio operation to jfs_metapage_aops to fix warnings during memory compaction. These warnings were introduced by commit 7ee3647243e5 ("migrate: Remove call to ->writepage") which added explicit warnings when filesystems don't implement migrate_folio. System reports following warnings: jfs_metapage_aops does not implement migrate_folio WARNING: CPU: 0 PID: 6870 at mm/migrate.c:955 fallback_migrate_folio mm/migrate.c:953 [inline] WARNING: CPU: 0 PID: 6870 at mm/migrate.c:955 move_to_new_folio+0x70e/0x840 mm/migrate.c:1007 Implement metapage_migrate_folio() which handles both single and multiple metapages per page configurations. [shivankg@amd.com: change comment style] Link: https://lkml.kernel.org/r/1967593d-8084-4a4a-b384-35d5adc54eb4@amd.com [akpm@linux-foundation.org: fix build] [shivankg@amd.com: remove redundant NULL check in __metapage_migrate_folio()] Link: https://lkml.kernel.org/r/a67db238-0ca6-4725-abb2-dc092de87e1b@amd.com Link: https://lkml.kernel.org/r/20250430100150.279751-3-shivankg@amd.com Fixes: 35474d52c605 ("jfs: Convert metapage_writepage to metapage_write_folio") Signed-off-by: Shivank Garg <shivankg@amd.com> Reported-by: syzbot+8bb6fd945af4e0ad9299@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67faff52.050a0220.379d84.001b.GAE@google.com Tested-by: syzbot+8bb6fd945af4e0ad9299@syzkaller.appspotmail.com Cc: Alistair Popple <apopple@nvidia.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: add folio_expected_ref_count() for reference count calculationShivank Garg
Patch series " JFS: Implement migrate_folio for jfs_metapage_aops" v5. This patchset addresses a warning that occurs during memory compaction due to JFS's missing migrate_folio operation. The warning was introduced by commit 7ee3647243e5 ("migrate: Remove call to ->writepage") which added explicit warnings when filesystem don't implement migrate_folio. The syzbot reported following [1]: jfs_metapage_aops does not implement migrate_folio WARNING: CPU: 1 PID: 5861 at mm/migrate.c:955 fallback_migrate_folio mm/migrate.c:953 [inline] WARNING: CPU: 1 PID: 5861 at mm/migrate.c:955 move_to_new_folio+0x70e/0x840 mm/migrate.c:1007 Modules linked in: CPU: 1 UID: 0 PID: 5861 Comm: syz-executor280 Not tainted 6.15.0-rc1-next-20250411-syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 RIP: 0010:fallback_migrate_folio mm/migrate.c:953 [inline] RIP: 0010:move_to_new_folio+0x70e/0x840 mm/migrate.c:1007 To fix this issue, this series implement metapage_migrate_folio() for JFS which handles both single and multiple metapages per page configurations. While most filesystems leverage existing migration implementations like filemap_migrate_folio(), buffer_migrate_folio_norefs() or buffer_migrate_folio() (which internally used folio_expected_refs()), JFS's metapage architecture requires special handling of its private data during migration. To support this, this series introduce the folio_expected_ref_count(), which calculates external references to a folio from page/swap cache, private data, and page table mappings. This standardized implementation replaces the previous ad-hoc folio_expected_refs() function and enables JFS to accurately determine whether a folio has unexpected references before attempting migration. Implement folio_expected_ref_count() to calculate expected folio reference counts from: - Page/swap cache (1 per page) - Private data (1) - Page table mappings (1 per map) While originally needed for page migration operations, this improved implementation standardizes reference counting by consolidating all refcount contributors into a single, reusable function that can benefit any subsystem needing to detect unexpected references to folios. The folio_expected_ref_count() returns the sum of these external references without including any reference the caller itself might hold. Callers comparing against the actual folio_ref_count() must account for their own references separately. Link: https://syzkaller.appspot.com/bug?extid=8bb6fd945af4e0ad9299 [1] Link: https://lkml.kernel.org/r/20250430100150.279751-1-shivankg@amd.com Link: https://lkml.kernel.org/r/20250430100150.279751-2-shivankg@amd.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Shivank Garg <shivankg@amd.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Co-developed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12util_macros.h: make the header more resilientAndy Shevchenko
Add missing header inclusions. Link: https://lkml.kernel.org/r/20250428072754.3265274-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12sched/numa: add tracepoint that tracks the skipping of numa balancing due to ↵Libo Chen
cpuset memory pinning Unlike sched_skip_vma_numa tracepoint which tracks skipped VMAs, this tracks the task subjected to cpuset.mems pinning and prints out its allowed memory node mask. Link: https://lkml.kernel.org/r/20250424024523.2298272-3-libo.chen@oracle.com Signed-off-by: Libo Chen <libo.chen@oracle.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Chen Yu <yu.c.chen@intel.com> Cc: Chris Hyser <chris.hyser@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Koutný <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Srikanth Aithal <sraithal@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12sched/numa: skip VMA scanning on memory pinned to one NUMA node via cpuset.memsLibo Chen
Patch series "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems", v5. This patch (of 2): When the memory of the current task is pinned to one NUMA node by cgroup, there is no point in continuing the rest of VMA scanning and hinting page faults as they will just be overhead. With this change, there will be no more unnecessary PTE updates or page faults in this scenario. We have seen up to a 6x improvement on a typical java workload running on VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket AARCH64 system. With the same pinning, on a 18-cores-per-socket Intel platform, we have seen 20% improvment in a microbench that creates a 30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB pages in a fixed number of loops. Link: https://lkml.kernel.org/r/20250424024523.2298272-1-libo.chen@oracle.com Link: https://lkml.kernel.org/r/20250424024523.2298272-2-libo.chen@oracle.com Signed-off-by: Libo Chen <libo.chen@oracle.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Srikanth Aithal <sraithal@amd.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Chris Hyser <chris.hyser@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Koutný <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm/selftests: add a test to verify mmap_changing race with -EAGAINPeter Xu
Add an unit test to verify the recent mmap_changing ABI breakage. Note that I used some tricks here and there to make the test simple, e.g. I abused UFFDIO_MOVE on top of shmem with the fact that I know what I want to test will be even earlier than the vma type check. Rich comments were added to explain trivial details. Before that fix, -EAGAIN would have been written to the copy field most of the time but not always; the test should be able to reliably trigger the outlier case. After the fix, it's written always, the test verifies that making sure corresponding field (e.g. copy.copy for UFFDIO_COPY) is updated. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20250424215729.194656-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm/rmap: inline folio_test_large_maybe_mapped_shared() into callersLance Yang
To prevent the function from being used when CONFIG_MM_ID is disabled, we intend to inline it into its few callers, which also would help maintain the expected code placement. Link: https://lkml.kernel.org/r/20250424155606.57488-1-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm/damon/sysfs-schemes: use kmalloc_array() and size_add()Su Hui
It's safer to use kmalloc_array() and size_add() because it can prevent possible overflow problem. Link: https://lkml.kernel.org/r/20250421062423.740605-1-suhui@nfschina.com Signed-off-by: Su Hui <suhui@nfschina.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: workingset: simplify lockdep check in update_nodePedro Falcato
container_of(node->array, ..., i_pages) just to access i_pages again is an incredibly roundabout way of accessing node->array itself. Simplify it. Link: https://lkml.kernel.org/r/20250421-workingset-simplify-v1-1-de5c40051e0e@suse.de Signed-off-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm/mm_init: use for_each_valid_pfn() in init_unavailable_range()David Woodhouse
Currently, memmap_init initializes pfn_hole with 0 instead of ARCH_PFN_OFFSET. Then init_unavailable_range will start iterating each page from the page at address zero to the first available page, but it won't do anything for pages below ARCH_PFN_OFFSET because pfn_valid won't pass. If ARCH_PFN_OFFSET is very large (e.g., something like 2^64-2GiB if the kernel is used as a library and loaded at a very high address), the pointless iteration for pages below ARCH_PFN_OFFSET will take a very long time, and the kernel will look stuck at boot time. Use for_each_valid_pfn() to skip the pointless iterations. Link: https://lkml.kernel.org/r/20250423133821.789413-8-dwmw2@infradead.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reported-by: Ruihan Li <lrh2000@pku.edu.cn> Suggested-by: Mike Rapoport <rppt@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Ruihan Li <lrh2000@pku.edu.cn> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: use for_each_valid_pfn() in memory_hotplugDavid Woodhouse
Link: https://lkml.kernel.org/r/20250423133821.789413-7-dwmw2@infradead.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Ruihan Li <lrh2000@pku.edu.cn> Cc: Will Deacon <will@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm, x86: use for_each_valid_pfn() from __ioremap_check_ram()David Woodhouse
Instead of calling pfn_valid() separately for every single PFN in the range, use for_each_valid_pfn() and only look at the ones which are. Link: https://lkml.kernel.org/r/20250423133821.789413-6-dwmw2@infradead.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Ruihan Li <lrh2000@pku.edu.cn> Cc: Will Deacon <will@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm, PM: use for_each_valid_pfn() in kernel/power/snapshot.cDavid Woodhouse
Link: https://lkml.kernel.org/r/20250423133821.789413-5-dwmw2@infradead.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Ruihan Li <lrh2000@pku.edu.cn> Cc: Will Deacon <will@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: implement for_each_valid_pfn() for CONFIG_SPARSEMEMDavid Woodhouse
Implement for_each_valid_pfn() based on two helper functions. The first_valid_pfn() function largely mirrors pfn_valid(), calling into a pfn_section_first_valid() helper which is trivial for the !VMEMMAP case, and in the VMEMMAP case will skip to the next subsection as needed. Since next_valid_pfn() knows that its argument *is* a valid PFN, it doesn't need to do any checking at all while iterating over the low bits within a (sub)section mask; the whole (sub)section is either present or not. Note that the VMEMMAP version of pfn_section_first_valid() may return a value *higher* than end_pfn when skipping to the next subsection, and first_valid_pfn() happily returns that higher value. This is fine. [dwmw2@infradead.org: fix next_valid_pfn() for sparsemem] Link: https://lkml.kernel.org/r/c15100fcf6781a60b852c4dbb43bdc98a678fcf0.camel@infradead.org Link: https://lkml.kernel.org/r/20250423133821.789413-4-dwmw2@infradead.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Ruihan Li <lrh2000@pku.edu.cn> Cc: Will Deacon <will@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: implement for_each_valid_pfn() for CONFIG_FLATMEMDavid Woodhouse
In the FLATMEM case, the default pfn_valid() just checks that the PFN is within the range [ ARCH_PFN_OFFSET .. ARCH_PFN_OFFSET + max_mapnr ). The for_each_valid_pfn() function can therefore be a simple for() loop using those as min/max respectively. Link: https://lkml.kernel.org/r/20250423133821.789413-3-dwmw2@infradead.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Ruihan Li <lrh2000@pku.edu.cn> Cc: Will Deacon <will@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: introduce for_each_valid_pfn() and use it from reserve_bootmem_region()David Woodhouse
Patch series "mm: Introduce for_each_valid_pfn()", v4. There are cases where a naïve loop over a PFN range, calling pfn_valid() on each one, is horribly inefficient. Ruihan Li reported the case where memmap_init() iterates all the way from zero to a potentially large value of ARCH_PFN_OFFSET, and we at Amazon found the reserve_bootmem_region() one as it affects hypervisor live update. Others are more cosmetic. By introducing a for_each_valid_pfn() helper it can optimise away a lot of pointless calls to pfn_valid(), skipping immediately to the next valid PFN and also skipping *all* checks within a valid (sub)region according to the granularity of the memory model in use. This patch (of 7) Especially since commit 9092d4f7a1f8 ("memblock: update initialization of reserved pages"), the reserve_bootmem_region() function can spend a significant amount of time iterating over every 4KiB PFN in a range, calling pfn_valid() on each one, and ultimately doing absolutely nothing. On a platform used for virtualization, with large NOMAP regions that eventually get used for guest RAM, this leads to a significant increase in steal time experienced during kexec for a live update. Introduce for_each_valid_pfn() and use it from reserve_bootmem_region(). This implementation is precisely the same naïve loop that the functio used to have, but subsequent commits will provide optimised versions for FLATMEM and SPARSEMEM, and this version will remain for those architectures which provide their own pfn_valid() implementation, until/unless they also provide a matching for_each_valid_pfn(). Link: https://lkml.kernel.org/r/20250423133821.789413-1-dwmw2@infradead.org Link: https://lkml.kernel.org/r/20250423133821.789413-2-dwmw2@infradead.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Ruihan Li <lrh2000@pku.edu.cn> Cc: Will Deacon <will@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12Documentation: KHO: add memblock bindingsMike Rapoport (Microsoft)
We introduced KHO into Linux: A framework that allows Linux to pass metadata and memory across kexec from Linux to Linux. KHO reuses fdt as file format and shares a lot of the same properties of firmware-to- Linux boot formats: It needs a stable, documented ABI that allows for forward and backward compatibility as well as versioning. As first user of KHO, we introduced memblock which can now preserve memory ranges reserved with reserve_mem command line options contents across kexec, so you can use the post-kexec kernel to read traces from the pre-kexec kernel. This patch adds memblock schemas similar to "device" device tree ones to a new kho bindings directory. This allows us to force contributors to document the data that moves across KHO kexecs and catch breaking change during review. Link: https://lkml.kernel.org/r/20250509074635.3187114-18-changyuanl@google.com Co-developed-by: Alexander Graf <graf@amazon.com> Signed-off-by: Alexander Graf <graf@amazon.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12Documentation: add documentation for KHOAlexander Graf
With KHO in place, let's add documentation that describes what it is and how to use it. Link: https://lkml.kernel.org/r/20250509074635.3187114-17-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12memblock: add KHO support for reserve_memAlexander Graf
Linux has recently gained support for "reserve_mem": A mechanism to allocate a region of memory early enough in boot that we can cross our fingers and hope it stays at the same location during most boots, so we can store for example ftrace buffers into it. Thanks to KASLR, we can never be really sure that "reserve_mem" allocations are static across kexec. Let's teach it KHO awareness so that it serializes its reservations on kexec exit and deserializes them again on boot, preserving the exact same mapping across kexec. This is an example user for KHO in the KHO patch set to ensure we have at least one (not very controversial) user in the tree before extending KHO's use to more subsystems. Link: https://lkml.kernel.org/r/20250509074635.3187114-16-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12x86/Kconfig: enable kexec handover for 64 bitsAlexander Graf
Add ARCH_SUPPORTS_KEXEC_HANDOVER for 64 bits to allow enabling of KEXEC_HANDOVER configuration option. Link: https://lkml.kernel.org/r/20250509074635.3187114-15-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12x86/boot: make sure KASLR does not step over KHO preserved memoryAlexander Graf
During kexec handover (KHO) memory contains data that should be preserved and this data would be consumed by kexec'ed kernel. To make sure that the preserved memory is not overwritten, KHO uses "scratch regions" to bootstrap kexec'ed kernel. These regions are guaranteed to not have any memory that KHO would preserve and are used as the only memory the kernel sees during the early boot. The scratch regions are passed in the setup_data by the first kernel with other KHO parameters. If the setup_data contains the KHO parameters, limit randomization to scratch areas only to make sure preserved memory won't get overwritten. Since all the pointers in setup_data are represented by u64, they require double casting (first to unsigned long and then to the actual pointer type) to compile on 32-bits. This looks goofy out of context, but it is unfortunately the way that this is handled across the tree. There are at least a dozen instances of casting like this. Link: https://lkml.kernel.org/r/20250509074635.3187114-14-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12x86/e820: temporarily enable KHO scratch for memory below 1MAlexander Graf
KHO kernels are special and use only scratch memory for memblock allocations, but memory below 1M is ignored by kernel after early boot and cannot be naturally marked as scratch. To allow allocation of the real-mode trampoline and a few (if any) other very early allocations from below 1M forcibly mark the memory below 1M as scratch. After real mode trampoline is allocated, clear that scratch marking. Link: https://lkml.kernel.org/r/20250509074635.3187114-13-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12x86/kexec: add support for passing kexec handover (KHO) dataAlexander Graf
kexec handover (KHO) creates a metadata that the kernels pass between each other during kexec. This metadata is stored in memory and kexec image contains a (physical) pointer to that memory. In addition, KHO keeps "scratch regions" available for kexec: physically contiguous memory regions that are guaranteed to not have any memory that KHO would preserve. The new kernel bootstraps itself using the scratch regions and sets all handed over memory as in use. When subsystems that support KHO initialize, they introspect the KHO metadata, restore preserved memory regions, and retrieve their state stored in the preserved memory. Enlighten x86 kexec-file and boot path about the KHO metadata and make sure it gets passed along to the next kernel. Link: https://lkml.kernel.org/r/20250509074635.3187114-12-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12x86/setup: use memblock_reserve_kern for memory used by kernelMike Rapoport (Microsoft)
memblock_reserve() does not distinguish memory used by firmware from memory used by kernel. The distinction is nice to have for accounting of early memory allocations and reservations, but it is essential for kexec handover (kho) to know how much memory kernel consumes during boot. Use memblock_reserve_kern() to reserve kernel memory, such as kernel image, initrd and setup data. Link: https://lkml.kernel.org/r/20250509074635.3187114-11-changyuanl@google.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Graf <graf@amazon.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12arm64: add KHO supportAlexander Graf
We now have all bits in place to support KHO kexecs. Add awareness of KHO in the kexec file as well as boot path for arm64 and adds the respective kconfig option to the architecture so that it can use KHO successfully. Changes to the "chosen" node have been sent to https://github.com/devicetree-org/dt-schema/pull/158. Link: https://lkml.kernel.org/r/20250509074635.3187114-10-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12kexec: add config option for KHOAlexander Graf
We have all generic code in place now to support Kexec with KHO. This patch adds a config option that depends on architecture support to enable KHO support. Link: https://lkml.kernel.org/r/20250509074635.3187114-9-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12kexec: add KHO support to kexec file loadsAlexander Graf
Kexec has 2 modes: A user space driven mode and a kernel driven mode. For the kernel driven mode, kernel code determines the physical addresses of all target buffers that the payload gets copied into. With KHO, we can only safely copy payloads into the "scratch area". Teach the kexec file loader about it, so it only allocates for that area. In addition, enlighten it with support to ask the KHO subsystem for its respective payloads to copy into target memory. Also teach the KHO subsystem how to fill the images for file loads. Link: https://lkml.kernel.org/r/20250509074635.3187114-8-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12kexec: enable KHO support for memory preservationMike Rapoport (Microsoft)
Introduce APIs allowing KHO users to preserve memory across kexec and get access to that memory after boot of the kexeced kernel kho_preserve_folio() - record a folio to be preserved over kexec kho_restore_folio() - recreates the folio from the preserved memory kho_preserve_phys() - record physically contiguous range to be preserved over kexec. The memory preservations are tracked by two levels of xarrays to manage chunks of per-order 512 byte bitmaps. For instance if PAGE_SIZE = 4096, the entire 1G order of a 1TB x86 system would fit inside a single 512 byte bitmap. For order 0 allocations each bitmap will cover 16M of address space. Thus, for 16G of memory at most 512K of bitmap memory will be needed for order 0. At serialization time all bitmaps are recorded in a linked list of pages for the next kernel to process and the physical address of the list is recorded in KHO FDT. The next kernel then processes that list, reserves the memory ranges and later, when a user requests a folio or a physical range, KHO restores corresponding memory map entries. Link: https://lkml.kernel.org/r/20250509074635.3187114-7-changyuanl@google.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12kexec: add KHO parsing supportAlexander Graf
When we have a KHO kexec, we get an FDT blob and scratch region to populate the state of the system. Provide helper functions that allow architecture code to easily handle memory reservations based on them and give device drivers visibility into the KHO FDT and memory reservations so they can recover their own state. Include a fix from Arnd Bergmann <arnd@arndb.de> https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/. Link: https://lkml.kernel.org/r/20250509074635.3187114-6-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12kexec: add Kexec HandOver (KHO) generation helpersAlexander Graf
Add the infrastructure to generate Kexec HandOver metadata. Kexec HandOver is a mechanism that allows Linux to preserve state - arbitrary properties as well as memory locations - across kexec. It does so using 2 concepts: 1) KHO FDT - Every KHO kexec carries a KHO specific flattened device tree blob that describes preserved memory regions. Device drivers can register to KHO to serialize and preserve their states before kexec. 2) Scratch Regions - CMA regions that we allocate in the first kernel. CMA gives us the guarantee that no handover pages land in those regions, because handover pages must be at a static physical memory location. We use these regions as the place to load future kexec images so that they won't collide with any handover data. Link: https://lkml.kernel.org/r/20250509074635.3187114-5-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Pratyush Yadav <ptyadav@amazon.de> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12memblock: introduce memmap_init_kho_scratch()Mike Rapoport (Microsoft)
With deferred initialization of struct page it will be necessary to initialize memory map for KHO scratch regions early. Add memmap_init_kho_scratch() method that will allow such initialization in upcoming patches. Link: https://lkml.kernel.org/r/20250509074635.3187114-4-changyuanl@google.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12memblock: add support for scratch memoryAlexander Graf
With KHO (Kexec HandOver), we need a way to ensure that the new kernel does not allocate memory on top of any memory regions that the previous kernel was handing over. But to know where those are, we need to include them in the memblock.reserved array which may not be big enough to hold all ranges that need to be persisted across kexec. To resize the array, we need to allocate memory. That brings us into a catch 22 situation. The solution to that is limit memblock allocations to the scratch regions: safe regions to operate in the case when there is memory that should remain intact across kexec. KHO provides several "scratch regions" as part of its metadata. These scratch regions are contiguous memory blocks that known not to contain any memory that should be persisted across kexec. These regions should be large enough to accommodate all memblock allocations done by the kexeced kernel. We introduce a new memblock_set_scratch_only() function that allows KHO to indicate that any memblock allocation must happen from the scratch regions. Later, we may want to perform another KHO kexec. For that, we reuse the same scratch regions. To ensure that no eventually handed over data gets allocated inside a scratch region, we flip the semantics of the scratch region with memblock_clear_scratch_only(): After that call, no allocations may happen from scratch memblock regions. We will lift that restriction in the next patch. Link: https://lkml.kernel.org/r/20250509074635.3187114-3-changyuanl@google.com Signed-off-by: Alexander Graf <graf@amazon.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>