linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2025-05-12	memblock: add MEMBLOCK_RSRV_KERN flag	Mike Rapoport (Microsoft)
	Patch series "kexec: introduce Kexec HandOver (KHO)", v8. Kexec today considers itself purely a boot loader: When we enter the new kernel, any state the previous kernel left behind is irrelevant and the new kernel reinitializes the system. However, there are use cases where this mode of operation is not what we actually want. In virtualization hosts for example, we want to use kexec to update the host kernel while virtual machine memory stays untouched. When we add device assignment to the mix, we also need to ensure that IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we need to do the same for the PCI subsystem. If we want to kexec while an SEV-SNP enabled virtual machine is running, we need to preserve the VM context pages and physical memory. See "pkernfs: Persisting guest memory and kernel/device state safely across kexec" Linux Plumbers Conference 2023 presentation for details: https://lpc.events/event/17/contributions/1485/ To start us on the journey to support all the use cases above, this patch implements basic infrastructure to allow hand over of kernel state across kexec (Kexec HandOver, aka KHO). As a really simple example target, we use memblock's reserve_mem. With this patchset applied, memory that was reserved using "reserve_mem" command line options remains intact after kexec and it is guaranteed to reside at the same physical address. == Alternatives == There are alternative approaches to (parts of) the problems above: * Memory Pools [1] - preallocated persistent memory region + allocator * PRMEM [2] - resizable persistent memory regions with fixed metadata pointer on the kernel command line + allocator * Pkernfs [3] - preallocated file system for in-kernel data with fixed address location on the kernel command line * PKRAM [4] - handover of user space pages using a fixed metadata page specified via command line All of the approaches above fundamentally have the same problem: They require the administrator to explicitly carve out a physical memory location because they have no mechanism outside of the kernel command line to pass data (including memory reservations) between kexec'ing kernels. KHO provides that base foundation. We will determine later whether we still need any of the approaches above for fast bulk memory handover of for example IOMMU page tables. But IMHO they would all be users of KHO, with KHO providing the foundational primitive to pass metadata and bulk memory reservations as well as provide easy versioning for data. == Overview == We introduce a metadata file that the kernels pass between each other. How they pass it is architecture specific. The file's format is a Flattened Device Tree (fdt) which has a generator and parser already included in Linux. KHO is enabled in the kernel command line by `kho=on`. When the root user enables KHO through /sys/kernel/debug/kho/out/finalize, the kernel invokes callbacks to every KHO users to register preserved memory regions, which contain drivers' states. When the actual kexec happens, the fdt is part of the image set that we boot into. In addition, we keep "scratch regions" available for kexec: physically contiguous memory regions that are guaranteed to not have any memory that KHO would preserve. The new kernel bootstraps itself using the scratch regions and sets all handed over memory as in use. When drivers initialize that support KHO, they introspect the fdt, restore preserved memory regions, and retrieve their states stored in the preserved memory. == Limitations == Currently KHO is only implemented for file based kexec. The kernel interfaces in the patch set are already in place to support user space kexec as well, but it is still not implemented it yet inside kexec tools. == How to Use == To use the code, please boot the kernel with the "kho=on" command line parameter. KHO will automatically create scratch regions. If you want to set the scratch size explicitly you can use "kho_scratch=" command line parameter. For instance, "kho_scratch=16M,512M,256M" will reserve a 16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB per NUMA node scratch regions on boot. Make sure to have a reserved memory range requested with reserv_mem command line option, for example, "reserve_mem=64m:4k:n1". Then before you invoke file based "kexec -l", finalize KHO FDT: # echo 1 > /sys/kernel/debug/kho/out/finalize You can preview the generated FDT using `dtc`, # dtc /sys/kernel/debug/kho/out/fdt # dtc /sys/kernel/debug/kho/out/sub_fdts/memblock `dtc` is available on ubuntu by `sudo apt-get install device-tree-compiler`. Now kexec into the new kernel, # kexec -l Image --initrd=initrd -s # kexec -e (The order of KHO finalization and "kexec -l" does not matter.) The new kernel will boot up and contain the previous kernel's reserve_mem contents at the same physical address as the first kernel. You can also review the FDT passed from the old kernel, # dtc /sys/kernel/debug/kho/in/fdt # dtc /sys/kernel/debug/kho/in/sub_fdts/memblock This patch (of 17): To denote areas that were reserved for kernel use either directly with memblock_reserve_kern() or via memblock allocations. Link: https://lore.kernel.org/lkml/20250424083258.2228122-1-changyuanl@google.com/ Link: https://lore.kernel.org/lkml/aAeaJ2iqkrv_ffhT@kernel.org/ Link: https://lore.kernel.org/lkml/35c58191-f774-40cf-8d66-d1e2aaf11a62@intel.com/ Link: https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/ Link: https://lkml.kernel.org/r/20250509074635.3187114-1-changyuanl@google.com Link: https://lkml.kernel.org/r/20250509074635.3187114-2-changyuanl@google.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-24	Add tests for memblock_alloc_node()	Claudio Migliorelli
	This test is aimed at verifying the memblock_alloc_node() to work as expected, so setting the correct NUMA node for the new allocated region. The memblock_alloc_node() is called directly without using any stub. The core check is between the requested NUMA node and the `nid` field inside the memblock_region structure. These two are supposed to be equal for the test to succeed. Signed-off-by: Claudio Migliorelli <claudio.migliorelli@mail.polimi.it> Link: https://lore.kernel.org/r/ea5e938e-6b74-b188-af59-4b94b18bc0@mail.polimi.it Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
2022-11-08	memblock tests: introduce range tests for memblock_alloc_exact_nid_raw	Rebecca Mckeever
	Add TEST_F_EXACT flag, which specifies that tests should run memblock_alloc_exact_nid_raw(). Introduce range tests for memblock_alloc_exact_nid_raw() by using the TEST_F_EXACT flag to run the range tests in alloc_nid_api.c, since memblock_alloc_exact_nid_raw() and memblock_alloc_try_nid_raw() behave the same way when nid = NUMA_NO_NODE. Rename tests and other functions in alloc_nid_api.c by removing "_try". Since the test names will be displayed in verbose output, they need to be general enough to refer to any of the memblock functions that the tests may run. Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Rebecca Mckeever <remckee0@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/5a4b6d1b6130ab7375314e1c45a6d5813dfdabbd.1667802195.git.remckee0@gmail.com
2022-09-18	memblock tests: add generic NUMA tests for memblock_alloc_try_nid*	Rebecca Mckeever
	Add tests for memblock_alloc_try_nid() and memblock_alloc_try_nid_raw() where the simulated physical memory is set up with multiple NUMA nodes. Additionally, two of these tests set nid != NUMA_NO_NODE. All tests are run for both top-down and bottom-up allocation directions. The tested scenarios are: Range unrestricted: - region cannot be allocated: + none of the nodes have enough memory to allocate the region Range restricted: - region can be allocated in the specific node requested without dropping min_addr: + the range fully overlaps with the node, and there are adjacent reserved regions - region cannot be allocated: + nid is set to NUMA_NO_NODE and the total range can fit the region, but the range is split between two nodes and everything else is reserved Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Rebecca Mckeever <remckee0@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/4b2c7e6e5f3a9837939e99293c77e0e6fc3ae4f9.1663046060.git.remckee0@gmail.com
2022-09-18	memblock tests: add bottom-up NUMA tests for memblock_alloc_try_nid*	Rebecca Mckeever
	Add tests for memblock_alloc_try_nid() and memblock_alloc_try_nid_raw() where the simulated physical memory is set up with multiple NUMA nodes. Additionally, all of these tests set nid != NUMA_NO_NODE. These tests are run with a bottom-up allocation direction. The tested scenarios are: Range unrestricted: - region can be allocated in the specific node requested: + there are no previously reserved regions + the requested node is partially reserved but has enough space - the specific node requested cannot accommodate the request, but the region can be allocated in a different node: + there are no previously reserved regions, but node is too small + the requested node is fully reserved + the requested node is partially reserved and does not have enough space Range restricted: - region can be allocated in the specific node requested after dropping min_addr: + range partially overlaps with two different nodes, where the first node is the requested node + range partially overlaps with two different nodes, where the requested node ends before min_addr - region cannot be allocated in the specific node requested, but it can be allocated in the requested range: + range overlaps with multiple nodes along node boundaries, and the requested node ends before min_addr + range overlaps with multiple nodes along node boundaries, and the requested node starts after max_addr - region cannot be allocated in the specific node requested, but it can be allocated after dropping min_addr: + range partially overlaps with two different nodes, where the second node is the requested node Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Rebecca Mckeever <remckee0@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/00c4810daaf5d050abc71915b24ed7419bb16b51.1663046060.git.remckee0@gmail.com
2022-09-18	memblock tests: add top-down NUMA tests for memblock_alloc_try_nid*	Rebecca Mckeever
	Add tests for memblock_alloc_try_nid() and memblock_alloc_try_nid_raw() where the simulated physical memory is set up with multiple NUMA nodes. Additionally, all of these tests set nid != NUMA_NO_NODE. These tests are run with a top-down allocation direction. The tested scenarios are: Range unrestricted: - region can be allocated in the specific node requested: + there are no previously reserved regions + the requested node is partially reserved but has enough space - the specific node requested cannot accommodate the request, but the region can be allocated in a different node: + there are no previously reserved regions, but node is too small + the requested node is fully reserved + the requested node is partially reserved and does not have enough space Range restricted: - region can be allocated in the specific node requested after dropping min_addr: + range partially overlaps with two different nodes, where the first node is the requested node + range partially overlaps with two different nodes, where the requested node ends before min_addr - region cannot be allocated in the specific node requested, but it can be allocated in the requested range: + range overlaps with multiple nodes along node boundaries, and the requested node ends before min_addr + range overlaps with multiple nodes along node boundaries, and the requested node starts after max_addr - region cannot be allocated in the specific node requested, but it can be allocated after dropping min_addr: + range partially overlaps with two different nodes, where the second node is the requested node Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Rebecca Mckeever <remckee0@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/84009c5b3969337ccf89df850db56d364f8c228b.1663046060.git.remckee0@gmail.com
2022-09-04	memblock_tests: move variable declarations to single block	Rebecca Mckeever
	Move variable declarations to a single block at the beginning of each testing function. Signed-off-by: Rebecca Mckeever <remckee0@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/e61431e73977f305fdd027bca99d1dc119e96d84.1662264355.git.remckee0@gmail.com
2022-09-04	memblock tests: remove 'cleared' from comment blocks	Rebecca Mckeever
	The tests in alloc_nid_api can now run either memblock_alloc_try_nid() or memblock_alloc_try_nid_raw(). The comment blocks for these tests should not refer to a 'cleared' region since that only applies to memblock_alloc_try_nid(). Remove 'cleared' from the comment blocks so that the comments are accurate for either memblock function. Signed-off-by: Rebecca Mckeever <remckee0@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/e8be24137e54e9f81a06af969ded82b319114d7a.1662264347.git.remckee0@gmail.com
2022-08-30	memblock tests: update alloc_nid_api to test memblock_alloc_try_nid_raw	Rebecca Mckeever
	Update memblock_alloc_try_nid() tests so that they test either memblock_alloc_try_nid() or memblock_alloc_try_nid_raw() depending on the value of alloc_nid_test_flags. Run through all the existing tests in alloc_nid_api twice: once for memblock_alloc_try_nid() and once for memblock_alloc_try_nid_raw(). When the tests run memblock_alloc_try_nid(), they test that the entire memory region is zero. When the tests run memblock_alloc_try_nid_raw(), they test that the entire memory region is nonzero. The content of the memory region is initialized to nonzero, and we expect it to remain unchanged if running memblock_alloc_try_nid_raw(). Reviewed-by: Shaoqin Huang <shaoqin.huang@intel.com> Signed-off-by: Rebecca Mckeever <remckee0@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/6fa8938f67872841c10a00afb042947d1d280a04.1661578349.git.remckee0@gmail.com
2022-08-30	memblock tests: add labels to verbose output for generic alloc tests	Rebecca Mckeever
	Generic tests for memblock_alloc() functions do not use separate functions for testing top-down and bottom-up allocation directions. Therefore, the function name that is displayed in the verbose testing output does not include the allocation direction. Add an additional prefix when running generic tests for memblock_alloc() functions that indicates which allocation direction is set. The prefix will be displayed when the tests are run in verbose mode. Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Shaoqin Huang <shaoqin.huang@intel.com> Signed-off-by: Rebecca Mckeever <remckee0@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/fb76a42253d2a196a7daea29dd8121a69904f58e.1661578349.git.remckee0@gmail.com
2022-08-30	memblock tests: update zeroed memory check for memblock_alloc_* tests	Rebecca Mckeever
	Update the assert in memblock_alloc_try_nid() and memblock_alloc_from() tests that checks whether the memory is cleared so that it checks the entire chunk of allocated memory instead of just the first byte. Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Shaoqin Huang <shaoqin.huang@intel.com> Signed-off-by: Rebecca Mckeever <remckee0@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/24b3271751756100142e65b75284d43b4d30c9b7.1661578349.git.remckee0@gmail.com
2022-07-04	memblock tests: add verbose output to memblock tests	Rebecca Mckeever
	Add and use functions and macros for printing verbose testing output. If the Memblock simulator was compiled with VERBOSE=1: - prefix_push(): appends the given string to a prefix string that will be printed in test_fail() and test_pass*(). - prefix_pop(): removes the last prefix from the prefix string. - prefix_reset(): clears the prefix string. - test_fail(): prints a message after a test fails containing the test number of the failing test and the prefix. - test_pass(): prints a message after a test passes containing its test number and the prefix. - test_print(): prints the given formatted output string. - test_pass_pop(): runs test_pass() followed by prefix_pop(). - PREFIX_PUSH(): runs prefix_push(__func__). If the Memblock simulator was not compiled with VERBOSE=1, these functions/macros do nothing. Add the assert wrapper macros ASSERT_EQ(), ASSERT_NE(), and ASSERT_LT(). If the assert condition fails, these macros call test_fail() before executing assert(). Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Shaoqin Huang <shaoqin.huang@intel.com> Signed-off-by: Rebecca Mckeever <remckee0@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/f234d443fe154d5ae8d8aa07284aff69edfb6f61.1656907314.git.remckee0@gmail.com
2022-03-09	memblock tests: Add memblock_alloc_try_nid tests for bottom up	Karolina Drobnik
	Add checks for memblock_alloc_try_nid for bottom up allocation direction. As the definition of this function is pretty close to the core memblock_alloc_range_nid, the test cases implemented here cover most of the code paths related to the memory allocations. The tested scenarios are: - Region can be allocated within the requested range (both with aligned and misaligned boundaries) - Region can be allocated between two already existing entries - Not enough space between already reserved regions - Memory at the range boundaries is reserved but there is enough space to allocate a new region - The memory range is too narrow but memory can be allocated before the maximum address - Edge cases: + Minimum address is below memblock_start_of_DRAM() + Maximum address is above memblock_end_of_DRAM() Add test case wrappers to test both directions in the same context. Signed-off-by: Karolina Drobnik <karolinadrobnik@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/1c0ba11b8da5dc8f71ad45175c536fa4be720984.1646055639.git.karolinadrobnik@gmail.com
2022-03-09	memblock tests: Add memblock_alloc_try_nid tests for top down	Karolina Drobnik
	Add tests for memblock_alloc_try_nid for top down allocation direction. As the definition of this function is pretty close to the core memblock_alloc_range_nid, the test cases implemented here cover most of the code paths related to the memory allocations. The tested scenarios are: - Region can be allocated within the requested range (both with aligned and misaligned boundaries) - Region can be allocated between two already existing entries - Not enough space between already reserved regions - Memory range is too narrow but memory can be allocated before the maximum address - Edge cases: + Minimum address is below memblock_start_of_DRAM() + Maximum address is above memblock_end_of_DRAM() Add checks for both allocation directions: - Region starts at the min_addr and ends at max_addr - Maximum address is too close to the beginning of the available memory - Memory at the range boundaries is reserved but there is enough space to allocate a new region Signed-off-by: Karolina Drobnik <karolinadrobnik@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Link: https://lore.kernel.org/r/d6c282e0f9f62c15bf74c216214604764232d637.1646055639.git.karolinadrobnik@gmail.com