summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-04-01Merge tag 'linux-watchdog-6.15-rc1' of ↵Linus Torvalds
git://www.linux-watchdog.org/linux-watchdog Pull watchdog updates from Wim Van Sebroeck: - Add watchdog driver for Lenovo SE30 platform - Add support for Allwinner A523 - Add i.MX94 support - watchdog framework: Convert to use device property - renesas,wdt: Document RZ/G3E support - Various other fixes and improvemenents * tag 'linux-watchdog-6.15-rc1' of git://www.linux-watchdog.org/linux-watchdog: watchdog: sunxi_wdt: Add support for Allwinner A523 dt-bindings: watchdog: sunxi: add Allwinner A523 compatible string watchdog: aspeed: fix 64-bit division watchdog: npcm: Remove unnecessary NULL check before clk_prepare_enable/clk_disable_unprepare dt-bindings: watchdog: renesas,wdt: Document RZ/G3E support watchdog: Convert to use device property watchdog: lenovo_se30_wdt: include io.h for devm_ioremap() dt-bindings: watchdog: fsl-imx7ulp-wdt: Add i.MX94 support watchdog: nic7018_wdt: tidy up ACPI ID table watchdog: s3c2410_wdt: Fix PMU register bits for ExynosAutoV920 SoC watchdog: lenovo_se30_wdt: Watchdog driver for Lenovo SE30 platform watchdog: Enable RZV2HWDT driver depend on ARCH_RENESAS watchdog: cros-ec: Add newlines to printks watchdog: aspeed: Update bootstatus handling
2025-04-02spi: bcm2835: Do not call gpiod_put() on invalid descriptorFlorian Fainelli
If we are unable to lookup the chip-select GPIO, the error path will call bcm2835_spi_cleanup() which unconditionally calls gpiod_put() on the cs->gpio variable which we just determined was invalid. Fixes: 21f252cd29f0 ("spi: bcm2835: reduce the abuse of the GPIO API") Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://patch.msgid.link/20250401224238.2854256-1-florian.fainelli@broadcom.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-04-01lib: scatterlist: fix sg_split_phys to preserve original scatterlist offsetsT Pratham
The split_sg_phys function was incorrectly setting the offsets of all scatterlist entries (except the first) to 0. Only the first scatterlist entry's offset and length needs to be modified to account for the skip. Setting the rest entries' offsets to 0 could lead to incorrect data access. I am using this function in a crypto driver that I'm currently developing (not yet sent to mailing list). During testing, it was observed that the output scatterlists (except the first one) contained incorrect garbage data. I narrowed this issue down to the call of sg_split(). Upon debugging inside this function, I found that this resetting of offset is the cause of the problem, causing the subsequent scatterlists to point to incorrect memory locations in a page. By removing this code, I am obtaining expected data in all the split output scatterlists. Thus, this was indeed causing observable runtime effects! This patch removes the offending code, ensuring that the page offsets in the input scatterlist are preserved in the output scatterlist. Link: https://lkml.kernel.org/r/20250319111437.1969903-1-t-pratham@ti.com Fixes: f8bcbe62acd0 ("lib: scatterlist: add sg splitting function") Signed-off-by: T Pratham <t-pratham@ti.com> Cc: Robert Jarzmik <robert.jarzmik@free.fr> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kamlesh Gurudasani <kamlesh@ti.com> Cc: Praneeth Bajjuri <praneeth@ti.com> Cc: Vignesh Raghavendra <vigneshr@ti.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01lib/sort.c: add _nonatomic() variants with cond_resched()Kent Overstreet
bcachefs calls sort() during recovery to sort all keys it found in the journal, and this may be very large - gigabytes on large machines. This has been causing "task blocked" warnings, so needs a cond_resched(). [kent.overstreet@linux.dev: fix kerneldoc] Link: https://lkml.kernel.org/r/cgsr5a447pxqomc4gvznsp5yroqmif4omd7o5lsr2swifjhoic@yzjjrx2bvrq7 Link: https://lkml.kernel.org/r/20250326152606.2594920-1-kent.overstreet@linux.dev Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mailmap: add an entry for Nicolas SchierNicolas Schier
Add mapping to linux.dev address as mail usage at work is limited and my mail provider for private mails is suffering from its own domain name block lists. Link: https://lkml.kernel.org/r/20250326-mailmap-add-entry-for-nicolas-v1-1-3c26579a7fdf@fjasle.eu Signed-off-by: Nicolas Schier <nicolas@fjasle.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mseal sysmap: add arch-support txtJeff Xu
Add Documentation/features/core/mseal_sys_mappings/arch-support.txt N/A: the arch is 32bits only and mseal is not supported in 32 bits, therefore N/A (until mseal is available in 32 bits kernel). [jeffxu@chromium.org: update to v3] Link: https://lkml.kernel.org/r/20250324151537.1106542-2-jeffxu@google.com Link: https://lkml.kernel.org/r/20250321032627.4147562-2-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Eric Dumaze <edumazet@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: guoweikang <guoweikang.kernel@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Meghana Malladi <m-malladi@ti.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mseal sysmap: enable s390Heiko Carstens
Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on s390, covering the vdso. [hca@linux.ibm.com: update supported architectures] Link: https://lkml.kernel.org/r/20250317131917.1332402-1-hca@linux.ibm.com Link: https://lkml.kernel.org/r/20250311123326.2686682-3-hca@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01selftest: test system mappings are sealedJeff Xu
Add sysmap_is_sealed.c to test system mappings are sealed. Note: CONFIG_MSEAL_SYSTEM_MAPPINGS must be set, as indicated in config file. Link: https://lkml.kernel.org/r/20250305021711.3867874-8-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org> Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Berg <benjamin@sipsolutions.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Elliot Hughes <enh@google.com> Cc: Florian Faineli <f.fainelli@gmail.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <mike.rapoport@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Röttger <sroettger@google.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mseal sysmap: update mseal.rstJeff Xu
Update memory sealing documentation to include details about system mappings. Link: https://lkml.kernel.org/r/20250305021711.3867874-7-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org> Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Berg <benjamin@sipsolutions.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Elliot Hughes <enh@google.com> Cc: Florian Faineli <f.fainelli@gmail.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <mike.rapoport@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Röttger <sroettger@google.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mseal sysmap: uprobe mappingJeff Xu
Provide support to mseal the uprobe mapping. Unlike other system mappings, the uprobe mapping is not established during program startup. However, its lifetime is the same as the process's lifetime. It could be sealed from creation. Test was done with perf tool, and observe the uprobe mapping is sealed. Link: https://lkml.kernel.org/r/20250305021711.3867874-6-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org> Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Berg <benjamin@sipsolutions.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Elliot Hughes <enh@google.com> Cc: Florian Faineli <f.fainelli@gmail.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <mike.rapoport@gmail.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Röttger <sroettger@google.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mseal sysmap: enable arm64Jeff Xu
Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on arm64, covering the vdso, vvar, and compat-mode vectors and sigpage mappings. Production release testing passes on Android and Chrome OS. Link: https://lkml.kernel.org/r/20250305021711.3867874-5-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org> Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Berg <benjamin@sipsolutions.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Elliot Hughes <enh@google.com> Cc: Florian Faineli <f.fainelli@gmail.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <mike.rapoport@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Röttger <sroettger@google.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mseal sysmap: enable x86-64Jeff Xu
Provide support for CONFIG_MSEAL_SYSTEM_MAPPINGS on x86-64, covering the vdso, vvar, vvar_vclock. Production release testing passes on Android and Chrome OS. Link: https://lkml.kernel.org/r/20250305021711.3867874-4-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org> Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Berg <benjamin@sipsolutions.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Elliot Hughes <enh@google.com> Cc: Florian Faineli <f.fainelli@gmail.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <mike.rapoport@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Röttger <sroettger@google.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mseal sysmap: generic vdso vvar mappingHeiko Carstens
With the introduction of the generic vdso data storage the VM_SEALED_SYSMAP vm flag must be moved from the architecture specific _install_special_mapping() call [1] [2] which maps the vvar mapping to generic code. [1] https://lkml.kernel.org/r/20250305021711.3867874-4-jeffxu@google.com [2] https://lkml.kernel.org/r/20250305021711.3867874-5-jeffxu@google.com Link: https://lkml.kernel.org/r/20250311123326.2686682-2-hca@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01selftests: x86: test_mremap_vdso: skip if vdso is msealedJeff Xu
Add code to detect if the vdso is memory sealed, skip the test if it is. Link: https://lkml.kernel.org/r/20250305021711.3867874-3-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org> Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Berg <benjamin@sipsolutions.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Elliot Hughes <enh@google.com> Cc: Florian Faineli <f.fainelli@gmail.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <mike.rapoport@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Röttger <sroettger@google.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mseal sysmap: kernel config and header changeJeff Xu
Patch series "mseal system mappings", v9. As discussed during mseal() upstream process [1], mseal() protects the VMAs of a given virtual memory range against modifications, such as the read/write (RW) and no-execute (NX) bits. For complete descriptions of memory sealing, please see mseal.rst [2]. The mseal() is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. The system mappings are readonly only, memory sealing can protect them from ever changing to writable or unmmap/remapped as different attributes. System mappings such as vdso, vvar, vvar_vclock, vectors (arm compat-mode), sigpage (arm compat-mode), are created by the kernel during program initialization, and could be sealed after creation. Unlike the aforementioned mappings, the uprobe mapping is not established during program startup. However, its lifetime is the same as the process's lifetime [3]. It could be sealed from creation. The vsyscall on x86-64 uses a special address (0xffffffffff600000), which is outside the mm managed range. This means mprotect, munmap, and mremap won't work on the vsyscall. Since sealing doesn't enhance the vsyscall's security, it is skipped in this patch. If we ever seal the vsyscall, it is probably only for decorative purpose, i.e. showing the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored. It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may alter the system mappings during restore operations. UML(User Mode Linux) and gVisor, rr are also known to change the vdso/vvar mappings. Consequently, this feature cannot be universally enabled across all systems. As such, CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default. To support mseal of system mappings, architectures must define CONFIG_ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS and update their special mappings calls to pass mseal flag. Additionally, architectures must confirm they do not unmap/remap system mappings during the process lifetime. The existence of this flag for an architecture implies that it does not require the remapping of thest system mappings during process lifetime, so sealing these mappings is safe from a kernel perspective. This version covers x86-64 and arm64 archiecture as minimum viable feature. While no specific CPU hardware features are required for enable this feature on an archiecture, memory sealing requires a 64-bit kernel. Other architectures can choose whether or not to adopt this feature. Currently, I'm not aware of any instances in the kernel code that actively munmap/mremap a system mapping without a request from userspace. The PPC does call munmap when _install_special_mapping fails for vdso; however, it's uncertain if this will ever fail for PPC - this needs to be investigated by PPC in the future [4]. The UML kernel can add this support when KUnit tests require it [5]. In this version, we've improved the handling of system mapping sealing from previous versions, instead of modifying the _install_special_mapping function itself, which would affect all architectures, we now call _install_special_mapping with a sealing flag only within the specific architecture that requires it. This targeted approach offers two key advantages: 1) It limits the code change's impact to the necessary architectures, and 2) It aligns with the software architecture by keeping the core memory management within the mm layer, while delegating the decision of sealing system mappings to the individual architecture, which is particularly relevant since 32-bit architectures never require sealing. Prior to this patch series, we explored sealing special mappings from userspace using glibc's dynamic linker. This approach revealed several issues: - The PT_LOAD header may report an incorrect length for vdso, (smaller than its actual size). The dynamic linker, which relies on PT_LOAD information to determine mapping size, would then split and partially seal the vdso mapping. Since each architecture has its own vdso/vvar code, fixing this in the kernel would require going through each archiecture. Our initial goal was to enable sealing readonly mappings, e.g. .text, across all architectures, sealing vdso from kernel since creation appears to be simpler than sealing vdso at glibc. - The [vvar] mapping header only contains address information, not length information. Similar issues might exist for other special mappings. - Mappings like uprobe are not covered by the dynamic linker, and there is no effective solution for them. This feature's security enhancements will benefit ChromeOS, Android, and other high security systems. Testing: This feature was tested on ChromeOS and Android for both x86-64 and ARM64. - Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly, i.e. "sl" shown in the smaps for those mappings, and mremap is blocked. - Passing various automation tests (e.g. pre-checkin) on ChromeOS and Android to ensure the sealing doesn't affect the functionality of Chromebook and Android phone. I also tested the feature on Ubuntu on x86-64: - With config disabled, vdso/vvar is not sealed, - with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK, normal operations such as browsing the web, open/edit doc are OK. Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1] Link: Documentation/userspace-api/mseal.rst [2] Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3] Link: https://lore.kernel.org/all/CABi2SkV6JJwJeviDLsq9N4ONvQ=EFANsiWkgiEOjyT9TQSt+HA@mail.gmail.com/ [4] Link: https://lore.kernel.org/all/202502251035.239B85A93@keescook/ [5] This patch (of 7): Provide infrastructure to mseal system mappings. Establish two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS, ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS) and VM_SEALED_SYSMAP macro for future patches. Link: https://lkml.kernel.org/r/20250305021711.3867874-1-jeffxu@google.com Link: https://lkml.kernel.org/r/20250305021711.3867874-2-jeffxu@google.com Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org> Cc: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Anna-Maria Behnsen <anna-maria@linutronix.de> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Berg <benjamin@sipsolutions.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Elliot Hughes <enh@google.com> Cc: Florian Faineli <f.fainelli@gmail.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <mike.rapoport@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stephen Röttger <sroettger@google.com> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm: pgtable: remove tlb_remove_page_ptdesc()Qi Zheng
The tlb_remove_ptdesc()/tlb_remove_table() is specially designed for page table pages, and now all architectures have been converted to use it to remove page table pages. So let's remove tlb_remove_page_ptdesc(), it currently has no users and should not be used for page table pages. Link: https://lkml.kernel.org/r/3df04c8494339073b71be4acb2d92e108ecd1b60.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01x86: pgtable: convert to use tlb_remove_ptdesc()Qi Zheng
The x86 has already been converted to use struct ptdesc, so convert it to use tlb_remove_ptdesc() instead of tlb_remove_table(). Link: https://lkml.kernel.org/r/36ad56b7e06fa4b17fb23c4fc650e8e0d72bb3cd.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01riscv: pgtable: unconditionally use tlb_remove_ptdesc()Qi Zheng
To support fast gup, the commit 69be3fb111e7 ("riscv: enable MMU_GATHER_RCU_TABLE_FREE for SMP && MMU") did the following: 1) use tlb_remove_page_ptdesc() for those platforms which use IPI to perform TLB shootdown 2) use tlb_remove_ptdesc() for those platforms which use SBI to perform TLB shootdown The tlb_remove_page_ptdesc() is the wrapper of the tlb_remove_page(). By design, the tlb_remove_page() should be used to remove a normal page from a page table entry, and should not be used for page table pages. The tlb_remove_ptdesc() is the wrapper of the tlb_remove_table(), which is designed specifically for freeing page table pages. If the CONFIG_MMU_GATHER_TABLE_FREE is enabled, the tlb_remove_table() will use semi RCU to free page table pages, that is: - batch table freeing: asynchronous free by RCU - single table freeing: IPI + synchronous free If the CONFIG_MMU_GATHER_TABLE_FREE is disabled, the tlb_remove_table() will fall back to pagetable_dtor() + tlb_remove_page(). For case 1), since we need to perform TLB shootdown before freeing the page table page, the local_irq_save() in fast gup can block the freeing and protect the fast gup page walker. Therefore we can ensure safety by just using tlb_remove_page_ptdesc(). In addition, we can also the tlb_remove_ptdesc()/tlb_remove_table() to achieve it, and it doesn't matter whether CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected. And in theory, the performance of freeing pages asynchronously via RCU will not be lower than synchronous free. For case 2), since local_irq_save() only disable S-privilege IPI irq but not M-privilege's, which is used by the SBI implementation to perform TLB shootdown, so we must select CONFIG_MMU_GATHER_RCU_TABLE_FREE and use tlb_remove_ptdesc() to ensure safety. The riscv selects this config for SMP && MMU, the CONFIG_RISCV_SBI is dependent on MMU. Therefore, only the UP system may have the situation where CONFIG_MMU_GATHER_RCU_TABLE_FREE is disabled but CONFIG_RISCV_SBI is enabled. But there is no freeing vs fast gup race in the UP system. So, in summary, we can use tlb_remove_ptdesc() to support fast gup in all cases, and this interface is specifically designed for page table pages. So let's use it unconditionally. Link: https://lkml.kernel.org/r/9025595e895515515c95e48db54b29afa489c41d.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm: pgtable: convert some architectures to use tlb_remove_ptdesc()Qi Zheng
Now, the nine architectures of csky, hexagon, loongarch, m68k, mips, nios2, openrisc, sh and um do not select CONFIG_MMU_GATHER_RCU_TABLE_FREE, and just call pagetable_dtor() + tlb_remove_page_ptdesc() (the wrapper of tlb_remove_page()). This is the same as the implementation of tlb_remove_{ptdesc|table}() under !CONFIG_MMU_GATHER_TABLE_FREE, so convert these architectures to use tlb_remove_ptdesc(). The ultimate goal is to make the architecture only use tlb_remove_ptdesc() or tlb_remove_table() for page table pages. [zhengqi.arch@bytedance.com: v2] Link: https://lkml.kernel.org/r/20250303072603.45423-1-zhengqi.arch@bytedance.com [akpm@linux-foundation.org: remove trailing semi in arch/loongarch/include/asm/pgalloc.h] Link: https://lkml.kernel.org/r/19db3e8673b67bad2f1df1ab37f1c89d99eacfea.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc*Qi Zheng
All callers of tlb_remove_ptdesc() pass it a pointer of struct ptdesc, so let's change the pt parameter from void * to struct ptdesc * to perform a type safety check. Link: https://lkml.kernel.org/r/60bb44299cf2d731df6592e446e7f694054d0dbe.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm: pgtable: make generic tlb_remove_table() use struct ptdescQi Zheng
Patch series "remove tlb_remove_page_ptdesc()", v2. As suggested by Peter Zijlstra below [1], this series aims to remove tlb_remove_page_ptdesc(). : Fundamentally tlb_remove_page() is about removing *pages* as from a PTE, : there should not be a page-table anywhere near here *ever*. : : Yes, some architectures use tlb_remove_page() for page-tables too, but : that is more or less an implementation detail that can be fixed. After this series, all architectures use tlb_remove_table() or tlb_remove_ptdesc() to remove the page table pages. In the future, once all architectures using tlb_remove_table() have also converted to using struct ptdesc (eg. powerpc), it may be possible to use only tlb_remove_ptdesc(). [1] https://lore.kernel.org/linux-mm/20250103111457.GC22934@noisy.programming.kicks-ass.net/ This patch (of 6): Now only arm will call tlb_remove_ptdesc()/tlb_remove_table() when CONFIG_MMU_GATHER_TABLE_FREE is disabled. In this case, the type of the table parameter is actually struct ptdesc * instead of struct page *. Since struct ptdesc still overlaps with struct page and has not been separated from it, forcing the table parameter to struct page * will not cause any problems at this time. But this is definitely incorrect and needs to be fixed. So just like the generic __tlb_remove_table(), let generic tlb_remove_table() use struct ptdesc by default when CONFIG_MMU_GATHER_TABLE_FREE is disabled. Link: https://lkml.kernel.org/r/cover.1740454179.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/5be8c3ab7bd68510bf0db4cf84010f4dfe372917.1740454179.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01microblaze/mm: put mm_cmdline_setup() in .init.text sectionWei Yang
As reported by lkp, there is a section mismatch of mm_cmdline_setup() and memblock. The reason is we don't specify the section of mm_cmdline_setup() and gcc put it into .text.unlikely. As mm_cmdline_setup() is only used in mmu_init(), which is in .init.text section, put mm_cmdline_setup() into it too. Link: https://lkml.kernel.org/r/20250328010136.13139-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503241259.kJV3U7Xj-lkp@intel.com/ Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_rangeJinjiang Tu
We triggered the below BUG: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x2 pfn:0x240402 head: order:9 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x1ffffe0000000040(head|node=1|zone=3|lastcpupid=0x1ffff) page_type: f4(hugetlb) page dumped because: VM_BUG_ON_PAGE(page->compound_head & 1) ------------[ cut here ]------------ kernel BUG at ./include/linux/page-flags.h:310! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 7 UID: 0 PID: 166 Comm: sh Not tainted 6.14.0-rc7-dirty #374 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : const_folio_flags+0x3c/0x58 lr : const_folio_flags+0x3c/0x58 Call trace: const_folio_flags+0x3c/0x58 (P) do_migrate_range+0x164/0x720 offline_pages+0x63c/0x6fc memory_subsys_offline+0x190/0x1f4 device_offline+0xc0/0x13c state_store+0x90/0xd8 dev_attr_store+0x18/0x2c sysfs_kf_write+0x44/0x54 kernfs_fop_write_iter+0x120/0x1cc vfs_write+0x240/0x378 ksys_write+0x70/0x108 __arm64_sys_write+0x1c/0x28 invoke_syscall+0x48/0x10c el0_svc_common.constprop.0+0x40/0xe0 When allocating a hugetlb folio, between the folio is taken from buddy and prep_compound_page() is called, start_isolate_page_range() and do_migrate_range() is called. When do_migrate_range() scans the head page of the hugetlb folio, the compound_head field isn't set, so scans the tail page next. And at this time, the compound_head field of tail page is set, folio_test_large() is called by tail page, thus triggers VM_BUG_ON(). To fix it, get folio refcount before calling folio_test_large(). Link: https://lkml.kernel.org/r/20250324131750.1551884-1-tujinjiang@huawei.com Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration") Fixes: b62b51d2d159 ("mm: memory_hotplug: remove head variable in do_migrate_range()") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01MAINTAINERS: mm: add entry for secretmemMike Rapoport (Microsoft)
Link: https://lkml.kernel.org/r/20250326215541.1809379-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Willaims <dan.j.williams@intel.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01MAINTAINERS: mm: add entry for numa memblocks and numa emulationMike Rapoport (Microsoft)
Link: https://lkml.kernel.org/r/20250326215541.1809379-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Willaims <dan.j.williams@intel.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01MAINTAINERS: mm: add entry for execmemMike Rapoport (Microsoft)
Link: https://lkml.kernel.org/r/20250326215541.1809379-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Willaims <dan.j.williams@intel.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01MAINTAINERS: fixup USERFAULTFD entryMike Rapoport (Microsoft)
Patch series "MAINTAINERS: add my isub-entries to MM part." Following discussion at LSF/MM/BPF I'm adding execmem, secretmem and numa memblocks sub-entries for MEMORY MANAGEMENT in MAINTAINERS. This patch (of 4): Change title to "MEMORY MANAGEMENT - USERFAULTFD" and make it sub-topic in memory management and add missing include/linux/userfaultfd_k.h and mailing list Link: https://lkml.kernel.org/r/20250326215541.1809379-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20250326215541.1809379-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Willaims <dan.j.williams@intel.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01selftest/mm: va_high_addr_switch: add ppc64 support checkLi Wang
Add PPC64 Radix MMU support to the va_high_addr_switch.sh by introducing check_supported_ppc64(). The function verifies: - 5-level paging (PGTABLE_LEVELS >= 5) enable in kernel config - Radix MMU (required for PPC64 5-level translation) - HugePages availability (needed for some tests) If any check fails, the test is skipped (ksft_skip). This ensures compatibility with Power9/Power10 systems running in Radix MMU mode. Avoid failures on 4-level paging system: # mmap(NULL, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(LOW_ADDR, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(HIGH_ADDR, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(HIGH_ADDR, MAP_HUGETLB) again: 0xffffffffffffffff - FAILED # mmap(HIGH_ADDR, MAP_FIXED | MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(-1, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(-1, MAP_HUGETLB) again: 0xffffffffffffffff - FAILED # mmap(ADDR_SWITCH_HINT - PAGE_SIZE, 2*HUGETLB_SIZE, MAP_HUGETLB): 0xffffffffffffffff - FAILED # mmap(ADDR_SWITCH_HINT , 2*HUGETLB_SIZE, MAP_FIXED | MAP_HUGETLB): 0xffffffffffffffff - FAILED Link: https://lkml.kernel.org/r/20250327114813.25980-1-liwang@redhat.com Signed-off-by: Li Wang <liwang@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01memblock: don't release high memory to page allocator when HIGHMEM is offMike Rapoport (Microsoft)
Nathan Chancellor reports the following crash on a MIPS system with CONFIG_HIGHMEM=n: Linux version 6.14.0-rc6-00359-g6faea3422e3b (nathan@ax162) (mips-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 SMP Fri Mar 21 08:12:02 MST 2025 earlycon: uart8250 at I/O port 0x3f8 (options '38400n8') printk: legacy bootconsole [uart8250] enabled Config serial console: console=ttyS0,38400n8r CPU0 revision is: 00019300 (MIPS 24Kc) FPU revision is: 00739300 MIPS: machine is mti,malta Software DMA cache coherency enabled Initial ramdisk at: 0x8fad0000 (5360128 bytes) OF: reserved mem: Reserved memory: No reserved-memory node in the DT Primary instruction cache 2kB, VIPT, 2-way, linesize 16 bytes. Primary data cache 2kB, 2-way, VIPT, no aliases, linesize 16 bytes Zone ranges: DMA [mem 0x0000000000000000-0x0000000000ffffff] Normal [mem 0x0000000001000000-0x000000001fffffff] Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000000000000-0x000000000fffffff] node 0: [mem 0x0000000090000000-0x000000009fffffff] Initmem setup node 0 [mem 0x0000000000000000-0x000000009fffffff] On node 0, zone Normal: 16384 pages in unavailable ranges random: crng init done percpu: Embedded 3 pages/cpu s18832 r8192 d22128 u49152 Kernel command line: rd_start=0xffffffff8fad0000 rd_size=5360128 console=ttyS0,38400n8r printk: log buffer data + meta data: 32768 + 102400 = 135168 bytes Dentry cache hash table entries: 65536 (order: 4, 262144 bytes, linear) Inode-cache hash table entries: 32768 (order: 3, 131072 bytes, linear) Writing ErrCtl register=00000000 Readback ErrCtl register=00000000 Built 1 zonelists, mobility grouping on. Total pages: 16384 mem auto-init: stack:all(zero), heap alloc:off, heap free:off Unhandled kernel unaligned access[#1]: CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc6-00359-g6faea3422e3b #1 Hardware name: mti,malta $ 0 : 00000000 00000001 81cb0880 00129027 $ 4 : 00000001 0000000a 00000002 00129026 $ 8 : ffffdfff 80101e00 00000002 00000000 $12 : 81c9c224 81c63e68 00000002 00000000 $16 : 805b1e00 00025800 81cb0880 00000002 $20 : 00000000 81c63e64 0000000a 81f10000 $24 : 81c63e64 81c63e60 $28 : 81c60000 81c63de0 00000001 81cc9d20 Hi : 00000000 Lo : 00000000 epc : 814a227c __free_pages_ok+0x144/0x3c0 ra : 81cc9d20 memblock_free_all+0x1d4/0x27c Status: 10000002 KERNEL EXL Cause : 00800410 (ExcCode 04) BadVA : 00129026 PrId : 00019300 (MIPS 24Kc) Modules linked in: Process swapper (pid: 0, threadinfo=(ptrval), task=(ptrval), tls=00000000) Stack : 81f10000 805a9e00 81c80000 00000000 00000002 814aa240 000003ff 00000400 00000000 81f10000 81c9c224 00003b1f 81c80000 81c63e60 81ca0000 81c63e64 81f10000 0000000a 0000001f 81cc9d20 81f10000 81cc96d8 00000000 81c80000 81c9c224 81c63e60 81c63e64 00000000 81f10000 00024000 00028000 00025c00 90000000 a0000000 00000002 00000017 00000000 00000000 81f10000 81f10000 ... Call Trace: [<814a227c>] __free_pages_ok+0x144/0x3c0 [<81cc9d20>] memblock_free_all+0x1d4/0x27c [<81cc6764>] mm_core_init+0x100/0x138 [<81cb4ba4>] start_kernel+0x4a0/0x6e4 Code: 1080ffd5 02003825 2467ffff <8ce30000> 7c630500 1060ffd4 00000000 8ce30000 7c630180 The crash happens because commit 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") too eagerly frees high memory to the page allocator even when HIGHMEM is disabled. Make sure that when CONFIG_HIGHMEM=n the high memory is not released to the page allocator. Link: https://lore.kernel.org/all/20250323190647.GA1009914@ax162 Link: https://lkml.kernel.org/r/20250325114928.1791109-3-rppt@kernel.org Reported-by: Nathan Chancellor <nathan@kernel.org> Fixes: 6faea3422e3b ("arch, mm: streamline HIGHMEM freeing") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm/mm_init: init holes in the end of the memory map for FLATMEMMike Rapoport (Microsoft)
Patch series "mm: fixes for fallouts from mem_init() cleanup". These are the fixes for fallouts from mem_init() cleanup reported by Nathan Chancellor and kbuild. The details are in the commit messages. This patch (of 2): Kernel test robot reports the following crash on 32-bit system with FLATMEM and DEBUG_VM_PGFLAGS enabled: [ 0.478822][ T0] kernel BUG at include/linux/page-flags.h:536! [ 0.479312][ T0] Oops: invalid opcode: 0000 [#1] PREEMPT SMP [ 0.479768][ T0] CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.14.0-rc6-00357-g8268af309d07 #1 [ 0.480470][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 0.481260][ T0] EIP: reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.481683][ T0] Code: 5d c3 01 f1 89 c8 ba e1 38 f4 c3 e8 1e 37 8e fc 0f 0b b8 90 e2 62 c4 e8 e2 05 5e fc 01 f1 89 c8 ba be 85 f7 c3 e8 04 37 8e fc <0f> 0b b8 80 e2 62 c4 e8 c8 05 5e fc 55 89 e5 53 57 56 83 ec 10 89 [ 0.483177][ T0] EAX: 00000000 EBX: c425df50 ECX: 00000000 EDX: 00000000 [ 0.483712][ T0] ESI: 017ffc00 EDI: ffffffff EBP: c425df34 ESP: c425df2c [ 0.484248][ T0] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00210046 [ 0.484846][ T0] CR0: 80050033 CR2: 00000000 CR3: 04b48000 CR4: 00000090 [ 0.485376][ T0] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 0.485907][ T0] DR6: fffe0ff0 DR7: 00000400 [ 0.486253][ T0] Call Trace: [ 0.486494][ T0] ? __die_body (arch/x86/kernel/dumpstack.c:478) [ 0.486822][ T0] ? die (arch/x86/kernel/dumpstack.c:?) [ 0.487099][ T0] ? do_trap (arch/x86/kernel/traps.c:? arch/x86/kernel/traps.c:197) [ 0.487409][ T0] ? do_error_trap (arch/x86/kernel/traps.c:217) [ 0.487752][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.488153][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301) [ 0.488490][ T0] ? handle_invalid_op (arch/x86/kernel/traps.c:254) [ 0.488869][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.489271][ T0] ? exc_invalid_op (arch/x86/kernel/traps.c:316) [ 0.489619][ T0] ? handle_exception (arch/x86/entry/entry_32.S:1055) [ 0.489996][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301) [ 0.490332][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.490733][ T0] ? exc_overflow (arch/x86/kernel/traps.c:301) [ 0.491068][ T0] ? reserve_bootmem_region (include/linux/page-flags.h:536) [ 0.491470][ T0] memmap_init_reserved_pages (mm/memblock.c:2203) [ 0.491887][ T0] free_low_memory_core_early (mm/memblock.c:?) [ 0.492302][ T0] memblock_free_all (mm/memblock.c:2272 include/linux/atomic/atomic-arch-fallback.h:546 include/linux/atomic/atomic-long.h:123 include/linux/atomic/atomic-instrumented.h:3261 include/linux/mm.h:67 mm/memblock.c:2273) [ 0.492659][ T0] mem_init (arch/x86/mm/init_32.c:735) [ 0.492952][ T0] mm_core_init (mm/mm_init.c:2730) [ 0.493271][ T0] start_kernel (init/main.c:958) [ 0.493604][ T0] i386_start_kernel (arch/x86/kernel/head32.c:79) [ 0.493969][ T0] startup_32_smp (arch/x86/kernel/head_32.S:292) The crash happens because after commit 8268af309d07 ("arch, mm: set max_mapnr when allocating memory map for FLATMEM") max_mapnr is rounded up to MAX_ORDER_NR_PAGES and the pages in the end of the memory map are passing pfn_valid() check in reserve_bootmem_region(). Make sure that that pages in the end of the memory map are initialized, just like the pages in the end of the last section for SPARSEMEM. Link: https://lkml.kernel.org/r/20250325114928.1791109-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20250325114928.1791109-2-rppt@kernel.org Fixes: 8268af309d07 ("arch, mm: set max_mapnr when allocating memory map for FLATMEM") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503241424.d16223ec-lkp@intel.com Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01MAINTAINERS: add peterx as userfaultfd reviewerPeter Xu
Add an entry for userfaultfd and make myself a reviewer of it, just in case it helps people manage the cc list. I named it MEMORY USERFAULTFD, could be a bad name, but then it can be together with the MEMORY* entries when everything is in alphabetic order, which is definitely a benefit. The line may not change much on how I'd work with userfaultfd; I think I'll do the same as before.. But maybe it still, more or less, adds some responsibility on top, indeed. [akpm@linux-foundation.org: add include/linux/userfaultfd_k.h, per Mike] [akpm@linux-foundation.org: fix misordering] Link: https://lkml.kernel.org/r/20250322002124.131736-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm/page_alloc: replace flag check with PageHWPoison() in check_new_page_bad()Ye Liu
This patch replaces the direct check for the __PG_HWPOISON flag with the PageHWPoison() macro, improving code readability and maintaining consistency with other parts of the memory management code. Link: https://lkml.kernel.org/r/20250320063346.489030-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm/damon/core: simplify control flow in damon_register_ops()Taotao Chen
The function logic is not complex, so using goto is unnecessary. Replace it with a straightforward if-else to simplify control flow and improve readability. Link: https://lkml.kernel.org/r/Z9vxcPCw8tDsjKw1@OneApple Signed-off-by: Taotao Chen <chentaotao@didiglobal.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm/kasan: use SLAB_NO_MERGE flag instead of an empty constructorHarry Yoo
Use SLAB_NO_MERGE flag to prevent merging instead of providing an empty constructor. Using an empty constructor in this manner is an abuse of slab interface. The SLAB_NO_MERGE flag should be used with caution, but in this case, it is acceptable as the cache is intended solely for debugging purposes. No functional changes intended. Link: https://lkml.kernel.org/r/20250318015926.1629748-1-harry.yoo@oracle.com Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Alexander Potapenko <glider@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm: page_alloc: fix defrag_mode's retry & OOM pathJohannes Weiner
Brendan points out that defrag_mode doesn't properly clear ALLOC_NOFRAGMENT on its last-ditch attempt to allocate. But looking closer, the problem is actually more severe: it doesn't actually *check* whether it's already retried, and keeps looping. This means the OOM path is never taken, and the thread can loop indefinitely. This is verified with an intentional OOM test on defrag_mode=1, which results in the machine hanging. After this patch, it triggers the OOM kill reliably and recovers. Clear ALLOC_NOFRAGMENT properly, and only retry once. Link: https://lkml.kernel.org/r/20250401041231.GA2117727@cmpxchg.org Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Brendan Jackman <jackmanb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm/mremap: do not set vrm->vma NULL immediately prior to checking itLorenzo Stoakes
This seems rather unwise. If we cannot merge, extend, then we need to recall the original VMA to see if we need to uncharge. If we do need to, do so. Link: https://lkml.kernel.org/r/b2fb6b9c-376d-4e9b-905e-26d847fd3865@lucifer.local Fixes: d5c8aec0542e ("mm/mremap: initial refactor of move_vma()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-=by: "Lai, Yi" <yi1.lai@linux.intel.com> Closes: https://lore.kernel.org/linux-mm/Z+lcvEIHMLiKVR1i@ly-workstation/ Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm: zswap: fix crypto_free_acomp() deadlock in zswap_cpu_comp_dead()Yosry Ahmed
Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock (through crypto_exit_scomp_ops_async()). On the other hand, crypto_alloc_acomp_node() holds the scomp_lock (through crypto_scomp_init_tfm()), and then allocates memory. If the allocation results in reclaim, we may attempt to hold the per-CPU acomp_ctx mutex. The above dependencies can cause an ABBA deadlock. For example in the following scenario: (1) Task A running on CPU #1: crypto_alloc_acomp_node() Holds scomp_lock Enters reclaim Reads per_cpu_ptr(pool->acomp_ctx, 1) (2) Task A is descheduled (3) CPU #1 goes offline zswap_cpu_comp_dead(CPU #1) Holds per_cpu_ptr(pool->acomp_ctx, 1)) Calls crypto_free_acomp() Waits for scomp_lock (4) Task A running on CPU #2: Waits for per_cpu_ptr(pool->acomp_ctx, 1) // Read on CPU #1 DEADLOCK Since there is no requirement to call crypto_free_acomp() with the per-CPU acomp_ctx mutex held in zswap_cpu_comp_dead(), move it after the mutex is unlocked. Also move the acomp_request_free() and kfree() calls for consistency and to avoid any potential sublte locking dependencies in the future. With this, only setting acomp_ctx fields to NULL occurs with the mutex held. This is similar to how zswap_cpu_comp_prepare() only initializes acomp_ctx fields with the mutex held, after performing all allocations before holding the mutex. Opportunistically, move the NULL check on acomp_ctx so that it takes place before the mutex dereference. Link: https://lkml.kernel.org/r/20250226185625.2672936-1-yosry.ahmed@linux.dev Fixes: 12dcb0ef5406 ("mm: zswap: properly synchronize freeing resources during CPU hotunplug") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reported-by: syzbot+1a517ccfcbc6a7ab0f82@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67bcea51.050a0220.bbfd1.0096.GAE@google.com/ Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Nhat Pham <nphamcs@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Murphy <lists@colorremedies.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm/hugetlb: move hugetlb_sysctl_init() to the __init sectionMarc Herbert
hugetlb_sysctl_init() is only invoked once by an __init function and is merely a wrapper around another __init function so there is not reason to keep it. Fixes the following warning when toning down some GCC inline options: WARNING: modpost: vmlinux: section mismatch in reference: hugetlb_sysctl_init+0x1b (section: .text) -> __register_sysctl_init (section: .init.text) Link: https://lkml.kernel.org/r/20250319060041.2737320-1-marc.herbert@linux.intel.com Signed-off-by: Marc Herbert <Marc.Herbert@linux.intel.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm: page_isolation: avoid calling folio_hstate() without hugetlb_lockLiu Shixin
I found a NULL pointer dereference as followed: BUG: kernel NULL pointer dereference, address: 0000000000000028 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 5 UID: 0 PID: 5964 Comm: sh Kdump: loaded Not tainted 6.13.0-dirty #20 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1. RIP: 0010:has_unmovable_pages+0x184/0x360 ... Call Trace: <TASK> set_migratetype_isolate+0xd1/0x180 start_isolate_page_range+0xd2/0x170 alloc_contig_range_noprof+0x101/0x660 alloc_contig_pages_noprof+0x238/0x290 alloc_gigantic_folio.isra.0+0xb6/0x1f0 only_alloc_fresh_hugetlb_folio.isra.0+0xf/0x60 alloc_pool_huge_folio+0x80/0xf0 set_max_huge_pages+0x211/0x490 __nr_hugepages_store_common+0x5f/0xe0 nr_hugepages_store+0x77/0x80 kernfs_fop_write_iter+0x118/0x200 vfs_write+0x23c/0x3f0 ksys_write+0x62/0xe0 do_syscall_64+0x5b/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e As has_unmovable_pages() call folio_hstate() without hugetlb_lock, there is a race to free the HugeTLB page between PageHuge() and folio_hstate(). There is no need to add hugetlb_lock here as the HugeTLB page can be freed in lot of places. So it's enough to unfold folio_hstate() and add a check to avoid NULL pointer dereference for hugepage_migration_supported(). Link: https://lkml.kernel.org/r/20250122061151.578768-1-liushixin2@huawei.com Fixes: 464c7ffbcb16 ("mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm/hugetlb_vmemmap: fix memory loads orderingYu Zhao
Using x86_64 as an example, for a 32KB struct page[] area describing a 2MB hugeTLB, HVO reduces the area to 4KB by the following steps: 1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs; 2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped by PTE 0, and at the same time change the permission from r/w to r/o; 3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB to 4KB. However, the following race can happen due to improperly memory loads ordering: CPU 1 (HVO) CPU 2 (speculative PFN walker) page_ref_freeze() synchronize_rcu() rcu_read_lock() page_is_fake_head() is false vmemmap_remap_pte() XXX: struct page[] becomes r/o page_ref_unfreeze() page_ref_count() is not zero atomic_add_unless(&page->_refcount) XXX: try to modify r/o struct page[] Specifically, page_is_fake_head() must be ordered after page_ref_count() on CPU 2 so that it can only return true for this case, to avoid the later attempt to modify r/o struct page[]. This patch adds the missing memory barrier and makes the tests on page_is_fake_head() and page_ref_count() done in the proper order. Link: https://lkml.kernel.org/r/20250108074822.722696-1-yuzhao@google.com Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: Will Deacon <will@kernel.org> Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/ Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: Will Deacon <will@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01mm/userfaultfd: fix release hang over concurrent GUPPeter Xu
This patch should fix a possible userfaultfd release() hang during concurrent GUP. This problem was initially reported by Dimitris Siakavaras in July 2023 [1] in a firecracker use case. Firecracker has a separate process handling page faults remotely, and when the process releases the userfaultfd it can race with a concurrent GUP from KVM trying to fault in a guest page during the secondary MMU page fault process. A similar problem was reported recently again by Jinjiang Tu in March 2025 [2], even though the race happened this time with a mlockall() operation, which does GUP in a similar fashion. In 2017, commit 656710a60e36 ("userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS") was trying to fix this issue. AFAIU, that fixes well the fault paths but may not work yet for GUP. In GUP, the issue is NOPAGE will be almost treated the same as "page fault resolved" in faultin_page(), then the GUP will follow page again, seeing page missing, and it'll keep going into a live lock situation as reported. This change makes core mm return RETRY instead of NOPAGE for both the GUP and fault paths, proactively releasing the mmap read lock. This should guarantee the other release thread make progress on taking the write lock and avoid the live lock even for GUP. When at it, rearrange the comments to make sure it's uptodate. [1] https://lore.kernel.org/r/79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr [2] https://lore.kernel.org/r/20250307072133.3522652-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20250312145131.1143062-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Jinjiang Tu <tujinjiang@huawei.com> Cc: Dimitris Siakavaras <jimsiak@cslab.ece.ntua.gr> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-04-01Merge tag 'i2c-for-6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: "i2c-core updates (collected by Wolfram): - remove last user and unexport i2c_of_match_device() - irq usage cleanup from Jiri i2c-host updates (collected by Andi): Refactoring and cleanups: - octeon, cadence, i801, pasemi, mlxbf, bcm-iproc: general refactorings - octeon: remove 10-bit address support Improvements: - amd-asf: improved error handling - designware: use guard(mutex) - amd-asf, designware: update naming to follow latest specs - cadence: fix cleanup path in probe - i801: use MMIO and I/O mapping helpers to access registers - pxa: handle error after clk_prepare_enable New features: - added i2c_10bit_addr_*_from_msg() and updated multiple drivers - omap: added multiplexer state handling - qcom-geni: update frequency configuration - qup: introduce DMA usage policy New hardware support: - exynos: add support for Samsung exynos7870 - k1: add support for spacemit k1 (new driver) - imx: add support for i.mx94 lpi2c - rk3x: add support for rk3562 - designware: add support for Renesas RZ/N1D Multiplexers: - ltc4306, reg: fix assignment in platform_driver structure at24 eeprom updates (collected by Bartosz): - add two new compatible entries to the DT binding document - drop of_match_ptr() and ACPI_PTR() macros" * tag 'i2c-for-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (50 commits) dt-bindings: i2c: snps,designware-i2c: describe Renesas RZ/N1D variant irqdomain: i2c: Switch to irq_find_mapping() i2c: iproc: Refactor prototype and remove redundant error checks i2c: qcom-geni: Update i2c frequency table to match hardware guidance i2c: mlxbf: Use readl_poll_timeout_atomic() for polling i2c: pasemi: Add registers bits and switch to BIT() i2c: k1: Initialize variable before use i2c: spacemit: add support for SpacemiT K1 SoC dt-bindings: i2c: spacemit: add support for K1 SoC i2c: omap: Add support for setting mux dt-bindings: i2c: omap: Add mux-states property i2c: octeon: remove 10-bit addressing support i2c: octeon: fix return commenting i2c: i801: Use MMIO if available i2c: i801: Switch to iomapped register access i2c: i801: Improve too small kill wait time in i801_check_post i2c: i801: Move i801_wait_intr and i801_wait_byte_done in the code i2c: i801: Cosmetic improvements i2c: cadence: Move reset_control_assert after pm_runtime_set_suspended in probe error path i2c: cadence: Simplify using devm_clk_get_enabled() ...
2025-04-01selftests: ublk: kublk: fix an error log lineUday Shankar
When doing io_uring operations using liburing, errno is not used to indicate errors, so the %m format specifier does not provide any relevant information for failed io_uring commands. Fix a log line emitted on get_params failure to translate the error code returned in the cqe->res field instead. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250401-ublk_selftests-v1-2-98129c9bc8bb@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-01selftests: ublk: kublk: use ioctl-encoded opcodesUday Shankar
There are a couple of places in the kublk selftests ublk server which use the legacy ublk opcodes. These operations fail (with -EOPNOTSUPP) on a kernel compiled without CONFIG_BLKDEV_UBLK_LEGACY_OPCODES set. We could easily require it to be set as a prerequisite for these selftests, but since new applications should not be using the legacy opcodes, use the ioctl-encoded opcodes everywhere in kublk. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250401-ublk_selftests-v1-1-98129c9bc8bb@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-01x86/fred: Fix system hang during S4 resume with FRED enabledXin Li (Intel)
Upon a wakeup from S4, the restore kernel starts and initializes the FRED MSRs as needed from its perspective. It then loads a hibernation image, including the image kernel, and attempts to load image pages directly into their original page frames used before hibernation unless those frames are currently in use. Once all pages are moved to their original locations, it jumps to a "trampoline" page in the image kernel. At this point, the image kernel takes control, but the FRED MSRs still contain values set by the restore kernel, which may differ from those set by the image kernel before hibernation. Therefore, the image kernel must ensure the FRED MSRs have the same values as before hibernation. Since these values depend only on the location of the kernel text and data, they can be recomputed from scratch. Reported-by: Xi Pardee <xi.pardee@intel.com> Reported-by: Todd Brandt <todd.e.brandt@intel.com> Tested-by: Todd Brandt <todd.e.brandt@intel.com> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250401075728.3626147-1-xin@zytor.com
2025-04-01Documentation/EDAC: Fix warning document isn't included in any toctreeShiju Jose
Fix the build (htmldocs) warning: Documentation/edac/index.rst: WARNING: document isn't included in any toctree. Fixes: db99ea5f2c03 ("EDAC: Add support for EDAC device features control") Closes: https://lore.kernel.org/all/20250228185102.15842f8b@canb.auug.org.au/ Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250401115823.573-1-shiju.jose@huawei.com
2025-04-01io_uring/zcrx: return early from io_zcrx_recv_skb if readlen is 0David Wei
When readlen is set for a recvzc request, tcp_read_sock() will call io_zcrx_recv_skb() one final time with len == desc->count == 0. This is caused by the !desc->count check happening too late. The offset + 1 != skb->len happens earlier and causes the while loop to continue. Fix this in io_zcrx_recv_skb() instead of tcp_read_sock(). Return early if len is 0 i.e. the read is done. Fixes: 6699ec9a23f8 ("io_uring/zcrx: add a read limit to recvzc requests") Signed-off-by: David Wei <dw@davidwei.uk> Link: https://lore.kernel.org/r/20250401195355.1613813-1-dw@davidwei.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-01Merge tag 'dmaengine-6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine Pull dmaengine updates from Vinod Koul: "The dmaengine subsystem updates for this cycle consist of a new driver (Microchip) along with couple of yaml binding conversions, core api updates and bunch of driver updates etc. New HW support: - Microchip sama7d65 dma controller - Yaml conversion of atmel dma binding and Freescale Elo DMA Controller binding Core: - Remove device_prep_dma_imm_data() API as users are removed - Reduce scope of some less frequently used DMA request channel APIs with aim to cleanup these in future Updates: - Drop Fenghua Yu from idxd maintainers, as he changed jobs - AMD ptdma support for multiqueue and ae4dma deprecated PCI IDs removal" * tag 'dmaengine-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (29 commits) dmaengine: ptdma: Utilize the AE4DMA engine's multi-queue functionality dmaengine: ae4dma: Use the MSI count and its corresponding IRQ number dmaengine: ae4dma: Remove deprecated PCI IDs dmaengine: Remove device_prep_dma_imm_data from struct dma_device dmaengine: ti: edma: support sw triggered chans in of_edma_xlate() dmaengine: ti: k3-udma: Enable second resource range for BCDMA and PKTDMA dmaengine: fsl-edma: free irq correctly in remove path dmaengine: fsl-edma: cleanup chan after dma_async_device_unregister dt-bindings: dma: snps,dw-axi-dmac: Allow devices to be marked as noncoherent dmaengine: dmatest: Fix dmatest waiting less when interrupted dt-bindings: dma: Convert fsl,elo*-dma to YAML dt-bindings: dma: fsl-mxs-dma: Add compatible string for i.MX8 chips dmaengine: Fix typo in comment dmaengine: ti: k3-udma-glue: Drop skip_fdq argument from k3_udma_glue_reset_rx_chn dmaengine: bcm2835-dma: fix warning when CONFIG_PM=n dt-bindings: dma: fsl,edma: Add i.MX94 support dt-bindings: dma: atmel: add microchip,sama7d65-dma dmaengine: img-mdc: remove incorrect of_match_ptr annotation dmaengine: idxd: Delete unnecessary NULL check dmaengine: pxa: Enable compile test ...
2025-04-01Merge tag 'phy-for-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy Pull phy updates from Vinod Koul: "A fairly moderate sized request for the generic phy subsystem with some new device and driver support along with driver updates with Samsung and Qualcomm ones being major ones. New HW Support: - Qualcomm X1P42100 PCIe Gen4x4, QCS615 qmp usbc, PCIe UNIPHY 28LP driver, SM8750 QMP UFS PHY - Rockchip rk3576 hdptx, rk3562 naneng-combo support - Samsung MIPI D-/C-PHY driver, ExynosAutov920 ufs phy driver Updates: - Samsung USB3 Type-C lane orientation detection and configuration for Google gs101 - Qualcomm support for dual lane PHY support for QCS8300 SoC" * tag 'phy-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy: (47 commits) phy: rockchip-naneng-combo: Support rk3562 dt-bindings: phy: rockchip: Add rk3562 naneng-combophy compatible phy: rockchip: Add Samsung MIPI D-/C-PHY driver dt-bindings: phy: Add Rockchip MIPI C-/D-PHY schema phy: qcom: uniphy-28lp: add COMMON_CLK dependency phy: rockchip: usbdp: Remove unnecessary bool conversion phy: rockchip: usbdp: Avoid call hpd_event_trigger in dp_phy_init phy: rockchip: usbdp: Only verify link rates/lanes/voltage when the corresponding set flags are set phy: qcom-qmp-pcie: add dual lane PHY support for QCS8300 dt-bindings: phy: qcom,sc8280xp-qmp-pcie-phy: Document the QCS8300 QMP PCIe PHY Gen4 x2 phy: qcom-qmp-ufs: Add PHY Configuration support for sm8750 dt-bindings: phy: qcom,sc8280xp-qmp-ufs-phy: document the SM8750 QMP UFS PHY phy: qcom: Introduce PCIe UNIPHY 28LP driver dt-bindings: phy: qcom,uniphy-pcie: Document PCIe uniphy phy: qcom: qmp-usbc: Add qmp configuration for QCS615 phy: freescale: imx8m-pcie: assert phy reset and perst in power off phy: freescale: imx8m-pcie: cleanup reset logic phy: core: Remove unused phy_pm_runtime_(allow|forbid) dt-bindings: phy: document Allwinner A523 USB-2.0 PHY phy: phy-rockchip-samsung-hdptx: Add support for RK3576 ...
2025-04-01Merge tag 'soundwire-6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire Pull soundwire updates from Vinod Koul: - Support for SoundWire Bulk Register Access (BRA) protocol in core along with Intel driver support and ASoC bits required - AMD driver updates and support for ACP 7.0 and 7.1 platforms * tag 'soundwire-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/soundwire: (28 commits) soundwire: take in count the bandwidth of a prepared stream ASoC: rt711-sdca: add DP0 support soundwire: debugfs: add interface for BPT/BRA transfers ASoC: SOF: Intel: hda-sdw-bpt: add CHAIN_DMA support soundwire: intel_ace2x: add BPT send_async/wait callbacks soundwire: intel: add BPT context definition ASoC: SOF: Intel: hda-sdw-bpt: add helpers for SoundWire BPT DMA soundwire: intel_auxdevice: add indirection for BPT send_async/wait soundwire: cadence: add BTP/BRA helpers to format data soundwire: bus: add bpt_stream pointer soundwire: bus: add send_async/wait APIs for BPT protocol soundwire: stream: reuse existing code for BPT stream soundwire: stream: special-case the bus compute_params() routine soundwire: stream: extend sdw_alloc_stream() to take 'type' parameter soundwire: extend sdw_stream_type to BPT soundwire: cadence: add BTP support for DP0 Documentation: driver: add SoundWire BRA description soundwire: amd: change the log level for command response log soundwire: slave: fix an OF node reference leak in soundwire slave device soundwire: Use str_enable_disable-like helpers ...