summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-07-24scsi: dt-bindings: mediatek,ufs: Add ufs-disable-mcq flag for UFS hostMacpaul Lin
Add the 'mediatek,ufs-disable-mcq' property to the UFS device-tree bindings. This flag corresponds to the UFS_MTK_CAP_DISABLE_MCQ host capability recently introduced in the UFS host driver, allowing it to disable the Multiple Circular Queue (MCQ) feature when present. The binding schema has also been updated to resolve DTBS check errors. Cc: stable@vger.kernel.org Fixes: 46bd3e31d74b ("scsi: ufs: mediatek: Add UFS_MTK_CAP_DISABLE_MCQ") Signed-off-by: Macpaul Lin <macpaul.lin@mediatek.com> Link: https://lore.kernel.org/r/20250722085721.2062657-2-macpaul.lin@mediatek.com Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Peter Wang <peter.wang@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: ufs-mediatek: Add UFS host support for MT8195 SoCMacpaul Lin
Add "mediatek,mt8195-ufshci" to the of_device_id table to enable support for MediaTek MT8195/MT8395 UFS host controller. This matches the device node entry in the MT8195/MT8395 device tree and allows proper driver binding. Signed-off-by: Macpaul Lin <macpaul.lin@mediatek.com> Link: https://lore.kernel.org/r/20250722085721.2062657-1-macpaul.lin@mediatek.com Reviewed-by: Peter Wang <peter.wang@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24Merge patch series "scsi: ufs: ufs-pci: Fix hibernate state transition for ↵Martin K. Petersen
Intel MTL-like host controllers" Adrian Hunter <adrian.hunter@intel.com> says: Hi Here is V2 of a couple of fixes for Intel MTL-like UFS host controllers, related to link Hibernation state. Following the fixes are some improvements for the enabling and disabling of UIC Completion interrupts. Link: https://lore.kernel.org/r/20250723165856.145750-1-adrian.hunter@intel.com Conflicts: drivers/ufs/core/ufshcd.c Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: ufs-pci: Remove control of UIC Completion interrupt for Intel MTLAdrian Hunter
Now that UFS core enables the UIC Completion interrupt only when needed, Intel MTL driver no longer needs to control the interrupt itself. So remove the associated code. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20250723165856.145750-9-adrian.hunter@intel.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: core: Do not write interrupt enable register unnecessarilyAdrian Hunter
Write a new value to the interrupt enable register only if it is different from the old value, thereby saving a register write operation. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20250723165856.145750-8-adrian.hunter@intel.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: core: Set and clear UIC Completion interrupt as neededAdrian Hunter
Currently the UIC Completion interrupt is left enabled except for when issuing link hibernate commands, in which case the interrupt is disabled and then re-enabled. Instead, set and clear the interrupt enable bit as needed. That is slightly simpler and less error prone, but also avoids side effects of accessing the interrupt enable register after entering link hibernation. Specifically, for some host controllers like Intel MTL, doing so disrupts the link state transition. Note also, the interrupt register is not read back anymore after it is updated. No other code does that, so it is assumed to be no longer necessary if it ever was. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20250723165856.145750-7-adrian.hunter@intel.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: core: Remove duplicated code in ufshcd_send_bsg_uic_cmd()Adrian Hunter
Make ufshcd_send_bsg_uic_cmd() call ufshcd_send_uic_cmd() instead of duplicating its code. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20250723165856.145750-6-adrian.hunter@intel.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: core: Move ufshcd_enable_intr() and ufshcd_disable_intr()Adrian Hunter
Move ufshcd_enable_intr() and ufshcd_disable_intr() so they can be called in subsequent patches without forward declarations. No functional change. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20250723165856.145750-5-adrian.hunter@intel.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: ufs-pci: Remove UFS PCI driver's ->late_init() call backAdrian Hunter
->late_init() was introduced to allow the default values for rpm_lvl and spm_lvl to be set. Since commit bb9850704c04 ("scsi: ufs: core: Honor runtime/system PM levels if set by host controller drivers") and commit fe06b7c07f3f ("scsi: ufs: core: Set default runtime/system PM levels before ufshcd_hba_init()"), those default values can be set in the ->init() variant call back. Move the setting of default values for rpm_lvl and spm_lvl to ->init() and remove ->late_init(). Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20250723165856.145750-4-adrian.hunter@intel.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: ufs-pci: Fix default runtime and system PM levelsAdrian Hunter
Intel MTL-like host controllers support auto-hibernate. Using auto-hibernate with manual (driver initiated) hibernate produces more complex operation. For example, the host controller will have to exit auto-hibernate simply to allow the driver to enter hibernate state manually. That is not recommended. The default rpm_lvl and spm_lvl is 3, which includes manual hibernate. Change the default values to 2, which does not. Note, to be simpler to backport to stable kernels, utilize the UFS PCI driver's ->late_init() call back. Recent commits have made it possible to set up a controller-specific default in the regular ->init() call back, but not all stable kernels have those changes. Fixes: 4049f7acef3e ("scsi: ufs: ufs-pci: Add support for Intel MTL") Cc: stable@vger.kernel.org Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20250723165856.145750-3-adrian.hunter@intel.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: ufs-pci: Fix hibernate state transition for Intel MTL-like host ↵Archana Patni
controllers UFSHCD core disables the UIC completion interrupt when issuing UIC hibernation commands, and re-enables it afterwards if it was enabled to start with, refer ufshcd_uic_pwr_ctrl(). For Intel MTL-like host controllers, accessing the register to re-enable the interrupt disrupts the state transition. Use hibern8_notify variant operation to disable the interrupt during the entire hibernation, thereby preventing the disruption. Fixes: 4049f7acef3e ("scsi: ufs: ufs-pci: Add support for Intel MTL") Cc: stable@vger.kernel.org Signed-off-by: Archana Patni <archana.patni@intel.com> Link: https://lore.kernel.org/r/20250723165856.145750-2-adrian.hunter@intel.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24clk: spacemit: ccu_pll: fix error return value in recalc_rate callbackAkhilesh Patil
Return 0 instead of -EINVAL if function ccu_pll_recalc_rate() fails to get correct rate entry. Follow .recalc_rate callback documentation as mentioned in include/linux/clk-provider.h for error return value. Signed-off-by: Akhilesh Patil <akhilesh@ee.iitb.ac.in> Fixes: 1b72c59db0add ("clk: spacemit: Add clock support for SpacemiT K1 SoC") Reviewed-by: Haylen Chu <heylenay@4d2.org> Reviewed-by: Alex Elder <elder@riscstar.com> Link: https://lore.kernel.org/r/aIBzVClNQOBrjIFG@bhairav-test.ee.iitb.ac.in Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2025-07-24Merge patch series "ufs: host: mediatek: Provide features and fixes in ↵Martin K. Petersen
MediaTek platforms" peter.wang@mediatek.com says: This series fixes some defects and provide features in MediaTek UFS drivers. Link: https://lore.kernel.org/r/20250722030841.1998783-1-peter.wang@mediatek.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: host: mediatek: Support FDE (AES) clock scalingPeter Wang
Add support for scaling the FDE (AES) clock to achieve higher performance, particularly for HS-G5: 1. Parse DTS settings for FDE min/max mux. 2. Scale up the FDE clock when required for enhanced performance. These changes ensure that the FDE clock can be dynamically adjusted based on performance needs, leveraging DTS configurations. Signed-off-by: Peter Wang <peter.wang@mediatek.com> Link: https://lore.kernel.org/r/20250722030841.1998783-10-peter.wang@mediatek.com Reviewed-by: Chun-Hung Wu <chun-hung.wu@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: host: mediatek: Support clock scaling with Vcore bindingPeter Wang
Add support for clock scaling with Vcore binding: 1. Parse the DTS setting for Vcore voltage. 2. Set the Vcore voltage to the DTS-specified value before scaling up. 3. Reset the Vcore voltage to the default setting after scaling down. These changes ensure that the Vcore voltage is appropriately managed during clock scaling operations to maintain system stability and performance. Signed-off-by: Peter Wang <peter.wang@mediatek.com> Link: https://lore.kernel.org/r/20250722030841.1998783-9-peter.wang@mediatek.com Reviewed-by: Chun-Hung Wu <chun-hung.wu@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: host: mediatek: Add clock scaling query functionPeter Wang
Introduce a clock scaling readiness query function to streamline the process of checking clock scaling parameters. This function simplifies the code by encapsulating the logic for determining if clock scaling is ready. Signed-off-by: Peter Wang <peter.wang@mediatek.com> Link: https://lore.kernel.org/r/20250722030841.1998783-8-peter.wang@mediatek.com Reviewed-by: Chun-Hung Wu <chun-hung.wu@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: host: mediatek: Add more UFSCHI hardware versionsAlice Chao
Introduce a function for version control to distinguish between new and old platforms. Update the handling of hardware IP versions, ensuring correct version comparisons by adjusting the version format for specific projects. Signed-off-by: Alice Chao <alice.chao@mediatek.com> Link: https://lore.kernel.org/r/20250722030841.1998783-7-peter.wang@mediatek.com Reviewed-by: Peter Wang <peter.wang@mediatek.com> Reviewed-by: Chun-Hung Wu <chun-hung.wu@mediatek.com> Signed-off-by: Peter Wang <peter.wang@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: host: mediatek: Set IRQ affinity policy for MCQ modePeter Wang
Set the IRQ affinity for MCQ mode to improve performance. Specifically, it migrates the IRQ from CPU0 to CPU3 to enhance IRQ handling efficiency. Setting IRQ affinity directly from the kernel allows the configuration to take effect earlier, and provides greater security and consistency, especially important for systems with strict performanceor real-time requirements. Signed-off-by: Peter Wang <peter.wang@mediatek.com> Link: https://lore.kernel.org/r/20250722030841.1998783-6-peter.wang@mediatek.com Reviewed-by: Chun-Hung Wu <chun-hung.wu@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: host: mediatek: Handle broken RTC based on DTS settingPeter Wang
Introduce a mechanism to handle broken RTC by checking the DTS setting. The configuration is specifically required for legacy platform. Signed-off-by: Peter Wang <peter.wang@mediatek.com> Link: https://lore.kernel.org/r/20250722030841.1998783-5-peter.wang@mediatek.com Reviewed-by: Chun-Hung Wu <chun-hung.wu@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: host: mediatek: Change ref-clk timeout policyPeter Wang
Update the timeout policy for ref-clk control. - If a clock-on operation times out, it is assumed that the clock is off. The system will notify TFA to perform clock-off settings. - If a clock-off operation times out, it is assumed that the clock will eventually turn off. The 'ref_clk_enabled' flag is set directly. Signed-off-by: Peter Wang <peter.wang@mediatek.com> Link: https://lore.kernel.org/r/20250722030841.1998783-4-peter.wang@mediatek.com Reviewed-by: Chun-Hung Wu <chun-hung.wu@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: host: mediatek: Add DDR_EN settingNaomi Chu
On MT6989 and later platforms, control of DDR_EN has been switched from SPM to EMI. To prevent abnormal access to DRAM, it is necessary to wait for 'ddren_ack' or assert 'ddren_urgent' after sending 'ddren_req'. Introduce the DDR_EN configuration in the UFS initialization flow, utilizing the assertion of 'ddren_urgent' to maintain performance. Signed-off-by: Naomi Chu <naomi.chu@mediatek.com> Link: https://lore.kernel.org/r/20250722030841.1998783-3-peter.wang@mediatek.com Reviewed-by: Peter Wang <peter.wang@mediatek.com> Reviewed-by: Chun-Hung Wu <chun-hung.wu@mediatek.com> Signed-off-by: Peter Wang <peter.wang@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24scsi: ufs: host: mediatek: Simplify boolean conversionPeter Wang
Simplify the conversion from unsigned int to boolean by removing explicit conversions and parentheses, relying on implicit conversion instead. This change ensures consistency with other usages in ufs-mediatek.c and streamlines the code. Signed-off-by: Peter Wang <peter.wang@mediatek.com> Link: https://lore.kernel.org/r/20250722030841.1998783-2-peter.wang@mediatek.com Reviewed-by: Chun-Hung Wu <chun-hung.wu@mediatek.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-07-24Merge tag 'mm-hotfixes-stable-2025-07-24-18-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "11 hotfixes. 9 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 7 are for MM" * tag 'mm-hotfixes-stable-2025-07-24-18-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: sprintf.h requires stdarg.h resource: fix false warning in __request_region() mm/damon/core: commit damos_quota_goal->nid kasan: use vmalloc_dump_obj() for vmalloc error reports mm/ksm: fix -Wsometimes-uninitialized from clang-21 in advisor_mode_show() mm: update MAINTAINERS entry for HMM nilfs2: reject invalid file types when reading inodes selftests/mm: fix split_huge_page_test for folio_split() tests mailmap: add entry for Senozhatsky mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
2025-07-24mm: remove grab_cache_page()Matthew Wilcox (Oracle)
All callers have been converted to use filemap_grab_folio(). Link: https://lkml.kernel.org/r/20250721204619.163883-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm/damon/ops-common: ignore migration request to invalid nodesSeongJae Park
damon_migrate_pages() tries migration even if the target node is invalid. If users mistakenly make such invalid requests via DAMOS_MIGRATE_{HOT,COLD} action, the below kernel BUG can happen. [ 7831.883495] BUG: unable to handle page fault for address: 0000000000001f48 [ 7831.884160] #PF: supervisor read access in kernel mode [ 7831.884681] #PF: error_code(0x0000) - not-present page [ 7831.885203] PGD 0 P4D 0 [ 7831.885468] Oops: Oops: 0000 [#1] SMP PTI [ 7831.885852] CPU: 31 UID: 0 PID: 94202 Comm: kdamond.0 Not tainted 6.16.0-rc5-mm-new-damon+ #93 PREEMPT(voluntary) [ 7831.886913] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-4.el9 04/01/2014 [ 7831.887777] RIP: 0010:__alloc_frozen_pages_noprof (include/linux/mmzone.h:1724 include/linux/mmzone.h:1750 mm/page_alloc.c:4936 mm/page_alloc.c:5137) [...] [ 7831.895953] Call Trace: [ 7831.896195] <TASK> [ 7831.896397] __folio_alloc_noprof (mm/page_alloc.c:5183 mm/page_alloc.c:5192) [ 7831.896787] migrate_pages_batch (mm/migrate.c:1189 mm/migrate.c:1851) [ 7831.897228] ? __pfx_alloc_migration_target (mm/migrate.c:2137) [ 7831.897735] migrate_pages (mm/migrate.c:2078) [ 7831.898141] ? __pfx_alloc_migration_target (mm/migrate.c:2137) [ 7831.898664] damon_migrate_folio_list (mm/damon/ops-common.c:321 mm/damon/ops-common.c:354) [ 7831.899140] damon_migrate_pages (mm/damon/ops-common.c:405) [...] Add a target node validity check in damon_migrate_pages(). The validity check is stolen from that of do_pages_move(), which is being used for the move_pages() system call. Link: https://lkml.kernel.org/r/20250720185822.1451-1-sj@kernel.org Fixes: b51820ebea65 ("mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion") [6.11.x] Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Hyeongtak Ji <hyeongtak.ji@sk.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24docs: update THP documentation to clarify sysfs "never" settingLorenzo Stoakes
Rather confusingly, setting all Transparent Huge Page sysfs settings to "never" does not in fact result in THP being globally disabled. Rather, it results in khugepaged being disabled, but one can still obtain THP pages using madvise(..., MADV_COLLAPSE). This is something that has remained poorly documented for some time, and it is likely the received wisdom of most users of THP that never does, in fact, mean never. It is therefore important to highlight, very clearly, that this is not the case. [lorenzo.stoakes@oracle.com: update transhuge page to mention MADV_COLLAPSE] Link: https://lkml.kernel.org/r/d54d1dfb-f06d-4979-983b-73998f05867e@lucifer.local Link: https://lkml.kernel.org/r/20250721155530.75944-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24tools/testing/selftests: explicitly test split multi VMA mremap moveLorenzo Stoakes
Check that moving a range of VMAs where we are offset into the first and last VMAs works correctly. This results in the VMAs being split at these points at which we are offset into VMAs. We explicitly test both the ordinary MREMAP_FIXED multi VMA move case and the MREMAP_DONTUNMAP multi VMA move case. Link: https://lkml.kernel.org/r/b04920bb6c09dc86c207c251eab8ec670fbbcaef.1753119043.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24tools/testing/selftests: test MREMAP_DONTUNMAP on multiple VMA moveLorenzo Stoakes
We support MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_DONTUNMAP for moving multiple VMAs via mremap(), so assert that the tests pass with both MREMAP_DONTUNMAP set and not set. Additionally, add success = false settings when mremap() fails. This is something that cannot realistically happen, so in no way impacted test outcome, but it is incorrect to indicate a test pass when something has failed. Link: https://lkml.kernel.org/r/d7359941981e4e44c774753b3e364d1c54928e6a.1753119043.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24tools/testing/selftests: add mremap() shrink test for multiple VMAsLorenzo Stoakes
Patch series "tools/testing: expand mremap testing". Expand our mremap() testing to further assert that behaviour is as expected. There is a poorly documented mremap() feature whereby it is possible to mremap() multiple VMAs (even with gaps) when shrinking, as long as the resultant shrunk range spans only a single VMA. So we start by asserting this behaviour functions correctly both with an in-place shrink and a shrink/move. Next, we further test the newly introduced ability to mremap() multiple VMAs when performing a MAP_FIXED move (that is without the size being changed), firstly by asserting that MREMAP_DONTUNMAP has no bearing on this behaviour. Finally, we explicitly test that such moves, when splitting source VMAs, function correctly. This patch (of 3): There is an apparently little-known feature of mremap() whereby, in stark contrast to other modes (other than the recently introduced capacity to move multiple VMAs), the input source range span multiple VMAs with gaps between. This is, when shrinking a VMA, whether moving it or not, and the shrink would reduce the range to a single VMA - this is permitted, as the shrink is actioned by an unmap. This patch adds tests to assert that this behaves as expected. Link: https://lkml.kernel.org/r/cover.1753119043.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/f08122893a26092a2bec6e69443e87f468ffdbed.1753119043.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24selftests/mm: guard-regions: Use SKIP() instead of ksft_exit_skip()wang lian
To ensure only the current test is skipped on permission failure, instead of terminating the entire test binary. Link: https://lkml.kernel.org/r/20250717131857.59909-3-lianux.mm@gmail.com Signed-off-by: wang lian <lianux.mm@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mark Brown <broonie@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));"wang lian
Patch series "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup", v2. This series introduces a common FORCE_READ() macro to replace the cryptic asm volatile("" : "+r" (variable)); construct used in several mm selftests. This improves code readability and maintainability by removing duplicated, hard-to-understand code. This patch (of 2): Several mm selftests use the `asm volatile("" : "+r" (variable));` construct to force a read of a variable, preventing the compiler from optimizing away the memory access. This idiom is cryptic and duplicated across multiple test files. Following a suggestion from David[1], this patch refactors this common pattern into a FORCE_READ() macro Link: https://lkml.kernel.org/r/20250717131857.59909-1-lianux.mm@gmail.com Link: https://lkml.kernel.org/r/20250717131857.59909-2-lianux.mm@gmail.com Link: https://lore.kernel.org/lkml/4a3e0759-caa1-4cfa-bc3f-402593f1eee3@redhat.com/ [1] Signed-off-by: wang lian <lianux.mm@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24arm64: add batched versions of ptep_modify_prot_start/commitDev Jain
Override the generic definition of modify_prot_start_ptes() to use get_and_clear_full_ptes(). This helper does a TLBI only for the starting and ending contpte block of the range, whereas the current implementation will call ptep_get_and_clear() for every contpte block, thus doing a TLBI on every contpte block. Therefore, we have a performance win. The arm64 definition of pte_accessible() allows us to batch in the errata specific case: #define pte_accessible(mm, pte) \ (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte)) All ptes are obviously present in the folio batch, and they are also valid. Override the generic definition of modify_prot_commit_ptes() to simply use set_ptes() to map the new ptes into the pagetable. Link: https://lkml.kernel.org/r/20250718090244.21092-8-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm: optimize mprotect() by PTE batchingDev Jain
Use folio_pte_batch to batch process a large folio. Note that, PTE batching here will save a few function calls, and this strategy in certain cases (not this one) batches atomic operations in general, so we have a performance win for all arches. This patch paves the way for patch 7 which will help us elide the TLBI per contig block on arm64. The correctness of this patch lies on the correctness of setting the new ptes based upon information only from the first pte of the batch (which may also have accumulated a/d bits via modify_prot_start_ptes()). Observe that the flag combination we pass to mprotect_folio_pte_batch() guarantees that the batch is uniform w.r.t the soft-dirty bit and the writable bit. Therefore, the only bits which may differ are the a/d bits. So we only need to worry about code which is concerned about the a/d bits of the PTEs. Setting extra a/d bits on the new ptes where previously they were not set, is fine - setting access bit when it was not set is not an incorrectness problem but will only possibly delay the reclaim of the page mapped by the pte (which is in fact intended because the kernel just operated on this region via mprotect()!). Setting dirty bit when it was not set is again not an incorrectness problem but will only possibly force an unnecessary writeback. So now we need to reason whether something can go wrong via can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp, and userfaultfd_pte_wp cases are solved due to uniformity in the corresponding bits guaranteed by the flag combination. The ptes all belong to the same VMA (since callers guarantee that [start, end) will lie within the VMA) therefore the conditional based on the VMA is also safe to batch around. Since the dirty bit on the PTE really is just an indication that the folio got written to - even if the PTE is not actually dirty but one of the PTEs in the batch is, the wp-fault optimization can be made. Therefore, it is safe to batch around pte_dirty() in can_change_shared_pte_writable() (in fact this is better since without batching, it may happen that some ptes aren't changed to writable just because they are not dirty, even though the other ptes mapping the same large folio are dirty). To batch around the PageAnonExclusive case, we must check the corresponding condition for every single page. Therefore, from the large folio batch, we process sub batches of ptes mapping pages with the same PageAnonExclusive condition, and process that sub batch, then determine and process the next sub batch, and so on. Note that this does not cause any extra overhead; if suppose the size of the folio batch is 512, then the sub batch processing in total will take 512 iterations, which is the same as what we would have done before. For pte_needs_flush(): ppc does not care about the a/d bits. For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits get cleared; since we can only have extra a/d bits due to batching, we will only have an extra flush, not a case where we elide a flush due to batching when we shouldn't have. Link: https://lkml.kernel.org/r/20250718090244.21092-7-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm: split can_change_pte_writable() into private and shared partsDev Jain
In preparation for patch 6 and modularizing the code in general, split can_change_pte_writable() into private and shared VMA parts. No functional change intended. Link: https://lkml.kernel.org/r/20250718090244.21092-6-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm: introduce FPB_RESPECT_WRITE for PTE batching infrastructureDev Jain
Patch 6 ("mm: Optimize mprotect() by PTE batching") optimizes mprotect() by batch clearing the ptes, masking in the new protections, and batch setting the ptes. Suppose that the first pte of the batch is writable - with the current implementation of folio_pte_batch(), it is not guaranteed that the other ptes in the batch are already writable too, so we may incorrectly end up setting the writable bit on all ptes via modify_prot_commit_ptes(). Therefore, introduce FPB_RESPECT_WRITE so that all ptes in the batch are writable or not. Link: https://lkml.kernel.org/r/20250718090244.21092-5-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm: add batched versions of ptep_modify_prot_start/commitDev Jain
Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect, implementing them as a simple loop over the corresponding single pte helpers. Architecture may override these helpers. Link: https://lkml.kernel.org/r/20250718090244.21092-4-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm: optimize mprotect() for MM_CP_PROT_NUMA by batch-skipping PTEsDev Jain
For the MM_CP_PROT_NUMA skipping case, observe that, if we skip an iteration due to the underlying folio satisfying any of the skip conditions, then for all subsequent ptes which map the same folio, the iteration will be skipped for them too. Therefore, we can optimize by using folio_pte_batch() to batch skip the iterations. Use prot_numa_skip() introduced in the previous patch to determine whether we need to skip the iteration. Change its signature to have a double pointer to a folio, which will be used by mprotect_folio_pte_batch() to determine the number of iterations we can safely skip. Link: https://lkml.kernel.org/r/20250718090244.21092-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm: refactor MM_CP_PROT_NUMA skipping case into new functionDev Jain
Patch series "Optimize mprotect() for large folios", v5. Use folio_pte_batch() to optimize change_pte_range(). On arm64, if the ptes are painted with the contig bit, then ptep_get() will iterate through all 16 entries to collect a/d bits. Hence this optimization will result in a 16x reduction in the number of ptep_get() calls. Next, ptep_modify_prot_start() will eventually call contpte_try_unfold() on every contig block, thus flushing the TLB for the complete large folio range. Instead, use get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only do them on the starting and ending contig block. For split folios, there will be no pte batching; the batch size returned by folio_pte_batch() will be 1. For pagetable split folios, the ptes will still point to the same large folio; for arm64, this results in the optimization described above, and for other arches, a minor improvement is expected due to a reduction in the number of function calls. mm-selftests pass on arm64. I have some failing tests on my x86 VM already; no new tests fail as a result of this patchset. We use the following test cases to measure performance, mprotect()'ing the mapped memory to read-only then read-write 40 times: Test case 1: Mapping 1G of memory, touching it to get PMD-THPs, then pte-mapping those THPs Test case 2: Mapping 1G of memory with 64K mTHPs Test case 3: Mapping 1G of memory with 4K pages Average execution time on arm64, Apple M3: Before the patchset: T1: 2.1 seconds T2: 2 seconds T3: 1 second After the patchset: T1: 0.65 seconds T2: 0.7 seconds T3: 1.1 seconds Observing T1/T2 and T3 before the patchset, we also remove the regression introduced by ptep_get() on a contpte block. And, for large folios we get an almost 74% performance improvement, albeit the trade-off being a slight degradation in the small folio case. For x86: Before the patchset: T1: 3.75 seconds T2: 3.7 seconds T3: 3.85 seconds After the patchset: T1: 3.7 seconds T2: 3.7 seconds T3: 3.9 seconds So there is a minor improvement due to reduction in number of function calls, and a slight degradation in the small folio case due to the overhead of vm_normal_folio() + folio_test_large(). Here is the test program: #define _GNU_SOURCE #include <sys/mman.h> #include <stdlib.h> #include <string.h> #include <stdio.h> #include <unistd.h> #define SIZE (1024*1024*1024) unsigned long pmdsize = (1UL << 21); unsigned long pagesize = (1UL << 12); static void pte_map_thps(char *mem, size_t size) { size_t offs; int ret = 0; /* PTE-map each THP by temporarily splitting the VMAs. */ for (offs = 0; offs < size; offs += pmdsize) { ret |= madvise(mem + offs, pagesize, MADV_DONTFORK); ret |= madvise(mem + offs, pagesize, MADV_DOFORK); } if (ret) { fprintf(stderr, "ERROR: mprotect() failed\n"); exit(1); } } int main(int argc, char *argv[]) { char *p; int ret = 0; p = mmap((1UL << 30), SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (p != (1UL << 30)) { perror("mmap"); return 1; } memset(p, 0, SIZE); if (madvise(p, SIZE, MADV_NOHUGEPAGE)) perror("madvise"); explicit_bzero(p, SIZE); pte_map_thps(p, SIZE); for (int loops = 0; loops < 40; loops++) { if (mprotect(p, SIZE, PROT_READ)) perror("mprotect"), exit(1); if (mprotect(p, SIZE, PROT_READ|PROT_WRITE)) perror("mprotect"), exit(1); explicit_bzero(p, SIZE); } } This patch (of 7): Reduce indentation by refactoring the prot_numa case into a new function. No functional change intended. Link: https://lkml.kernel.org/r/20250718090244.21092-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20250718090244.21092-2-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Joey Gouly <joey.gouly@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm/huge_memory: refactor after-split (page) cache codeZi Yan
Smatch/coverity checkers report NULL mapping referencing issues[1][2][3] every time the code is modified, because they do not understand that mapping cannot be NULL when a folio is in page cache in the code. Refactor the code to make it explicit. Remove "end = -1" for anonymous folios, since after code refactoring, end is no longer used by anonymous folio handling code. No functional change is intended. Link: https://lkml.kernel.org/r/20250718023000.4044406-7-ziy@nvidia.com Link: https://lore.kernel.org/linux-mm/2afe3d59-aca5-40f7-82a3-a6d976fb0f4f@stanley.mountain/ [1] Link: https://lore.kernel.org/oe-kbuild/64b54034-f311-4e7d-b935-c16775dbb642@suswa.mountain/ [2] Link: https://lore.kernel.org/linux-mm/20250716145804.4836-1-antonio@mandelbit.com/ [3] Link: https://lkml.kernel.org/r/20250718183720.4054515-7-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <k.shutemov@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm/huge_memory: get frozen folio refcount with folio_expected_ref_count()Zi Yan
Instead of open coding the refcount calculation, use folio_expected_ref_count() to calculate frozen folio refcount. Because: 1. __folio_split() does not split a folio with PG_private, so no elevated refcount from PG_private; 2. a frozen folio in __folio_split() is fully unmapped, so folio_mapcount() in folio_expected_ref_count() is always 0; 3. (mapping || swap_cache) ? folio_nr_pages(folio) is taken care of by folio_expected_ref_count() too. Link: https://lkml.kernel.org/r/20250718023000.4044406-6-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250718183720.4054515-6-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Antonio Quartulli <antonio@mandelbit.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <k.shutemov@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm/huge_memory: convert VM_BUG* to VM_WARN* in __folio_splitZi Yan
These VM_BUG* can be handled gracefully without crashing kernel. Link: https://lkml.kernel.org/r/20250718023000.4044406-5-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250718183720.4054515-5-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Antonio Quartulli <antonio@mandelbit.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <k.shutemov@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm/huge_memory: deduplicate code in __folio_split()Zi Yan
xas unlock, remap_page(), local_irq_enable() are moved out of if branches to deduplicate the code. While at it, add remap_flags to clean up remap_page() call site. nr_dropped is renamed to nr_shmem_dropped, as it becomes a variable at __folio_split() scope. Link: https://lkml.kernel.org/r/20250718183720.4054515-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Antonio Quartulli <antonio@mandelbit.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <k.shutemov@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm/huge_memory: remove after_split label in __split_unmapped_folio()Zi Yan
Check stop_split instead to avoid the goto statement. Link: https://lkml.kernel.org/r/20250718183720.4054515-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Antonio Quartulli <antonio@mandelbit.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <k.shutemov@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm/huge_memory: move unrelated code out of __split_unmapped_folio()Zi Yan
Patch series "__folio_split() clean up", v5. This patchset refactors __folio_split() and __split_unmapped_folio() to: 1. make __split_unmapped_folio() reusable for splitting unmapped folios. It avoids the need for a new boolean unmapped parameter to guard mapping-related code when __split_unmapped_folio() is reused to split unmapped folios. 2. improve code readability and prevent smatch/coverity checkers from complaining about NULL mapping referencing. An additional benefit for __split_unmapped_folio() refactoring is that __split_unmapped_folio() could be called on after-split folios by __folio_split(). It can enable new split methods. For example, at deferred split time, unmapped subpages can scatter arbitrarily within a large folio, neither uniform nor non-uniform split can maximize after-split folio orders for mapped subpages. The hope is that by calling __split_unmapped_folio() multiple times, a better split result can be achieved. This patch (of 6): remap(), folio_ref_unfreeze(), lru_add_split_folio() are not relevant to splitting unmapped folio operations. Move them out to __folio_split() so that __split_unmapped_folio() only handles unmapped folio splits. This makes __split_unmapped_folio() reusable. Remove the swapcache folio split check code before __split_unmapped_folio() call, since it is already checked at the beginning of __folio_split() in uniform_split_supported() and non_uniform_split_supported(). Along with the code move, there are some variable renames: 1. release is renamed to new_folio, 2. origin_folio is now folio, since __folio_split() has folio pointing to the original folio already. Link: https://lkml.kernel.org/r/20250718023000.4044406-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250718023000.4044406-2-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250718183720.4054515-1-ziy@nvidia.com Link: https://lkml.kernel.org/r/20250718183720.4054515-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Antonio Quartulli <antonio@mandelbit.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <k.shutemov@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24fs/Kconfig: enable HUGETLBFS only if ARCH_SUPPORTS_HUGETLBFSAnshuman Khandual
Enable HUGETLBFS only when platform subscrbes via ARCH_SUPPORTS_HUGETLBFS. Hence select ARCH_SUPPORTS_HUGETLBFS on existing x86 and sparc for their continuing HUGETLBFS support. While here also just drop existing 'BROKEN' dependency. Link: https://lkml.kernel.org/r/20250711102934.2399533-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24mm: mempool: fix wake-up edge case bug for zero-minimum poolsYadan Fan
The mempool wake-up path has a edge case bug that affects pools created with min_nr=0. When a thread blocks waiting for memory from an empty pool (curr_nr == 0), subsequent mempool_free() calls fail to wake the waiting thread because the condition "curr_nr < min_nr" evaluates to "0 < 0" which is false, this can cause threads to sleep indefinitely according to the code logic. There is at least 2 places where the mempool created with min_nr=0: 1. lib/btree.c:191: mempool_create(0, btree_alloc, btree_free, NULL) 2. drivers/md/dm-verity-fec.c:791: mempool_init_slab_pool(&f->extra_pool, 0, f->cache) Add an explicit check in mempool_free() to handle the min_nr=0 case: when the pool has zero minimum reserves, is currently empty, and has active waiters, allocate the element then wake up the sleeper. Link: https://lkml.kernel.org/r/f28a81ba-615c-481e-86fb-c0bf4115ec89@suse.com Signed-off-by: Yadan Fan <ydfan@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24fs/proc/task_mmu: read proc/pid/maps under per-vma lockSuren Baghdasaryan
With maple_tree supporting vma tree traversal under RCU and per-vma locks, /proc/pid/maps can be read while holding individual vma locks instead of locking the entire address space. A completely lockless approach (walking vma tree under RCU) would be quite complex with the main issue being get_vma_name() using callbacks which might not work correctly with a stable vma copy, requiring original (unstable) vma - see special_mapping_name() for example. When per-vma lock acquisition fails, we take the mmap_lock for reading, lock the vma, release the mmap_lock and continue. This fallback to mmap read lock guarantees the reader to make forward progress even during lock contention. This will interfere with the writer but for a very short time while we are acquiring the per-vma lock and only when there was contention on the vma reader is interested in. We shouldn't see a repeated fallback to mmap read locks in practice, as this require a very unlikely series of lock contentions (for instance due to repeated vma split operations). However even if this did somehow happen, we would still progress. One case requiring special handling is when a vma changes between the time it was found and the time it got locked. A problematic case would be if a vma got shrunk so that its vm_start moved higher in the address space and a new vma was installed at the beginning: reader found: |--------VMA A--------| VMA is modified: |-VMA B-|----VMA A----| reader locks modified VMA A reader reports VMA A: | gap |----VMA A----| This would result in reporting a gap in the address space that does not exist. To prevent this we retry the lookup after locking the vma, however we do that only when we identify a gap and detect that the address space was changed after we found the vma. This change is designed to reduce mmap_lock contention and prevent a process reading /proc/pid/maps files (often a low priority task, such as monitoring/data collection services) from blocking address space updates. Note that this change has a userspace visible disadvantage: it allows for sub-page data tearing as opposed to the previous mechanism where data tearing could happen only between pages of generated output data. Since current userspace considers data tearing between pages to be acceptable, we assume is will be able to handle sub-page data tearing as well. Link: https://lkml.kernel.org/r/20250719182854.3166724-7-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeongjun Park <aha310510@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: T.J. Mercier <tjmercier@google.com> Cc: Ye Bin <yebin10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24fs/proc/task_mmu: remove conversion of seq_file position to unsignedSuren Baghdasaryan
Back in 2.6 era, last_addr used to be stored in seq_file->version variable, which was unsigned long. As a result, sentinels to represent gate vma and end of all vmas used unsigned values. In more recent kernels we don't used seq_file->version anymore and therefore conversion from loff_t into unsigned type is not needed. Similarly, sentinel values don't need to be unsigned. Remove type conversion for set_file position and change sentinel values to signed. While at it, change the hardcoded sentinel values with named definitions for better documentation. Link: https://lkml.kernel.org/r/20250719182854.3166724-6-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jann Horn <jannh@google.com> Cc: Jeongjun Park <aha310510@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: T.J. Mercier <tjmercier@google.com> Cc: Ye Bin <yebin10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24selftests/proc: add verbose mode for /proc/pid/maps tearing testsSuren Baghdasaryan
Add verbose mode to the /proc/pid/maps tearing tests to print debugging information. VERBOSE environment variable is used to enable it. Usage example: VERBOSE=1 ./proc-maps-race Link: https://lkml.kernel.org/r/20250719182854.3166724-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeongjun Park <aha310510@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: T.J. Mercier <tjmercier@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Ye Bin <yebin10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-24selftests/proc: extend /proc/pid/maps tearing test to include vma remappingSuren Baghdasaryan
Test that /proc/pid/maps does not report unexpected holes in the address space when we concurrently remap a part of a vma into the middle of another vma. This remapping results in the destination vma being split into three parts and the part in the middle being patched back from, all done concurrently from under the reader. We should always see either original vma or the split one with no holes. Link: https://lkml.kernel.org/r/20250719182854.3166724-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeongjun Park <aha310510@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: T.J. Mercier <tjmercier@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Ye Bin <yebin10@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>