summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2025-07-09gup: optimize longterm pin_user_pages() for large folioLi Zhe
In the current implementation of longterm pin_user_pages(), we invoke collect_longterm_unpinnable_folios(). This function iterates through the list to check whether each folio belongs to the "longterm_unpinnabled" category. The folios in this list essentially correspond to a contiguous region of userspace addresses, with each folio representing a physical address in increments of PAGESIZE. If this userspace address range is mapped with large folio, we can optimize the performance of function collect_longterm_unpinnable_folios() by reducing the using of READ_ONCE() invoked in pofs_get_folio()->page_folio()->_compound_head(). Also, we can simplify the logic of collect_longterm_unpinnable_folios(). Instead of comparing with prev_folio after calling pofs_get_folio(), we can check whether the next page is within the same folio. The performance test results, based on v6.15, obtained through the gup_test tool from the kernel source tree are as follows. We achieve an improvement of over 66% for large folio with pagesize=2M. For small folio, we have only observed a very slight degradation in performance. Without this patch: [root@localhost ~] ./gup_test -HL -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:14391 put:10858 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 [root@localhost ~]# ./gup_test -LT -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:130538 put:31676 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 With this patch: [root@localhost ~] ./gup_test -HL -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:4867 put:10516 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 [root@localhost ~]# ./gup_test -LT -m 8192 -n 512 TAP version 13 1..1 # PIN_LONGTERM_BENCHMARK: Time: get:131798 put:31328 us# ok 1 ioctl status 0 # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 [lizhe.67@bytedance.com: whitespace fix, per David] Link: https://lkml.kernel.org/r/20250606091917.91384-1-lizhe.67@bytedance.com Link: https://lkml.kernel.org/r/20250606023742.58344-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/pagewalk: split walk_page_range_novma() into kernel/user partsLorenzo Stoakes
walk_page_range_novma() is rather confusing - it supports two modes, one used often, the other used only for debugging. The first mode is the common case of traversal of kernel page tables, which is what nearly all callers use this for. Secondly it provides an unusual debugging interface that allows for the traversal of page tables in a userland range of memory even for that memory which is not described by a VMA. It is far from certain that such page tables should even exist, but perhaps this is precisely why it is useful as a debugging mechanism. As a result, this is utilised by ptdump only. Historically, things were reversed - ptdump was the only user, and other parts of the kernel evolved to use the kernel page table walking here. Since we have some complicated and confusing locking rules for the novma case, it makes sense to separate the two usages into their own functions. Doing this also provide self-documentation as to the intent of the caller - are they doing something rather unusual or are they simply doing a standard kernel page table walk? We therefore establish two separate functions - walk_page_range_debug() for this single usage, and walk_kernel_page_table_range() for general kernel page table walking. The walk_page_range_debug() function is currently used to traverse both userland and kernel mappings, so we maintain this and in the case of kernel mappings being traversed, we have walk_page_range_debug() invoke walk_kernel_page_table_range() internally. We additionally make walk_page_range_debug() internal to mm. Link: https://lkml.kernel.org/r/20250605135104.90720-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Barry Song <baohua@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/memfd: clarify error handling labels in memfd_create()Ye Liu
err_name --> err_free_name (fd failure case) err_fd --> err_free_fd (file failure case) Link: https://lkml.kernel.org/r/20250610083730.527619-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/filemap: allow arch to request folio size for exec memoryRyan Roberts
Change the readahead config so that if it is being requested for an executable mapping, do a synchronous read into a set of folios with an arch-specified order and in a naturally aligned manner. We no longer center the read on the faulting page but simply align it down to the previous natural boundary. Additionally, we don't bother with an asynchronous part. On arm64 if memory is physically contiguous and naturally aligned to the "contpte" size, we can use contpte mappings, which improves utilization of the TLB. When paired with the "multi-size THP" feature, this works well to reduce dTLB pressure. However iTLB pressure is still high due to executable mappings having a low likelihood of being in the required folio size and mapping alignment, even when the filesystem supports readahead into large folios (e.g. XFS). The reason for the low likelihood is that the current readahead algorithm starts with an order-0 folio and increases the folio order by 2 every time the readahead mark is hit. But most executable memory tends to be accessed randomly and so the readahead mark is rarely hit and most executable folios remain order-0. So let's special-case the read(ahead) logic for executable mappings. The trade-off is performance improvement (due to more efficient storage of the translations in iTLB) vs potential for making reclaim more difficult (due to the folios being larger so if a part of the folio is hot the whole thing is considered hot). But executable memory is a small portion of the overall system memory so I doubt this will even register from a reclaim perspective. I've chosen 64K folio size for arm64 which benefits both the 4K and 16K base page size configs. Crucially the same amount of data is still read (usually 128K) so I'm not expecting any read amplification issues. I don't anticipate any write amplification because text is always RO. Note that the text region of an ELF file could be populated into the page cache for other reasons than taking a fault in a mmapped area. The most common case is due to the loader read()ing the header which can be shared with the beginning of text. So some text will still remain in small folios, but this simple, best effort change provides good performance improvements as is. Confine this special-case approach to the bounds of the VMA. This prevents wasting memory for any padding that might exist in the file between sections. Previously the padding would have been contained in order-0 folios and would be easy to reclaim. But now it would be part of a larger folio so more difficult to reclaim. Solve this by simply not reading it into memory in the first place. Benchmarking ============ The below shows pgbench and redis benchmarks on Graviton3 arm64 system. First, confirmation that this patch causes more text to be contained in 64K folios: +----------------------+---------------+---------------+---------------+ | File-backed folios by| system boot | pgbench | redis | | size as percentage of+-------+-------+-------+-------+-------+-------+ | all mapped text mem |before | after |before | after |before | after | +======================+=======+=======+=======+=======+=======+=======+ | base-page-4kB | 78% | 30% | 78% | 11% | 73% | 14% | | thp-aligned-8kB | 1% | 0% | 0% | 0% | 1% | 0% | | thp-aligned-16kB | 17% | 4% | 17% | 3% | 20% | 4% | | thp-aligned-32kB | 1% | 1% | 1% | 2% | 1% | 1% | | thp-aligned-64kB | 3% | 63% | 3% | 81% | 4% | 77% | | thp-aligned-128kB | 0% | 1% | 1% | 1% | 1% | 2% | | thp-unaligned-64kB | 0% | 0% | 0% | 1% | 0% | 1% | | thp-unaligned-128kB | 0% | 1% | 0% | 0% | 0% | 0% | | thp-partial | 0% | 0% | 0% | 1% | 0% | 1% | +----------------------+-------+-------+-------+-------+-------+-------+ | cont-aligned-64kB | 4% | 65% | 4% | 83% | 6% | 79% | +----------------------+-------+-------+-------+-------+-------+-------+ The above shows that for both workloads (each isolated with cgroups) as well as the general system state after boot, the amount of text backed by 4K and 16K folios reduces and the amount backed by 64K folios increases significantly. And the amount of text that is contpte-mapped significantly increases (see last row). And this is reflected in performance improvement. "(I)" indicates a statistically significant improvement. Note TPS and Reqs/sec are rates so bigger is better, ms is time so smaller is better: +-------------+-------------------------------------------+------------+ | Benchmark | Result Class | Improvemnt | +=============+===========================================+============+ | pts/pgbench | Scale: 1 Clients: 1 RO (TPS) | (I) 3.47% | | | Scale: 1 Clients: 1 RO - Latency (ms) | -2.88% | | | Scale: 1 Clients: 250 RO (TPS) | (I) 5.02% | | | Scale: 1 Clients: 250 RO - Latency (ms) | (I) -4.79% | | | Scale: 1 Clients: 1000 RO (TPS) | (I) 6.16% | | | Scale: 1 Clients: 1000 RO - Latency (ms) | (I) -5.82% | | | Scale: 100 Clients: 1 RO (TPS) | 2.51% | | | Scale: 100 Clients: 1 RO - Latency (ms) | -3.51% | | | Scale: 100 Clients: 250 RO (TPS) | (I) 4.75% | | | Scale: 100 Clients: 250 RO - Latency (ms) | (I) -4.44% | | | Scale: 100 Clients: 1000 RO (TPS) | (I) 6.34% | | | Scale: 100 Clients: 1000 RO - Latency (ms)| (I) -5.95% | +-------------+-------------------------------------------+------------+ | pts/redis | Test: GET Connections: 50 (Reqs/sec) | (I) 3.20% | | | Test: GET Connections: 1000 (Reqs/sec) | (I) 2.55% | | | Test: LPOP Connections: 50 (Reqs/sec) | (I) 4.59% | | | Test: LPOP Connections: 1000 (Reqs/sec) | (I) 4.81% | | | Test: LPUSH Connections: 50 (Reqs/sec) | (I) 5.31% | | | Test: LPUSH Connections: 1000 (Reqs/sec) | (I) 4.36% | | | Test: SADD Connections: 50 (Reqs/sec) | (I) 2.64% | | | Test: SADD Connections: 1000 (Reqs/sec) | (I) 4.15% | | | Test: SET Connections: 50 (Reqs/sec) | (I) 3.11% | | | Test: SET Connections: 1000 (Reqs/sec) | (I) 3.36% | +-------------+-------------------------------------------+------------+ [ryan.roberts@arm.com: fix use-after-free] Link: https://lkml.kernel.org/r/ea7f9da7-9a9f-4b85-9d0a-35b320f5ed25@arm.com [ryan.roberts@arm.com: use the vma_pages() helper instead of open-coding] Link: https://lkml.kernel.org/r/0e0f674b-3b7e-494f-ae7a-fc9dbb98dad4@arm.com Link: https://lkml.kernel.org/r/20250609092729.274960-6-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Will Deacon <will@kernel.org> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/readahead: store folio order in struct file_ra_stateRyan Roberts
Previously the folio order of the previous readahead request was inferred from the folio who's readahead marker was hit. But due to the way we have to round to non-natural boundaries sometimes, this first folio in the readahead block is often smaller than the preferred order for that request. This means that for cases where the initial sync readahead is poorly aligned, the folio order will ramp up much more slowly. So instead, let's store the order in struct file_ra_state so we are not affected by any required alignment. We previously made enough room in the struct for a 16 order field. This should be plenty big enough since we are limited to MAX_PAGECACHE_ORDER anyway, which is certainly never larger than ~20. Since we now pass order in struct file_ra_state, page_cache_ra_order() no longer needs it's new_order parameter, so let's remove that. Worked example: Here we are touching pages 17-256 sequentially just as we did in the previous commit, but now that we are remembering the preferred order explicitly, we no longer have the slow ramp up problem. Note specifically that we no longer have 2 rounds (2x ~128K) of order-2 folios: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00050000 65536 64 80 16 4 FOLIO 0x00050000 0x00060000 65536 80 96 16 4 FOLIO 0x00060000 0x00080000 131072 96 128 32 5 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5 FOLIO 0x000e0000 0x00100000 131072 224 256 32 5 FOLIO 0x00100000 0x00120000 131072 256 288 32 5 FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y HOLE 0x00140000 0x00800000 7077888 320 2048 1728 Link: https://lkml.kernel.org/r/20250609092729.274960-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/readahead: make space in struct file_ra_stateRyan Roberts
We need to be able to store the preferred folio order associated with a readahead request in the struct file_ra_state so that we can more accurately increase the order across subsequent readahead requests. But struct file_ra_state is per-struct file, so we don't really want to increase it's size. mmap_miss is currently 32 bits but it is only counted up to 10 * MMAP_LOTSAMISS, which is currently defined as 1000. So 16 bits should be plenty. Redefine it to unsigned short, making room for order as unsigned short in follow up commit. Link: https://lkml.kernel.org/r/20250609092729.274960-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/readahead: terminate async readahead on natural boundaryRyan Roberts
Previously asynchonous readahead would read ra_pages (usually 128K) directly after the end of the synchonous readahead and given the synchronous readahead portion had no alignment guarantees (beyond page boundaries) it is possible (and likely) that the end of the initial 128K region would not fall on a natural boundary for the folio size being used. Therefore smaller folios were used to align down to the required boundary, both at the end of the previous readahead block and at the start of the new one. In the worst cases, this can result in never properly ramping up the folio size, and instead getting stuck oscillating between order-0, -1 and -2 folios. The next readahead will try to use folios whose order is +2 bigger than the folio that had the readahead marker. But because of the alignment requirements, that folio (the first one in the readahead block) can end up being order-0 in some cases. There will be 2 modifications to solve this issue: 1) Calculate the readahead size so the end is aligned to a folio boundary. This prevents needing to allocate small folios to align down at the end of the window and fixes the oscillation problem. 2) Remember the "preferred folio order" in the ra state instead of inferring it from the folio with the readahead marker. This solves the slow ramp up problem (discussed in a subsequent patch). This patch addresses (1) only. A subsequent patch will address (2). Worked example: The following shows the previous pathalogical behaviour when the initial synchronous readahead is unaligned. We start reading at page 17 in the file and read sequentially from there. I'm showing a dump of the pages in the page cache just after we read the first page of the folio with the readahead marker. Initially there are no pages in the page cache: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00800000 8388608 0 2048 2048 Then we access page 17, causing synchonous read-around of 128K with a readahead marker set up at page 25. So far, all as expected: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 Y FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 HOLE 0x00021000 0x00800000 8253440 33 2048 2015 Now access pages 18-25 inclusive. This causes an asynchronous 128K readahead starting at page 33. But since we are unaligned, even though the preferred folio order is 2, the first folio in this batch (the one with the new readahead marker) is order-0: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 Y FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00041000 4096 64 65 1 0 HOLE 0x00041000 0x00800000 8122368 65 2048 1983 Which means that when we now read pages 26-33 and readahead is kicked off again, the new preferred order is 2 (0 + 2), not 4 as we intended: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00041000 4096 64 65 1 0 FOLIO 0x00041000 0x00042000 4096 65 66 1 0 Y FOLIO 0x00042000 0x00044000 8192 66 68 2 1 FOLIO 0x00044000 0x00048000 16384 68 72 4 2 FOLIO 0x00048000 0x0004c000 16384 72 76 4 2 FOLIO 0x0004c000 0x00050000 16384 76 80 4 2 FOLIO 0x00050000 0x00054000 16384 80 84 4 2 FOLIO 0x00054000 0x00058000 16384 84 88 4 2 FOLIO 0x00058000 0x0005c000 16384 88 92 4 2 FOLIO 0x0005c000 0x00060000 16384 92 96 4 2 FOLIO 0x00060000 0x00061000 4096 96 97 1 0 HOLE 0x00061000 0x00800000 7991296 97 2048 1951 This ramp up from order-0 with smaller orders at the edges for alignment cycle continues all the way to the end of the file (not shown). After the change, we round down the end boundary to the order boundary so we no longer get stuck in the cycle and can ramp up the order over time. Note that the rate of the ramp up is still not as we would expect it. We will fix that next. Here we are touching pages 17-256 sequentially: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER RA ----- ---------- ---------- ---------- ------- ------- ----- ----- -- HOLE 0x00000000 0x00001000 4096 0 1 1 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00021000 4096 32 33 1 0 FOLIO 0x00021000 0x00022000 4096 33 34 1 0 FOLIO 0x00022000 0x00024000 8192 34 36 2 1 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00044000 16384 64 68 4 2 FOLIO 0x00044000 0x00048000 16384 68 72 4 2 FOLIO 0x00048000 0x0004c000 16384 72 76 4 2 FOLIO 0x0004c000 0x00050000 16384 76 80 4 2 FOLIO 0x00050000 0x00054000 16384 80 84 4 2 FOLIO 0x00054000 0x00058000 16384 84 88 4 2 FOLIO 0x00058000 0x0005c000 16384 88 92 4 2 FOLIO 0x0005c000 0x00060000 16384 92 96 4 2 FOLIO 0x00060000 0x00070000 65536 96 112 16 4 FOLIO 0x00070000 0x00080000 65536 112 128 16 4 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5 FOLIO 0x000e0000 0x00100000 131072 224 256 32 5 FOLIO 0x00100000 0x00120000 131072 256 288 32 5 FOLIO 0x00120000 0x00140000 131072 288 320 32 5 Y HOLE 0x00140000 0x00800000 7077888 320 2048 1728 Link: https://lkml.kernel.org/r/20250609092729.274960-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/readahead: honour new_order in page_cache_ra_order()Ryan Roberts
Patch series "Readahead tweaks for larger folios", v5. This series adds some tweaks to readahead so that it does a better job of ramping up folio sizes as readahead extends further into the file. And it additionally special-cases executable mappings to allow the arch to request a preferred folio size for text. This patch (of 5): page_cache_ra_order() takes a parameter called new_order, which is intended to express the preferred order of the folios that will be allocated for the readahead operation. Most callers indeed call this with their preferred new order. But page_cache_async_ra() calls it with the preferred order of the previous readahead request (actually the order of the folio that had the readahead marker, which may be smaller when alignment comes into play). And despite the parameter name, page_cache_ra_order() always treats it at the old order, adding 2 to it on entry. As a result, a cold readahead always starts with order-2 folios. Let's fix this behaviour by always passing in the *new* order. Worked example: Prior to the change, mmaping an 8MB file and touching each page sequentially, resulted in the following, where we start with order-2 folios for the first 128K then ramp up to order-4 for the next 128K, then get clamped to order-5 for the rest of the file because pa_pages is limited to 128K: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER ----- ---------- ---------- --------- ------- ------- ----- ----- FOLIO 0x00000000 0x00004000 16384 0 4 4 2 FOLIO 0x00004000 0x00008000 16384 4 8 4 2 FOLIO 0x00008000 0x0000c000 16384 8 12 4 2 FOLIO 0x0000c000 0x00010000 16384 12 16 4 2 FOLIO 0x00010000 0x00014000 16384 16 20 4 2 FOLIO 0x00014000 0x00018000 16384 20 24 4 2 FOLIO 0x00018000 0x0001c000 16384 24 28 4 2 FOLIO 0x0001c000 0x00020000 16384 28 32 4 2 FOLIO 0x00020000 0x00030000 65536 32 48 16 4 FOLIO 0x00030000 0x00040000 65536 48 64 16 4 FOLIO 0x00040000 0x00060000 131072 64 96 32 5 FOLIO 0x00060000 0x00080000 131072 96 128 32 5 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 ... After the change, the same operation results in the first 128K being order-0, then we start ramping up to order-2, -4, and finally get clamped at order-5: TYPE STARTOFFS ENDOFFS SIZE STARTPG ENDPG NRPG ORDER ----- ---------- ---------- --------- ------- ------- ----- ----- FOLIO 0x00000000 0x00001000 4096 0 1 1 0 FOLIO 0x00001000 0x00002000 4096 1 2 1 0 FOLIO 0x00002000 0x00003000 4096 2 3 1 0 FOLIO 0x00003000 0x00004000 4096 3 4 1 0 FOLIO 0x00004000 0x00005000 4096 4 5 1 0 FOLIO 0x00005000 0x00006000 4096 5 6 1 0 FOLIO 0x00006000 0x00007000 4096 6 7 1 0 FOLIO 0x00007000 0x00008000 4096 7 8 1 0 FOLIO 0x00008000 0x00009000 4096 8 9 1 0 FOLIO 0x00009000 0x0000a000 4096 9 10 1 0 FOLIO 0x0000a000 0x0000b000 4096 10 11 1 0 FOLIO 0x0000b000 0x0000c000 4096 11 12 1 0 FOLIO 0x0000c000 0x0000d000 4096 12 13 1 0 FOLIO 0x0000d000 0x0000e000 4096 13 14 1 0 FOLIO 0x0000e000 0x0000f000 4096 14 15 1 0 FOLIO 0x0000f000 0x00010000 4096 15 16 1 0 FOLIO 0x00010000 0x00011000 4096 16 17 1 0 FOLIO 0x00011000 0x00012000 4096 17 18 1 0 FOLIO 0x00012000 0x00013000 4096 18 19 1 0 FOLIO 0x00013000 0x00014000 4096 19 20 1 0 FOLIO 0x00014000 0x00015000 4096 20 21 1 0 FOLIO 0x00015000 0x00016000 4096 21 22 1 0 FOLIO 0x00016000 0x00017000 4096 22 23 1 0 FOLIO 0x00017000 0x00018000 4096 23 24 1 0 FOLIO 0x00018000 0x00019000 4096 24 25 1 0 FOLIO 0x00019000 0x0001a000 4096 25 26 1 0 FOLIO 0x0001a000 0x0001b000 4096 26 27 1 0 FOLIO 0x0001b000 0x0001c000 4096 27 28 1 0 FOLIO 0x0001c000 0x0001d000 4096 28 29 1 0 FOLIO 0x0001d000 0x0001e000 4096 29 30 1 0 FOLIO 0x0001e000 0x0001f000 4096 30 31 1 0 FOLIO 0x0001f000 0x00020000 4096 31 32 1 0 FOLIO 0x00020000 0x00024000 16384 32 36 4 2 FOLIO 0x00024000 0x00028000 16384 36 40 4 2 FOLIO 0x00028000 0x0002c000 16384 40 44 4 2 FOLIO 0x0002c000 0x00030000 16384 44 48 4 2 FOLIO 0x00030000 0x00034000 16384 48 52 4 2 FOLIO 0x00034000 0x00038000 16384 52 56 4 2 FOLIO 0x00038000 0x0003c000 16384 56 60 4 2 FOLIO 0x0003c000 0x00040000 16384 60 64 4 2 FOLIO 0x00040000 0x00050000 65536 64 80 16 4 FOLIO 0x00050000 0x00060000 65536 80 96 16 4 FOLIO 0x00060000 0x00080000 131072 96 128 32 5 FOLIO 0x00080000 0x000a0000 131072 128 160 32 5 FOLIO 0x000a0000 0x000c0000 131072 160 192 32 5 FOLIO 0x000c0000 0x000e0000 131072 192 224 32 5 ... Link: https://lkml.kernel.org/r/20250609092729.274960-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20250609092729.274960-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/mempolicy: skip unnecessary synchronize_rcu()Joshua Hahn
By unconditionally setting wi_state to NULL and conditionally calling synchronize_rcu(), we can save an unncessary call when there is no old_wi_state. Link: https://lkml.kernel.org/r/20250602162345.2595696-2-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: kernel test robot <lkp@intel.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: use per_vma lock for MADV_DONTNEEDBarry Song
Certain madvise operations, especially MADV_DONTNEED, occur far more frequently than other madvise options, particularly in native and Java heaps for dynamic memory management. Currently, the mmap_lock is always held during these operations, even when unnecessary. This causes lock contention and can lead to severe priority inversion, where low-priority threads—such as Android's HeapTaskDaemon— hold the lock and block higher-priority threads. This patch enables the use of per-VMA locks when the advised range lies entirely within a single VMA, avoiding the need for full VMA traversal. In practice, userspace heaps rarely issue MADV_DONTNEED across multiple VMAs. Tangquan's testing shows that over 99.5% of memory reclaimed by Android benefits from this per-VMA lock optimization. After extended runtime, 217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while only 1,231 fell back to mmap_lock. To simplify handling, the implementation falls back to the standard mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of userfaultfd_remove(). Many thanks to Lorenzo's work[1] on "mm/madvise: support VMA read locks for MADV_DONTNEED[_LOCKED]" Then use this mechanism to permit VMA locking to be done later in the madvise() logic and also to allow altering of the locking mode to permit falling back to an mmap read lock if required." One important point, as pointed out by Jann[2], is that untagged_addr_remote() requires holding mmap_lock. This is because address tagging on x86 and RISC-V is quite complex. Until untagged_addr_remote() becomes atomic—which seems unlikely in the near future—we cannot support per-VMA locks for remote processes. So for now, only local processes are supported. Lance said: : Just to put some numbers on it, I ran a micro-benchmark with 100 : parallel threads, where each thread calls madvise() on its own 1GiB : chunk of 64KiB mTHP-backed memory. The performance gain is huge: : : 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s : (~47% faster) : : 2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64% : faster) [lorenzo.stoakes@oracle.com: avoid any chance of uninitialised pointer deref] Link: https://lkml.kernel.org/r/309d22ca-6cd9-4601-8402-d441a07d9443@lucifer.local Link: https://lore.kernel.org/all/0b96ce61-a52c-4036-b5b6-5c50783db51f@lucifer.local/ [1] Link: https://lore.kernel.org/all/CAG48ez11zi-1jicHUZtLhyoNPGGVB+ROeAJCUw48bsjk4bbEkA@mail.gmail.com/ [2] Link: https://lkml.kernel.org/r/20250607220150.2980-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Lance Yang <ioworker0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09userfaultfd: remove (VM_)BUG_ON()sTal Zussman
BUG_ON() is deprecated [1]. Convert all the BUG_ON()s and VM_BUG_ON()s to use VM_WARN_ON_ONCE(). There are a few additional cases that are converted or modified: - Convert the printk(KERN_WARNING ...) in handle_userfault() to use pr_warn(). - Convert the WARN_ON_ONCE()s in move_pages() to use VM_WARN_ON_ONCE(), as the relevant conditions are already checked in validate_range() in move_pages()'s caller. - Convert the VM_WARN_ON()'s in move_pages() to VM_WARN_ON_ONCE(). These cases should never happen and are similar to those in mfill_atomic() and mfill_atomic_hugetlb(), which were previously BUG_ON()s. move_pages() was added later than those functions and makes use of VM_WARN_ON() as a replacement for the deprecated BUG_ON(), but. VM_WARN_ON_ONCE() is likely a better direct replacement. - Convert the WARN_ON() for !VM_MAYWRITE in userfaultfd_unregister() and userfaultfd_register_range() to VM_WARN_ON_ONCE(). This condition is enforced in userfaultfd_register() so it should never happen, and can be converted to a debug check. [1] https://www.kernel.org/doc/html/v6.15/process/coding-style.html#use-warn-rather-than-bug Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-3-a7274d3bd5e4@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09drivers/base/node: rename __register_one_node() to register_one_node()Donet Tom
The register_one_node() function was a simple wrapper around __register_one_node(). To simplify the code, register_one_node() has been removed, and __register_one_node() has been renamed to register_one_node(). Link: https://lkml.kernel.org/r/8262cd0f44eeb048a1fcd3ac8382760d7f7dea60.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09drivers/base/node: rename register_memory_blocks_under_node() and remove ↵Donet Tom
context argument The function register_memory_blocks_under_node() is now only called from the memory hotplug path, as register_memory_blocks_under_node_early() handles registration during early boot. Therefore, the context argument used to differentiate between early boot and hotplug is no longer needed and was removed. Since the function is only called from the hotplug path, we renamed register_memory_blocks_under_node() to register_memory_blocks_under_node_hotplug() Link: https://lkml.kernel.org/r/907c22292b0ee4975107876efc875c75c11badd9.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom <donettom@linux.ibm.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09readahead: fix return value of page_cache_next_miss() when no hole is foundChi Zhiling
max_scan in page_cache_next_miss always decreases to zero when no hole is found, causing the return value to be index + 0. Fix this by preserving the max_scan value throughout the loop. Jan said "From what I know and have seen in the past, wrong responses from page_cache_next_miss() can lead to readahead window reduction and thus reduced read speeds." Link: https://lkml.kernel.org/r/20250605054935.2323451-1-chizhiling@163.com Fixes: 901a269ff3d5 ("filemap: fix page_cache_next_miss() when no hole found") Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/cma: pair the trace_cma_alloc_start/finishRichard Chang
In the bad input validation cases, there is no trace_cma_alloc_finish to match the trace_cma_alloc_start. Move the trace_cma_alloc_start event after the validations. Link: https://lkml.kernel.org/r/20250605072532.972081-1-richardycc@google.com Signed-off-by: Richard Chang <richardycc@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: madvise: use walk_page_range_vma() instead of walk_page_range()Barry Song
We've already found the VMA within madvise_walk_vmas() before calling specific madvise behavior functions like madvise_free_single_vma(). So calling walk_page_range() and doing find_vma() again seems unnecessary. It also prevents potential optimizations in those madvise callbacks, particularly the use of dedicated per-VMA locking. [v-songbaohua@oppo.com: revert the walk_page_range_vma change for MADV_GUARD_INSTALL] Link: https://lkml.kernel.org/r/20250609105513.10901-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20250605083144.43046-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: stop passing a writeback_control structure to swap_writeoutChristoph Hellwig
swap_writeout only needs the swap_iocb cookie from the writeback_control structure, so pass it explicitly. Link: https://lkml.kernel.org/r/20250610054959.2057526-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: stop passing a writeback_control structure to __swap_writepageChristoph Hellwig
__swap_writepage only needs the swap_iocb cookie from the writeback_control structure, so pass it explicitly and remove the now unused swap_iocb member from struct writeback_control. Link: https://lkml.kernel.org/r/20250610054959.2057526-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: tidy up swap_writeoutChristoph Hellwig
Use a goto label to consolidate the unlock folio and return pattern and don't bother with an else after a return / goto. Link: https://lkml.kernel.org/r/20250610054959.2057526-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: stop passing a writeback_control structure to shmem_writeoutChristoph Hellwig
shmem_writeout only needs the swap_iocb cookie and the split folio list. Pass those explicitly and remove the now unused list member from struct writeback_control. Link: https://lkml.kernel.org/r/20250610054959.2057526-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: split out a writeout helper from pageoutChristoph Hellwig
Patch series "stop passing a writeback_control to swap/shmem writeout", v3. This series was intended to remove the last remaining users of AOP_WRITEPAGE_ACTIVATE after my other pending patches removed the rest, but spectacularly failed at that. But instead it nicely improves the code, and removes two pointers from struct writeback_control. This patch (of 6): Move the code to write back swap / shmem folios into a self-contained helper to keep prepare for refactoring it. Link: https://lkml.kernel.org/r/20250610054959.2057526-1-hch@lst.de Link: https://lkml.kernel.org/r/20250610054959.2057526-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Baolin Wang <baolin.wang@linux.aibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm, list_lru: refactor the locking codeKairui Song
Cocci is confused by the try lock then release RCU and return logic here. So separate the try lock part out into a standalone helper. The code is easier to follow too. No feature change, fixes: cocci warnings: (new ones prefixed by >>) >> mm/list_lru.c:82:3-9: preceding lock on line 77 >> mm/list_lru.c:82:3-9: preceding lock on line 77 mm/list_lru.c:82:3-9: preceding lock on line 75 mm/list_lru.c:82:3-9: preceding lock on line 75 Link: https://lkml.kernel.org/r/20250526180638.14609-1-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Julia Lawall <julia.lawall@inria.fr> Closes: https://lore.kernel.org/r/202505252043.pbT1tBHJ-lkp@intel.com/ Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: rename CONFIG_PAGE_BLOCK_ORDER to CONFIG_PAGE_BLOCK_MAX_ORDERZi Yan
The config is in fact an additional upper limit of pageblock_order, so rename it to avoid confusion. Link: https://lkml.kernel.org/r/20250604211427.1590859-1-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Juan Yescas <jyescas@google.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/gup: remove (VM_)BUG_ONsDavid Hildenbrand
Especially once we hit one of the assertions in sanity_check_pinned_pages(), observing follow-up assertions failing in other code can give good clues about what went wrong, so use VM_WARN_ON_ONCE instead. While at it, let's just convert all VM_BUG_ON to VM_WARN_ON_ONCE as well. Add one comment for the pfn_valid() check. We have to introduce VM_WARN_ON_ONCE_VMA() to make that fly. Drop the BUG_ON after mmap_read_lock_killable(), if that ever returns something > 0 we're in bigger trouble. Convert the other BUG_ON's into VM_WARN_ON_ONCE as well, they are in a similar domain "should never happen", but more reasonable to check for during early testing. [david@redhat.com: use the _FOLIO variant where possible, per Lorenzo] Link: https://lkml.kernel.org/r/844bd929-a551-48e3-a12e-285cd65ba580@redhat.com Link: https://lkml.kernel.org/r/20250604140544.688711-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon/stat: calculate and expose idle time percentilesSeongJae Park
Knowing how much memory is how cold can be useful for understanding coldness and utilization efficiency of memory. The raw form of DAMON's monitoring results has the information. Convert the raw results into the per-byte idle time distributions and expose it as percentiles metric to users, as a read-only DAMON_STAT parameter. In detail, the metrics are calculated as follows. First, DAMON's per-region access frequency and age information is converted into per-byte idle time. If access frequency of a region is higher than zero, every byte of the region has zero idle time. If the access frequency of a region is zero, every byte of the region has idle time as the age of the region. Then the logic sorts the per-byte idle times and provides the value at 0/100, 1/100, ..., 99/100 and 100/100 location of the sorted array. The metric can be easily aggregated and compared on large scale production systems. For example, if an average of 75-th percentile idle time of machines that collected on similar time is two minutes, it means the system's 25 percent memory is not accessed at all for two minutes or more on average. If a workload considers two minutes as unit work time, we can conclude its working set size is only 75 percent of the memory. If the system utilizes proactive reclamation and it supports coldness-based thresholds like DAMON_RECLAIM, the idle time percentiles can be used to find a more safe or aggressive coldness threshold for aimed memory saving. Link: https://lkml.kernel.org/r/20250604183127.13968-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon/stat: calculate and expose estimated memory bandwidthSeongJae Park
The raw form of DAMON's monitoring results captures many details of the information. However, not every bit of the information is always required for understanding practical access patterns. Especially on real world production systems of high scale time and size, the raw form is difficult to be aggregated and compared. Convert the raw monitoring results into a single number metric, namely estimated memory bandwidth and expose it to users as a read-only DAMON_STAT parameter. The metric represents access intensiveness (hotness) of the system. It can easily be aggregated and compared for high level understanding of the access pattern on large systems. Link: https://lkml.kernel.org/r/20250604183127.13968-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon: introduce DAMON_STAT moduleSeongJae Park
Patch series "mm/damon: introduce DAMON_STAT for simple and practical access monitoring", v2. DAMON-based access monitoring is not simple due to required DAMON control and results visualizations. Introduce a static kernel module for making it simple. The module can be enabled without manual setup and provides access pattern metrics that easy to fetch and understand the practical access pattern information, namely estimated memory bandwidth and memory idle time percentiles. Background and Problems ======================= DAMON can be used for monitoring data access patterns of the system and workloads. Specifically, users can start DAMON to monitor access events on specific address space with fine controls including address ranges to monitor and time intervals between samplings and aggregations. The resulting access information snapshot contains access frequency (nr_accesses) and how long the frequency was kept (age) for each byte. The monitoring usage is not simple and practical enough for production usage. Users should first start DAMON with a number of parameters, and wait until DAMON's monitoring results capture a reasonable amount of the time data (age). In production, such manual start and wait is impractical to capture useful information from a high number of machines in a timely manner. The monitoring result is also too detailed to be used on production environments. The raw results are hard to be aggregated and/or compared for production environments having a large scale of time, space and machines fleet. Users have to implement and use their own automation of DAMON control and results processing. It is repetitive and challenging since there is no good reference or guideline for such automation. Solution: DAMON_STAT ==================== Implement such automation in kernel space as a static kernel module, namely DAMON_STAT. It can be enabled at build, boot, or run time via its build configuration or module parameter. It monitors the entire physical address space with monitoring intervals that auto-tuned for a reasonable amount of access observations and minimum overhead. It converts the raw monitoring results into simpler metrics that can easily be aggregated and compared, namely estimated memory bandwidth and idle time percentiles. Understanding of the metrics and the user interface of DAMON_STAT is essential. Refer to the commit messages of the second and the third patches of this patch series for more details about the metrics. For the user interface, the standard module parameters system is used. Refer to the fourth patch of this patch series for details of the user interface. Discussions =========== The module aims to be useful on production environments constructed with a large number of machines that run a long time. The auto-tuned monitoring intervals ensure a reasonable quality of the outputs. The auto-tuning also ensures its overhead be reasonable and low enough to be enabled always on the production. The simplified monitoring results metrics can be useful for showing both coldness (idle time percentiles) and hotness (memory bandwidth) of the system's access pattern. We expect the information can be useful for assessing system memory utilization and inspiring optimizations or investigations on both kernel and user space memory management logics for large scale fleets. We hence expect the module is good enough to be just used in most environments. For special cases that require a custom access monitoring automation, users will still benefit by using DAMON_STAT as a reference or a guideline for their specialized automation. This patch (of 4): To use DAMON for monitoring access patterns of the system, users should manually start DAMON via DAMON sysfs ABI with a number of parameters for specifying the monitoring target address space, address ranges, and monitoring intervals. After that, users should also wait until desired amount of time data is captured into DAMON's monitoring results. It is bothersome and take a long time to be practical for access monitoring on large fleet level production environments. For access-aware system operations use cases like proactive cold memory reclamation, similar problems existed. We we solved those by introducing dedicated static kernel modules such as DAMON_RECLAIM. Implement such static kernel module for access monitoring, namely DAMON_STAT. It monitors the entire physical address space with auto-tuned monitoring intervals. The auto-tuning is set to capture 4 % of observable access events in each snapshot while keeping the sampling intervals 5 milliseconds in minimum and 10 seconds in maximum. From a few production environments, we confirmed this setup provides high quality monitoring results with minimum overheads. The module therefore receives only one user input, whether to enable or disable it. It can be set on build or boot time via build configuration or kernel boot command line. It can also be overridden at runtime. Note that this commit only implements the DAMON control part of the module. Users could get the monitoring results via damon:damon_aggregated tracepoint, but that's of course not the recommended way. Following commits will implement convenient and optimized ways for serving the monitoring results to users. [sj@kernel.org: use IS_ENABLED() for enabled initial value] Link: https://lkml.kernel.org/r/20250604205619.18929-1-sj@kernel.org [sj@kernel.org: reset enabled when DAMON start failed] Link: https://lkml.kernel.org/r/20250706184750.36588-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250604183127.13968-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250604183127.13968-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: Kconfig: use verb *use* in plural form in descriptionPaul Menzel
*workloads* is plural requiring the verb *use* in plural form. Link: https://lkml.kernel.org/r/20250603061303.479551-2-pmenzel@molgen.mpg.de Fixes: e13e7922d034 ("mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order") Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/hugetlb: convert hugetlb_change_protection() to foliosSidhartha Kumar
The for loop inside hugetlb_change_protection() increments by the huge page size: psize = huge_page_size(h); for (; address < end; address += psize) so we are operating on the head page of the huge pages between address and end. We can safely convert the struct page usage to struct folio. Link: https://lkml.kernel.org/r/20250528192013.91130-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: prevent KSM from breaking VMA merging for new VMAsLorenzo Stoakes
If a user wishes to enable KSM mergeability for an entire process and all fork/exec'd processes that come after it, they use the prctl() PR_SET_MEMORY_MERGE operation. This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set (in order to indicate they are KSM mergeable), as well as setting this flag for all existing VMAs and propagating this across fork/exec. However it also breaks VMA merging for new VMAs, both in the process and all forked (and fork/exec'd) child processes. This is because when a new mapping is proposed, the flags specified will never have VM_MERGEABLE set. However all adjacent VMAs will already have VM_MERGEABLE set, rendering VMAs unmergeable by default. To work around this, we try to set the VM_MERGEABLE flag prior to attempting a merge. In the case of brk() this can always be done. However on mmap() things are more complicated - while KSM is not supported for MAP_SHARED file-backed mappings, it is supported for MAP_PRIVATE file-backed mappings. These mappings may have deprecated .mmap() callbacks specified which could, in theory, adjust flags and thus KSM eligibility. So we check to determine whether this is possible. If not, we set VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the previous behaviour. This fixes VMA merging for all new anonymous mappings, which covers the majority of real-world cases, so we should see a significant improvement in VMA mergeability. For MAP_PRIVATE file-backed mappings, those which implement the .mmap_prepare() hook and shmem are both known to be safe, so we allow these, disallowing all other cases. Also add stubs for newly introduced function invocations to VMA userland testing. [lorenzo.stoakes@oracle.com: correctly invoke late KSM check after mmap hook] Link: https://lkml.kernel.org/r/5861f8f6-cf5a-4d82-a062-139fb3f9cddb@lucifer.local Link: https://lkml.kernel.org/r/3ba660af716d87a18ca5b4e635f2101edeb56340.1748537921.git.lorenzo.stoakes@oracle.com Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process") # please no backport! Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: ksm: refer to special VMAs via VM_SPECIAL in ksm_compatible()Lorenzo Stoakes
There's no need to spell out all the special cases, also doing it this way makes it absolutely clear that we preclude unmergeable VMAs in general, and puts the other excluded flags in stark and clear contrast. Link: https://lkml.kernel.org/r/c8be5b055163b164c8824020164076ee3b9389bd.1748537921.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: ksm: have KSM VMA checks not require a VMA pointerLorenzo Stoakes
Patch series "mm: ksm: prevent KSM from breaking merging of new VMAs", v3. When KSM-by-default is established using prctl(PR_SET_MEMORY_MERGE), this defaults all newly mapped VMAs to having VM_MERGEABLE set, and thus makes them available to KSM for samepage merging. It also sets VM_MERGEABLE in all existing VMAs. However this causes an issue upon mapping of new VMAs - the initial flags will never have VM_MERGEABLE set when attempting a merge with adjacent VMAs (this is set later in the mmap() logic), and adjacent VMAs will ALWAYS have VM_MERGEABLE set. This renders all newly mapped VMAs unmergeable. To avoid this, this series performs the check for PR_SET_MEMORY_MERGE far earlier in the mmap() logic, prior to the merge being attempted. However we run into complexity with the depreciated .mmap() callback - if a driver hooks this, it might change flags which adjust KSM merge eligibility. We have to worry about this because, while KSM is only applicable to private mappings, this includes both anonymous and MAP_PRIVATE-mapped file-backed mappings. This isn't a problem for brk(), where the VMA must be anonymous. However in mmap() we must be conservative - if the VMA is anonymous then we can always proceed, however if not, we permit only shmem mappings (whose .mmap hook does not affect KSM eligibility) and drivers which implement .mmap_prepare() (invoked prior to the KSM eligibility check). If we can't be sure of the driver changing things, then we maintain the same behaviour of performing the KSM check later in the mmap() logic (and thus losing new VMA mergeability). A great many use-cases for this logic will use anonymous mappings any rate, so this change should already cover the majority of actual KSM use-cases. This patch (of 4): In subsequent commits we are going to determine KSM eligibility prior to a VMA being constructed, at which point we will of course not yet have access to a VMA pointer. It is trivial to boil down the check logic to be parameterised on mm_struct, file and VMA flags, so do so. As a part of this change, additionally expose and use file_is_dax() to determine whether a file is being mapped under a DAX inode. Link: https://lkml.kernel.org/r/cover.1748537921.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/36ad13eb50cdbd8aac6dcfba22c65d5031667295.1748537921.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: vmscan: apply proportional reclaim pressure for memcg when MGLRU is enabledKoichiro Den
The scan implementation for MGLRU was missing proportional reclaim pressure for memcg, which contradicts the description in Documentation/admin-guide/cgroup-v2.rst (memory.{low,min} section). This issue can be observed in kselftest cgroup:test_memcontrol (specifically test_memcg_min and test_memcg_low). The following table shows the actual values observed in my local test env (on xfs) and the error "e", which is the symmetric absolute percentage error from the ideal values of 29M for c[0] and 21M for c[1]. test_memcg_min | MGLRU enabled | MGLRU enabled | MGLRU disabled | Without patch | With patch | -----|-----------------|-----------------|--------------- c[0] | 25964544 (e=8%) | 28770304 (e=3%) | 27820032 (e=4%) c[1] | 26214400 (e=9%) | 23998464 (e=4%) | 24776704 (e=6%) test_memcg_low | MGLRU enabled | MGLRU enabled | MGLRU disabled | Without patch | With patch | -----|-----------------|-----------------|--------------- c[0] | 26214400 (e=7%) | 27930624 (e=4%) | 27688960 (e=5%) c[1] | 26214400 (e=9%) | 24764416 (e=6%) | 24920064 (e=6%) Factor out the proportioning logic to a new function and have MGLRU reuse it. While at it, update the eviction behavior via debugfs 'lru_gen' interface ('-' command with an explicit 'nr_to_reclaim' parameter) to ensure eviction is limited to the specified number. Link: https://lkml.kernel.org/r/20250530162353.541882-1-den@valinux.co.jp Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Reviewed-by: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: restore documentation for __free_pages()Matthew Wilcox (Oracle)
The documentation was converted to be for ___free_pages(), which doesn't need documentation as it's static. Link: https://lkml.kernel.org/r/20250604190327.814086-1-willy@infradead.org Fixes: 8c57b687e833 (mm, bpf: Introduce free_pages_nolock()) Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09Revert "sched/numa: add statistics of numa balance task"Chen Yu
This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382. This commit introduces per-memcg/task NUMA balance statistics, but unfortunately it introduced a NULL pointer exception due to the following race condition: After a swap task candidate was chosen, its mm_struct pointer was set to NULL due to task exit. Later, when performing the actual task swapping, the p->mm caused the problem. CPU0 CPU1 : ... task_numa_migrate task_numa_find_cpu task_numa_compare # a normal task p is chosen env->best_task = p # p exit: exit_signals(p); p->flags |= PF_EXITING exit_mm p->mm = NULL; migrate_swap_stop __migrate_swap_task((arg->src_task, arg->dst_cpu) count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL task_lock() should be held and the PF_EXITING flag needs to be checked to prevent this from happening. After discussion, the conclusion was that adding a lock is not worthwhile for some statistics calculations. Revert the change and rely on the tracepoint for this purpose. Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Reported-by: Jirka Hladky <jhladky@redhat.com> Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/ Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com> Reported-by: Suneeth D <Suneeth.D@amd.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Libo Chen <libo.chen@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon: fix divide by zero in damon_get_intervals_score()Honggyu Kim
The current implementation allows having zero size regions with no special reasons, but damon_get_intervals_score() gets crashed by divide by zero when the region size is zero. [ 29.403950] Oops: divide error: 0000 [#1] SMP NOPTI This patch fixes the bug, but does not disallow zero size regions to keep the backward compatibility since disallowing zero size regions might be a breaking change for some users. In addition, the same crash can happen when intervals_goal.access_bp is zero so this should be fixed in stable trees as well. Link: https://lkml.kernel.org/r/20250702000205.1921-5-honggyu.kim@sk.com Fixes: f04b0fedbe71 ("mm/damon/core: implement intervals auto-tuning") Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09kasan: remove kasan_find_vm_area() to prevent possible deadlockYeoreum Yun
find_vm_area() couldn't be called in atomic_context. If find_vm_area() is called to reports vm area information, kasan can trigger deadlock like: CPU0 CPU1 vmalloc(); alloc_vmap_area(); spin_lock(&vn->busy.lock) spin_lock_bh(&some_lock); <interrupt occurs> <in softirq> spin_lock(&some_lock); <access invalid address> kasan_report(); print_report(); print_address_description(); kasan_find_vm_area(); find_vm_area(); spin_lock(&vn->busy.lock) // deadlock! To prevent possible deadlock while kasan reports, remove kasan_find_vm_area(). Link: https://lkml.kernel.org/r/20250703181018.580833-1-yeoreum.yun@arm.com Fixes: c056a364e954 ("kasan: print virtual mapping info in reports") Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Reported-by: Yunseong Kim <ysk@kzalloc.com> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/migrate: fix do_pages_stat in compat modeChristoph Berg
For arrays with more than 16 entries, the old code would incorrectly advance the pages pointer by 16 words instead of 16 compat_uptr_t. Fix by doing the pointer arithmetic inside get_compat_pages_array where pages32 is already a correctly-typed pointer. Discovered while working on PostgreSQL 18's new NUMA introspection code. Link: https://lkml.kernel.org/r/aGREU0XTB48w9CwN@msg.df7cb.de Fixes: 5b1b561ba73c ("mm: simplify compat_sys_move_pages") Signed-off-by: Christoph Berg <myon@debian.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reported-by: Tomas Vondra <tomas@vondra.me> Closes: https://www.postgresql.org/message-id/flat/6342f601-77de-4ee0-8c2a-3deb50ceac5b%40vondra.me#86402e3d80c031788f5f55b42c459471 Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/damon/core: handle damon_call_control as normal under kdmond deactivationSeongJae Park
DAMON sysfs interface internally uses damon_call() to update DAMON parameters as users requested, online. However, DAMON core cancels any damon_call() requests when it is deactivated by DAMOS watermarks. As a result, users cannot change DAMON parameters online while DAMON is deactivated. Note that users can turn DAMON off and on with different watermarks to work around. Since deactivated DAMON is nearly same to stopped DAMON, the work around should have no big problem. Anyway, a bug is a bug. There is no real good reason to cancel the damon_call() request under DAMOS deactivation. Fix it by simply handling the request as normal, rather than cancelling under the situation. Link: https://lkml.kernel.org/r/20250629204914.54114-1-sj@kernel.org Fixes: 42b7491af14c ("mm/damon/core: introduce damon_call()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/rmap: fix potential out-of-bounds page table access during batched unmapLance Yang
As pointed out by David[1], the batched unmap logic in try_to_unmap_one() may read past the end of a PTE table when a large folio's PTE mappings are not fully contained within a single page table. While this scenario might be rare, an issue triggerable from userspace must be fixed regardless of its likelihood. This patch fixes the out-of-bounds access by refactoring the logic into a new helper, folio_unmap_pte_batch(). The new helper correctly calculates the safe batch size by capping the scan at both the VMA and PMD boundaries. To simplify the code, it also supports partial batching (i.e., any number of pages from 1 up to the calculated safe maximum), as there is no strong reason to special-case for fully mapped folios. Link: https://lkml.kernel.org/r/20250701143100.6970-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250630011305.23754-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250627062319.84936-1-lance.yang@linux.dev Link: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com [1] Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation") Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com Suggested-by: Barry Song <baohua@kernel.org> Acked-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <huang.ying.caritas@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/hugetlb: don't crash when allocating a folio if there are no resvVivek Kasireddy
There are cases when we try to pin a folio but discover that it has not been faulted-in. So, we try to allocate it in memfd_alloc_folio() but there is a chance that we might encounter a fatal crash/failure (VM_BUG_ON(!h->resv_huge_pages) in alloc_hugetlb_folio_reserve()) if there are no active reservations at that instant. This issue was reported by syzbot: kernel BUG at mm/hugetlb.c:2403! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 UID: 0 PID: 5315 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00161-g63676eefb7a0 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:alloc_hugetlb_folio_reserve+0xbc/0xc0 mm/hugetlb.c:2403 Code: 1f eb 05 e8 56 18 a0 ff 48 c7 c7 40 56 61 8e e8 ba 21 cc 09 4c 89 f0 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 35 18 a0 ff 90 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f RSP: 0018:ffffc9000d3d77f8 EFLAGS: 00010087 RAX: ffffffff81ff6beb RBX: 0000000000000000 RCX: 0000000000100000 RDX: ffffc9000e51a000 RSI: 00000000000003ec RDI: 00000000000003ed RBP: 1ffffffff34810d9 R08: ffffffff81ff6ba3 R09: 1ffffd4000093005 R10: dffffc0000000000 R11: fffff94000093006 R12: dffffc0000000000 R13: dffffc0000000000 R14: ffffea0000498000 R15: ffffffff9a4086c8 FS: 00007f77ac12e6c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f77ab54b170 CR3: 0000000040b70000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> memfd_alloc_folio+0x1bd/0x370 mm/memfd.c:88 memfd_pin_folios+0xf10/0x1570 mm/gup.c:3750 udmabuf_pin_folios drivers/dma-buf/udmabuf.c:346 [inline] udmabuf_create+0x70e/0x10c0 drivers/dma-buf/udmabuf.c:443 udmabuf_ioctl_create drivers/dma-buf/udmabuf.c:495 [inline] udmabuf_ioctl+0x301/0x4e0 drivers/dma-buf/udmabuf.c:526 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Therefore, prevent the above crash by removing the VM_BUG_ON() as there is no need to crash the system in this situation and instead we could just fail the allocation request. Furthermore, as described above, the specific situation where this happens is when we try to pin memfd folios before they are faulted-in. Although, this is a valid thing to do, it is not the regular or the common use-case. Let us consider the following scenarios: 1) hugetlbfs_file_mmap() memfd_alloc_folio() hugetlb_fault() 2) memfd_alloc_folio() hugetlbfs_file_mmap() hugetlb_fault() 3) hugetlbfs_file_mmap() hugetlb_fault() alloc_hugetlb_folio() 3) is the most common use-case where first a memfd is allocated followed by mmap(), user writes/updates and then the relevant folios are pinned (memfd_pin_folios()). The BUG this patch is fixing occurs in 2) because we try to pin the folios before hugetlbfs_file_mmap() is called. So, in this situation we try to allocate the folios before pinning them but since we did not make any reservations, resv_huge_pages would be 0, leading to this issue. Link: https://lkml.kernel.org/r/20250626191116.1377761-1-vivek.kasireddy@intel.com Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak") Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Closes: https://syzkaller.appspot.com/bug?extid=a504cb5bae4fe117ba94 Closes: https://lore.kernel.org/all/677928b5.050a0220.3b53b0.004d.GAE@google.com/T/ Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm/vmalloc: leave lazy MMU mode on PTE mapping errorAlexander Gordeev
vmap_pages_pte_range() enters the lazy MMU mode, but fails to leave it in case an error is encountered. Link: https://lkml.kernel.org/r/20250623075721.2817094-1-agordeev@linux.ibm.com Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified") Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202506132017.T1l1l6ME-lkp@intel.com/ Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09vmscan: don't bother with debugfs_real_fops()Al Viro
... not when it's used only to check which file is used; debugfs_create_file_aux_num() allows to stash a number into debugfs entry and debugfs_get_aux_num() extracts it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Braino-spotted-by: Matthew Wilcox <willy@infradead.org> Link: https://lore.kernel.org/r/20250702211739.GE3406663@ZenIV Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-07secretmem: use SB_I_NOEXECChristian Brauner
Anonymous inodes may never ever be exectuable and the only way to enforce this is to raise SB_I_NOEXEC on the superblock which can never be unset. I've made the exec code yell at anyone who does not abide by this rule. For good measure also kill any pretense that device nodes are supported on the secretmem filesystem. > WARNING: fs/exec.c:119 at path_noexec+0x1af/0x200 fs/exec.c:118, CPU#1: syz-executor260/5835 > Modules linked in: > CPU: 1 UID: 0 PID: 5835 Comm: syz-executor260 Not tainted 6.16.0-rc4-next-20250703-syzkaller #0 PREEMPT(full) > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 > RIP: 0010:path_noexec+0x1af/0x200 fs/exec.c:118 > Code: 02 31 ff 48 89 de e8 f0 b1 89 ff d1 eb eb 07 e8 07 ad 89 ff b3 01 89 d8 5b 41 5e 41 5f 5d c3 cc cc cc cc cc e8 f2 ac 89 ff 90 <0f> 0b 90 e9 48 ff ff ff 44 89 f1 80 e1 07 80 c1 03 38 c1 0f 8c a6 > RSP: 0018:ffffc90003eefbd8 EFLAGS: 00010293 > RAX: ffffffff8235f22e RBX: ffff888072be0940 RCX: ffff88807763bc00 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 > RBP: 0000000000080000 R08: ffff88807763bc00 R09: 0000000000000003 > R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000011 > R13: 1ffff920007ddf90 R14: 0000000000000000 R15: dffffc0000000000 > FS: 000055556832d380(0000) GS:ffff888125d1e000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f21e34810d0 CR3: 00000000718a8000 CR4: 00000000003526f0 > Call Trace: > <TASK> > do_mmap+0xa43/0x10d0 mm/mmap.c:472 > vm_mmap_pgoff+0x31b/0x4c0 mm/util.c:579 > ksys_mmap_pgoff+0x51f/0x760 mm/mmap.c:607 > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7f21e340a9f9 > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 > RSP: 002b:00007ffd23ca3468 EFLAGS: 00000246 ORIG_RAX: 0000000000000009 > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f21e340a9f9 > RDX: 0000000000000000 RSI: 0000000000004000 RDI: 0000200000ff9000 > RBP: 00007f21e347d5f0 R08: 0000000000000003 R09: 0000000000000000 > R10: 0000000000000011 R11: 0000000000000246 R12: 0000000000000001 > R13: 431bde82d7b634db R14: 0000000000000001 R15: 0000000000000001 Link: https://lore.kernel.org/686ba948.a00a0220.c7b3.0080.GAE@google.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04Merge tag 'vfs-6.16-rc5.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix a regression caused by the anonymous inode rework. Making them regular files causes various places in the kernel to tip over starting with io_uring. Revert to the former status quo and port our assertion to be based on checking the inode so we don't lose the valuable VFS_*_ON_*() assertions that have already helped discover weird behavior our outright bugs. - Fix the the upper bound calculation in fuse_fill_write_pages() - Fix priority inversion issues in the eventpoll code - Make secretmen use anon_inode_make_secure_inode() to avoid bypassing the LSM layer - Fix a netfs hang due to missing case in final DIO read result collection - Fix a double put of the netfs_io_request struct - Provide some helpers to abstract out NETFS_RREQ_IN_PROGRESS flag wrangling - Fix infinite looping in netfs_wait_for_pause/request() - Fix a netfs ref leak on an extra subrequest inserted into a request's list of subreqs - Fix various cifs RPC callbacks to set NETFS_SREQ_NEED_RETRY if a subrequest fails retriably - Fix a cifs warning in the workqueue code when reconnecting a channel - Fix the updating of i_size in netfs to avoid a race between testing if we should have extended the file with a DIO write and changing i_size - Merge the places in netfs that update i_size on write - Fix coredump socket selftests * tag 'vfs-6.16-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: anon_inode: rework assertions netfs: Update tracepoints in a number of ways netfs: Renumber the NETFS_RREQ_* flags to make traces easier to read netfs: Merge i_size update functions netfs: Fix i_size updating smb: client: set missing retry flag in cifs_writev_callback() smb: client: set missing retry flag in cifs_readv_callback() smb: client: set missing retry flag in smb2_writev_callback() netfs: Fix ref leak on inserted extra subreq in write retry netfs: Fix looping in wait functions netfs: Provide helpers to perform NETFS_RREQ_IN_PROGRESS flag wangling netfs: Fix double put of request netfs: Fix hang due to missing case in final DIO read result collection eventpoll: Fix priority inversion problem fuse: fix fuse_fill_write_pages() upper bound calculation fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass selftests/coredump: Fix "socket_detect_userspace_client" test failure
2025-07-04tree-wide: s/struct fileattr/struct file_kattr/gChristian Brauner
Now that we expose struct file_attr as our uapi struct rename all the internal struct to struct file_kattr to clearly communicate that it is a kernel internal struct. This is similar to struct mount_{k}attr and others. Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01docs: dma-api: replace consistent with coherentPetr Tesarik
For consistency, always use the term "coherent" when talking about memory that is not subject to CPU caching effects. The term "consistent" is a relic of a long-removed PCI DMA API (pci_alloc_consistent() and pci_free_consistent() functions). Signed-off-by: Petr Tesarik <ptesarik@suse.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20250627101015.1600042-3-ptesarik@suse.com
2025-06-25mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on writeSeongJae Park
memcg_path_store() assigns a newly allocated memory buffer to filter->memcg_path, without deallocating the previously allocated and assigned memory buffer. As a result, users can leak kernel memory by continuously writing a data to memcg_path DAMOS sysfs file. Fix the leak by deallocating the previously set memory buffer. Link: https://lkml.kernel.org/r/20250619183608.6647-2-sj@kernel.org Fixes: 7ee161f18b5d ("mm/damon/sysfs-schemes: implement filter directory") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> [6.3.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-25mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the ↵Hao Ge
percpu variable tag->counters When loading a module, as long as the module has memory allocation operations, kmemleak produces a false positive report that resembles the following: unreferenced object (percpu) 0x7dfd232a1650 (size 16): comm "modprobe", pid 1301, jiffies 4294940249 hex dump (first 16 bytes on cpu 2): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 0): kmemleak_alloc_percpu+0xb4/0xd0 pcpu_alloc_noprof+0x700/0x1098 load_module+0xd4/0x348 codetag_module_init+0x20c/0x450 codetag_load_module+0x70/0xb8 load_module+0xef8/0x1608 init_module_from_file+0xec/0x158 idempotent_init_module+0x354/0x608 __arm64_sys_finit_module+0xbc/0x150 invoke_syscall+0xd4/0x258 el0_svc_common.constprop.0+0xb4/0x240 do_el0_svc+0x48/0x68 el0_svc+0x40/0xf8 el0t_64_sync_handler+0x10c/0x138 el0t_64_sync+0x1ac/0x1b0 This is because the module can only indirectly reference alloc_tag_counters through the alloc_tag section, which misleads kmemleak. However, we don't have a kmemleak ignore interface for percpu allocations yet. So let's create one and invoke it for tag->counters. [gehao@kylinos.cn: fix build error when CONFIG_DEBUG_KMEMLEAK=n, s/igonore/ignore/] Link: https://lkml.kernel.org/r/20250620093102.2416767-1-hao.ge@linux.dev Link: https://lkml.kernel.org/r/20250619183154.2122608-1-hao.ge@linux.dev Fixes: 12ca42c23775 ("alloc_tag: allocate percpu counters for module tags dynamically") Signed-off-by: Hao Ge <gehao@kylinos.cn> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Suren Baghdasaryan <surenb@google.com> [lib/alloc_tag.c] Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-25mm/hugetlb: remove unnecessary holding of hugetlb_lockGe Yang
In isolate_or_dissolve_huge_folio(), after acquiring the hugetlb_lock, it is only for the purpose of obtaining the correct hstate, which is then passed to alloc_and_dissolve_hugetlb_folio(). alloc_and_dissolve_hugetlb_folio() itself also acquires the hugetlb_lock. We can have alloc_and_dissolve_hugetlb_folio() obtain the hstate by itself, so that isolate_or_dissolve_huge_folio() no longer needs to acquire the hugetlb_lock. In addition, we keep the folio_test_hugetlb() check within isolate_or_dissolve_huge_folio(). By doing so, we can avoid disrupting the normal path by vainly holding the hugetlb_lock. replace_free_hugepage_folios() has the same issue, and we should address it as well. Addresses a possible performance problem which was added by the hotfix 113ed54ad276 ("mm/hugetlb: fix kernel NULL pointer dereference when replacing free hugetlb folios"). Link: https://lkml.kernel.org/r/1748317010-16272-1-git-send-email-yangge1116@126.com Fixes: 113ed54ad276 ("mm/hugetlb: fix kernel NULL pointer dereference when replacing free hugetlb folios") Signed-off-by: Ge Yang <yangge1116@126.com> Suggested-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>