summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-07-04memcg: mm_update_next_owner: kill the "retry" logicOleg Nesterov
Add the new helper, try_to_set_owner(), which tries to update mm->owner once we see c->mm == mm. This way mm_update_next_owner() doesn't need to restart the list_for_each_entry/for_each_process loops from the very beginning if it races with exit/exec, it can just continue. Unlike the current code, try_to_set_owner() re-checks tsk->mm == mm before it drops tasklist_lock, so it doesn't need get/put_task_struct(). Link: https://lkml.kernel.org/r/20240626152924.GA17933@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jinliang Zheng <alexjlzheng@tencent.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: put struct task_struct::in_user_fault under CONFIG_MEMCG_V1Roman Gushchin
The struct task_struct's in_user_fault member is not used by the cgroup v2's memory controller, so it can be put under the CONFIG_MEMCG_V1 config option. To do so, mem_cgroup_enter_user_fault() and mem_cgroup_exit_user_fault() are moved under the CONFIG_MEMCG_V1 option as well. Link: https://lkml.kernel.org/r/20240628210317.272856-10-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: put struct task_struct::memcg_in_oom under CONFIG_MEMCG_V1Roman Gushchin
The memcg_in_oom field of the struct task_struct is not used by the cgroup v2's memory controller, so it can be happily compiled out if CONFIG_MEMCG_V1 is not set. Link: https://lkml.kernel.org/r/20240628210317.272856-9-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_nodeRoman Gushchin
Put memcg1-specific members of struct mem_cgroup_per_node under the CONFIG_MEMCG_V1 config option. Link: https://lkml.kernel.org/r/20240628210317.272856-8-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: put memcg1-specific struct mem_cgroup's members under CONFIG_MEMCG_V1Roman Gushchin
Put memcg1-specific members of struct mem_cgroup under the CONFIG_MEMCG_V1 config option. Also group them close to the end of struct mem_cgroup just before the dynamic per-node part. Link: https://lkml.kernel.org/r/20240628210317.272856-7-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: guard memcg1-specific fields accesses in mm/memcontrol.cRoman Gushchin
There are only few memcg1-specific struct mem_cgroup's members accesses left in mm/memcontrol.c. Let's guard them with the CONFIG_MEMCG_V1 config option. Link: https://lkml.kernel.org/r/20240628210317.272856-6-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: gather memcg1-specific fields initialization in memcg1_memcg_init()Roman Gushchin
Gather all memcg1-specific struct mem_cgroup's members initialization in a new memcg1_memcg_init() function, defined in mm/memcontrol-v1.c. Obviously, if CONFIG_MEMCG_V1 is not set, there is no need to initialize these fields, so the function becomes trivial. Link: https://lkml.kernel.org/r/20240628210317.272856-5-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: guard cgroup v1-specific code in mem_cgroup_print_oom_meminfo()Roman Gushchin
Put cgroup v1-specific code in mem_cgroup_print_oom_meminfo() under CONFIG_MEMCG_V1. Link: https://lkml.kernel.org/r/20240628210317.272856-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: factor out legacy socket memory accounting codeRoman Gushchin
Move out the legacy cgroup v1 socket memory accounting code into mm/memcontrol-v1.c. This commit introduces three new functions: memcg1_tcpmem_active(), memcg1_charge_skmem() and memcg1_uncharge_skmem(), which contain all cgroup v1-specific code and become trivial if CONFIG_MEMCG_V1 isn't set. Note, that !!memcg->tcpmem_pressure check in mem_cgroup_under_socket_pressure() can't be easily moved into memcontrol-v1.h without including memcontrol-v1.h from memcontrol.h which isn't a good idea, so it's better to just #ifdef it. Link: https://lkml.kernel.org/r/20240628210317.272856-3-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: move memcg_account_kmem() to memcontrol-v1.cRoman Gushchin
Patch series "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1". This patchset puts all cgroup v1's members of struct mem_cgroup, struct mem_cgroup_per_node and struct task_struct under the CONFIG_MEMCG_V1 config option. If cgroup v1 support is not required (and it's true for many cgroup users these days), it allows to save a bit of memory and compile out some code, some of which is on relatively hot paths. It also structures the code a bit better by grouping cgroup v1-specific stuff in one place. This patch (of 9): memcg_account_kmem() consists of a trivial statistics change via mod_memcg_state() call and a relatively large memcg1-specific part. Let's factor out the mod_memcg_state() call and move the rest into the mm/memcontrol-v1.c file. Also rename memcg_account_kmem() into memcg1_account_kmem() for consistency. Link: https://lkml.kernel.org/r/20240628210317.272856-1-roman.gushchin@linux.dev Link: https://lkml.kernel.org/r/20240628210317.272856-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: add swappiness= arg to memory.reclaimDan Schatzberg
Allow proactive reclaimers to submit an additional swappiness=<val> argument to memory.reclaim. This overrides the global or per-memcg swappiness setting for that reclaim attempt. For example: echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim will perform reclaim on the rootcg with a swappiness setting of 0 (no swap) regardless of the vm.swappiness sysctl setting. Userspace proactive reclaimers use the memory.reclaim interface to trigger reclaim. The memory.reclaim interface does not allow for any way to effect the balance of file vs anon during proactive reclaim. The only approach is to adjust the vm.swappiness setting. However, there are a few reasons we look to control the balance of file vs anon during proactive reclaim, separately from reactive reclaim: * Swapout should be limited to manage SSD write endurance. In near-OOM situations we are fine with lots of swap-out to avoid OOMs. As these are typically rare events, they have relatively little impact on write endurance. However, proactive reclaim runs continuously and so its impact on SSD write endurance is more significant. Therefore it is desireable to control swap-out for proactive reclaim separately from reactive reclaim * Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap exhaustion. This makes sense if the swap exhaustion is triggered due to reactive reclaim but less so if it is triggered due to proactive reclaim (e.g. one could see OOMs when free memory is ample but anon is just particularly cold). Therefore, it's desireable to have proactive reclaim reduce or stop swap-out before the threshold at which OOM killing occurs. In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before writes to memory.reclaim[2]. This has been in production for nearly two years and has addressed our needs to control proactive vs reactive reclaim behavior but is still not ideal for a number of reasons: * vm.swappiness is a global setting, adjusting it can race/interfere with other system administration that wishes to control vm.swappiness. In our case, we need to disable Senpai before adjusting vm.swappiness. * vm.swappiness is stateful - so a crash or restart of Senpai can leave a misconfigured setting. This requires some additional management to record the "desired" setting and ensure Senpai always adjusts to it. With this patch, we avoid these downsides of adjusting vm.swappiness globally. [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598 Link: https://lkml.kernel.org/r/20240103164841.2800183-3-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Suggested-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Yue Zhao <findns94@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: add defines for min/max swappinessDan Schatzberg
Patch series "Add swappiness argument to memory.reclaim", v6. This patch proposes augmenting the memory.reclaim interface with a swappiness=<val> argument that overrides the swappiness value for that instance of proactive reclaim. Userspace proactive reclaimers use the memory.reclaim interface to trigger reclaim. The memory.reclaim interface does not allow for any way to effect the balance of file vs anon during proactive reclaim. The only approach is to adjust the vm.swappiness setting. However, there are a few reasons we look to control the balance of file vs anon during proactive reclaim, separately from reactive reclaim: * Swapout should be limited to manage SSD write endurance. In near-OOM situations we are fine with lots of swap-out to avoid OOMs. As these are typically rare events, they have relatively little impact on write endurance. However, proactive reclaim runs continuously and so its impact on SSD write endurance is more significant. Therefore it is desireable to control swap-out for proactive reclaim separately from reactive reclaim * Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap exhaustion. This makes sense if the swap exhaustion is triggered due to reactive reclaim but less so if it is triggered due to proactive reclaim (e.g. one could see OOMs when free memory is ample but anon is just particularly cold). Therefore, it's desireable to have proactive reclaim reduce or stop swap-out before the threshold at which OOM killing occurs. In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before writes to memory.reclaim[2]. This has been in production for nearly two years and has addressed our needs to control proactive vs reactive reclaim behavior but is still not ideal for a number of reasons: * vm.swappiness is a global setting, adjusting it can race/interfere with other system administration that wishes to control vm.swappiness. In our case, we need to disable Senpai before adjusting vm.swappiness. * vm.swappiness is stateful - so a crash or restart of Senpai can leave a misconfigured setting. This requires some additional management to record the "desired" setting and ensure Senpai always adjusts to it. With this patch, we avoid these downsides of adjusting vm.swappiness globally. Previously, this exact interface addition was proposed by Yosry[3]. In response, Roman proposed instead an interface to specify precise file/anon/slab reclaim amounts[4]. More recently Huan also proposed this as well[5] and others similarly questioned if this was the proper interface. Previous proposals sought to use this to allow proactive reclaimers to effectively perform a custom reclaim algorithm by issuing proactive reclaim with different settings to control file vs anon reclaim (e.g. to only reclaim anon from some applications). Responses argued that adjusting swappiness is a poor interface for custom reclaim. In contrast, I argue in favor of a swappiness setting not as a way to implement custom reclaim algorithms but rather to bias the balance of anon vs file due to differences of proactive vs reactive reclaim. In this context, swappiness is the existing interface for controlling this balance and this patch simply allows for it to be configured differently for proactive vs reactive reclaim. Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate for this prupose. Proactive reclaimers are un-aware of the relative age of file vs anon for a cgroup which makes it difficult to manage proactive reclaim of different memory pools. A proactive reclaimer would need some amount of anon reclaim attempts separate from the amount of file reclaim attempts which seems brittle given that it's difficult to observe the impact. [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598 [3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/ [4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/ [5]https://lore.kernel.org/lkml/20231108065818.19932-1-link@vivo.com/ This patch (of 2): We use the constants 0 and 200 in a few places in the mm code when referring to the min and max swappiness. This patch adds MIN_SWAPPINESS and MAX_SWAPPINESS #defines to improve clarity. There are no functional changes. Link: https://lkml.kernel.org/r/20240103164841.2800183-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20240103164841.2800183-2-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Yue Zhao <findns94@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04MAINTAINERS: add mm/memcontrol-v1.c/h to the list of maintained filesRoman Gushchin
Link: https://lkml.kernel.org/r/20240625005906.106920-15-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: put cgroup v1-specific code under a config optionRoman Gushchin
Put legacy cgroup v1 memory controller code under a new CONFIG_MEMCG_V1 config option. The option is turned off by default. Nobody except those who are still using cgroup v1 should turn it on. If the option is not set, memory controller can still be mounted under cgroup v1, but none of memcg-specific control files are present. Please note, that not all cgroup v1's memory controller code is guarded yet (but most of it), it's a subject for some follow-up work. Thanks to Michal Hocko for providing a better Kconfig option description. [roman.gushchin@linux.dev: better config option description provided by Michal] Link: https://lkml.kernel.org/r/ZnxXNtvqllc9CDoo@google.com Link: https://lkml.kernel.org/r/20240625005906.106920-14-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: group cgroup v1 memcg related declarationsRoman Gushchin
Group all cgroup v1-related declarations at the end of memcontrol.h and mm/memcontrol-v1.h with an intention to put them all together under a config option later on. It should make things easier to follow and maintain too. Link: https://lkml.kernel.org/r/20240625005906.106920-13-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: make memcg1_update_tree() staticRoman Gushchin
memcg1_update_tree() is not used outside of mm/memcontrol-v1.c anymore, define it as static and remove the declaration from the header file. Link: https://lkml.kernel.org/r/20240625005906.106920-12-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: move cgroup v1 interface files to memcontrol-v1.cRoman Gushchin
Move legacy cgroup v1 memory controller interfaces and corresponding code into memcontrol-v1.c. [roman.gushchin@linux.dev: move two functions] Link: https://lkml.kernel.org/r/20240704002712.2077812-1-roman.gushchin@linux.dev Link: https://lkml.kernel.org/r/20240625005906.106920-11-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: rename memcg_oom_recover()Roman Gushchin
Rename memcg_oom_recover() into memcg1_oom_recover() for consistency with other memory cgroup v1-related functions. Move the declaration in mm/memcontrol-v1.h to be nearby other memcg v1 oom handling functions. Link: https://lkml.kernel.org/r/20240625005906.106920-10-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.cRoman Gushchin
Cgroup v1 supports a complicated OOM handling in userspace mechanism, which is not supported by cgroup v2. Let's move the corresponding code into memcontrol-v1.c. Aside from mechanical code movement this patch introduces two new functions: memcg1_oom_prepare() and memcg1_oom_finish(). Those are implementing cgroup v1-specific parts of the common memcg OOM handling path. Link: https://lkml.kernel.org/r/20240625005906.106920-9-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: rename memcg_check_events()Roman Gushchin
Rename memcg_check_events() into memcg1_check_events() for consistency with other cgroup v1-specific functions. Link: https://lkml.kernel.org/r/20240625005906.106920-8-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: move legacy memcg event code into memcontrol-v1.cRoman Gushchin
Cgroup v1's memory controller contains a pretty complicated event notifications mechanism which is not used on cgroup v2. Let's move the corresponding code into memcontrol-v1.c. Please, note, that mem_cgroup_event_ratelimit() remains in memcontrol.c, otherwise it would require exporting too many details on memcg stats outside of memcontrol.c. Link: https://lkml.kernel.org/r/20240625005906.106920-7-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: rename charge move-related functionsRoman Gushchin
Rename exported function related to the charge move to have the memcg1_ prefix. Link: https://lkml.kernel.org/r/20240625005906.106920-6-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: move charge migration code to memcontrol-v1.cRoman Gushchin
Unlike the legacy cgroup v1 memory controller, cgroup v2 memory controller doesn't support moving charged pages between cgroups. It's a fairly large and complicated code which created a number of problems in the past. Let's move this code into memcontrol-v1.c. It shaves off 1k lines from memcontrol.c. It's also another step towards making the legacy memory controller code optionally compiled. Link: https://lkml.kernel.org/r/20240625005906.106920-5-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: rename soft limit reclaim-related functionsRoman Gushchin
Rename exported function related to the softlimit reclaim to have memcg1_ prefix. Link: https://lkml.kernel.org/r/20240625005906.106920-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: move soft limit reclaim code to memcontrol-v1.cRoman Gushchin
Soft limits are cgroup v1-specific and are not supported by cgroup v2, so let's move the corresponding code into memcontrol-v1.c. Aside from simple moving the code, this commits introduces a trivial memcg1_soft_limit_reset() function to reset soft limits and also moves the global soft limit tree initialization code into a new memcg1_init() function. It also moves corresponding declarations shared between memcontrol.c and memcontrol-v1.c into mm/memcontrol-v1.h. Link: https://lkml.kernel.org/r/20240625005906.106920-3-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: memcg: introduce memcontrol-v1.cRoman Gushchin
Patch series "mm: memcg: separate legacy cgroup v1 code and put under config option", v2. Cgroups v2 have been around for a while and many users have fully adopted them, so they never use cgroups v1 features and functionality. Yet they have to "pay" for the cgroup v1 support anyway: 1) the kernel binary contains an unused cgroup v1 code, 2) some code paths have additional checks which are not needed, 3) some common structures like task_struct and mem_cgroup contain unused cgroup v1-specific members. Cgroup v1's memory controller has a number of features that are not supported by cgroup v2 and their implementation is pretty much self contained. Most notably, these features are: soft limit reclaim, oom handling in userspace, complicated event notification system, charge migration. Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's intervened with generic and cgroup v2-specific code. It's a burden on developers and maintainers. This patchset aims to solve these problems by: 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the mm/memcontrol-v1.h header, 3) introducing the CONFIG_MEMCG_V1 config option, turned off by default, 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set. If CONFIG_MEMCG_V1 is not set, cgroup v1 memory controller is still available for mounting, however no memory-specific control knobs are present. This patch (of 14): This patch introduces the mm/memcontrol-v1.c source file which will be used for all legacy (cgroup v1) memory cgroup code. It also introduces mm/memcontrol-v1.h to keep declarations shared between mm/memcontrol.c and mm/memcontrol-v1.c. As of now, let's compile it if CONFIG_MEMCG is set, similar to mm/memcontrol.c. Later on it can be switched to use a separate config option, so that the legacy code won't be compiled if not required. Link: https://lkml.kernel.org/r/20240625005906.106920-1-roman.gushchin@linux.dev Link: https://lkml.kernel.org/r/20240625005906.106920-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm/ksm: optimize the chain()/chain_prune() interfacesChengming Zhou
Now the implementation of stable_node_dup() causes chain()/chain_prune() interfaces and usages are overcomplicated. Why? stable_node_dup() only find and return a candidate stable_node for sharing, so the users have to recheck using stable_node_dup_any() if any non-candidate stable_node exist. And try to ksm_get_folio() from it again. Actually, stable_node_dup() can just return a best stable_node as it can, then the users can check if it's a candidate for sharing or not. The code is simplified too and fewer corner cases: such as stable_node and stable_node_dup can't be NULL if returned tree_folio is not NULL. Link: https://lkml.kernel.org/r/20240621-b4-ksm-scan-optimize-v2-3-1c328aa9e30b@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm/ksm: don't waste time searching stable tree for fast changing pageChengming Zhou
The code flow in cmp_and_merge_page() is suboptimal for handling the ksm page and non-ksm page at the same time. For example: - ksm page 1. Mostly just return if this ksm page is not migrated and this rmap_item has been on the rmap hlist. Or we have to fix this rmap_item mapping. 2. But we absolutely don't need to checksum for this ksm page, since it can't change. - non-ksm page 1. First don't need to waste time searching stable tree if fast changing. 2. Should try to merge with zero page before search the stable tree. 3. Then search stable tree to find mergeable ksm page. This patch optimizes the code flow so the handling differences between ksm page and non-ksm page become clearer and more efficient too. Link: https://lkml.kernel.org/r/20240621-b4-ksm-scan-optimize-v2-2-1c328aa9e30b@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm/ksm: refactor out try_to_merge_with_zero_page()Chengming Zhou
Patch series "mm/ksm: cmp_and_merge_page() optimizations and cleanup", v2. This series mainly optimizes cmp_and_merge_page() to have more efficient separate code flow for ksm page and non-ksm anon page. - ksm page: don't need to calculate the checksum obviously. - anon page: don't need to search stable tree if changing fast and try to merge with zero page before searching ksm page on stable tree. Please see the patch-2 for details. Patch-3 is cleanup also a little optimization for the chain()/chain_prune interfaces, which made the stable_tree_search()/stable_tree_insert() over complex. I have done simple testing using "hackbench -g 1 -l 300000" (maybe I need to use a better workload) on my machine, have seen a little CPU usage decrease of ksmd and some improvements of cmp_and_merge_page() latency: We can see the latency of cmp_and_merge_page() when handling non-ksm anon pages has been improved. This patch (of 3): In preparation for later changes, refactor out a new function called try_to_merge_with_zero_page(), which tries to merge with zero page. Link: https://lkml.kernel.org/r/20240621-b4-ksm-scan-optimize-v2-0-1c328aa9e30b@linux.dev Link: https://lkml.kernel.org/r/20240621-b4-ksm-scan-optimize-v2-1-1c328aa9e30b@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04hugetlb: force allocating surplus hugepages on mempolicy allowed nodesAristeu Rozanski
When trying to allocate a hugepage with no reserved ones free, it may be allowed in case a number of overcommit hugepages was configured (using /proc/sys/vm/nr_overcommit_hugepages) and that number wasn't reached. This allows for a behavior of having extra hugepages allocated dynamically, if there're resources for it. Some sysadmins even prefer not reserving any hugepages and setting a big number of overcommit hugepages. But while attempting to allocate overcommit hugepages in a multi node system (either NUMA or mempolicy/cpuset) said allocations might randomly fail even when there're resources available for the allocation. This happens due to allowed_mems_nr() only accounting for the number of free hugepages in the nodes the current process belongs to and the surplus hugepage allocation is done so it can be allocated in any node. In case one or more of the requested surplus hugepages are allocated in a different node, the whole allocation will fail due allowed_mems_nr() returning a lower value. So allocate surplus hugepages in one of the nodes the current process belongs to. Easy way to reproduce this issue is to use a 2+ NUMA nodes system: # echo 0 >/proc/sys/vm/nr_hugepages # echo 1 >/proc/sys/vm/nr_overcommit_hugepages # numactl -m0 ./tools/testing/selftests/mm/map_hugetlb 2 Repeating the execution of map_hugetlb test application will eventually fail when the hugepage ends up allocated in a different node. [aris@ruivo.org: v2] Link: https://lkml.kernel.org/r/20240701212343.GG844599@cathedrallabs.org Link: https://lkml.kernel.org/r/20240621190050.mhxwb65zn37doegp@redhat.com Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: David Hildenbrand <david@redhat.com> Cc: Vishal Moola <vishal.moola@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm/damon/paddr: initialize nr_succeeded in __damon_pa_migrate_folio_list()SeongJae Park
The variable is supposed to be set via later migrate_pages() call. However, the function does not do that when CONFIG_MIGRATION is unset. Initialize the variable to zero. Link: https://lkml.kernel.org/r/20240701165332.47495-1-sj@kernel.org Fixes: 5311c0a2eee3 ("mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202406251102.GE07hqfQ-lkp@intel.com/ Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Hyeongtak Ji <hyeongtak.ji@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-04mm: refactor folio_undo_large_rmappable()Kefeng Wang
Folios of order <= 1 are not in deferred list, the check of order is added into folio_undo_large_rmappable() from commit 8897277acfef ("mm: support order-1 folios in the page cache"), but there is a repeated check for small folio (order 0) during each call of the folio_undo_large_rmappable(), so only keep folio_order() check inside the function. In addition, move all the checks into header file to save a function call for non-large-rmappable or empty deferred_list folio. Link: https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03nilfs2: fix kernel bug on rename operation of broken directoryRyusuke Konishi
Syzbot reported that in rename directory operation on broken directory on nilfs2, __block_write_begin_int() called to prepare block write may fail BUG_ON check for access exceeding the folio/page size. This is because nilfs_dotdot(), which gets parent directory reference entry ("..") of the directory to be moved or renamed, does not check consistency enough, and may return location exceeding folio/page size for broken directories. Fix this issue by checking required directory entries ("." and "..") in the first chunk of the directory in nilfs_dotdot(). Link: https://lkml.kernel.org/r/20240628165107.9006-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+d3abed1ad3d367fa2627@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d3abed1ad3d367fa2627 Fixes: 2ba466d74ed7 ("nilfs2: directory entry operations") Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/hugetlb_vmemmap: fix race with speculative PFN walkersYu Zhao
While investigating HVO for THPs [1], it turns out that speculative PFN walkers like compaction can race with vmemmap modifications, e.g., CPU 1 (vmemmap modifier) CPU 2 (speculative PFN walker) ------------------------------- ------------------------------ Allocates an LRU folio page1 Sees page1 Frees page1 Allocates a hugeTLB folio page2 (page1 being a tail of page2) Updates vmemmap mapping page1 get_page_unless_zero(page1) Even though page1->_refcount is zero after HVO, get_page_unless_zero() can still try to modify this read-only field, resulting in a crash. An independent report [2] confirmed this race. There are two discussed approaches to fix this race: 1. Make RO vmemmap RW so that get_page_unless_zero() can fail without triggering a PF. 2. Use RCU to make sure get_page_unless_zero() either sees zero page->_refcount through the old vmemmap or non-zero page->_refcount through the new one. The second approach is preferred here because: 1. It can prevent illegal modifications to struct page[] that has been HVO'ed; 2. It can be generalized, in a way similar to ZERO_PAGE(), to fix similar races in other places, e.g., arch_remove_memory() on x86 [3], which frees vmemmap mapping offlined struct page[]. While adding synchronize_rcu(), the goal is to be surgical, rather than optimized. Specifically, calls to synchronize_rcu() on the error handling paths can be coalesced, but it is not done for the sake of Simplicity: noticeably, this fix removes ~50% more lines than it adds. According to the hugetlb_optimize_vmemmap section in Documentation/admin-guide/sysctl/vm.rst, enabling HVO makes allocating or freeing hugeTLB pages "~2x slower than before". Having synchronize_rcu() on top makes those operations even worse, and this also affects the user interface /proc/sys/vm/nr_overcommit_hugepages. This is *very* hard to trigger: 1. Most hugeTLB use cases I know of are static, i.e., reserved at boot time, because allocating at runtime is not reliable at all. 2. On top of that, someone has to be very unlucky to get tripped over above, because the race window is so small -- I wasn't able to trigger it with a stress testing that does nothing but that (with THPs though). [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/ [2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev/ [3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@redhat.com/ Link: https://lkml.kernel.org/r/20240627222705.2974207-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03cachestat: do not flush stats in recency checkNhat Pham
syzbot detects that cachestat() is flushing stats, which can sleep, in its RCU read section (see [1]). This is done in the workingset_test_recent() step (which checks if the folio's eviction is recent). Move the stat flushing step to before the RCU read section of cachestat, and skip stat flushing during the recency check. [1]: https://lore.kernel.org/cgroups/000000000000f71227061bdf97e0@google.com/ Link: https://lkml.kernel.org/r/20240627201737.3506959-1-nphamcs@gmail.com Fixes: b00684722262 ("mm: workingset: move the stats flush into workingset_test_recent()") Signed-off-by: Nhat Pham <nphamcs@gmail.com> Reported-by: syzbot+b7f13b2d0cc156edf61a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/cgroups/000000000000f71227061bdf97e0@google.com/ Debugged-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kairui Song <kasong@tencent.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: <stable@vger.kernel.org> [6.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/shmem: disable PMD-sized page cache if neededGavin Shan
For shmem files, it's possible that PMD-sized page cache can't be supported by xarray. For example, 512MB page cache on ARM64 when the base page size is 64KB can't be supported by xarray. It leads to errors as the following messages indicate when this sort of xarray entry is split. WARNING: CPU: 34 PID: 7578 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128 Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 \ nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject \ nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \ ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse xfs \ libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_net \ net_failover virtio_console virtio_blk failover dimlib virtio_mmio CPU: 34 PID: 7578 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024 pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : xas_split_alloc+0xf8/0x128 lr : split_huge_page_to_list_to_order+0x1c4/0x720 sp : ffff8000882af5f0 x29: ffff8000882af5f0 x28: ffff8000882af650 x27: ffff8000882af768 x26: 0000000000000cc0 x25: 000000000000000d x24: ffff00010625b858 x23: ffff8000882af650 x22: ffffffdfc0900000 x21: 0000000000000000 x20: 0000000000000000 x19: ffffffdfc0900000 x18: 0000000000000000 x17: 0000000000000000 x16: 0000018000000000 x15: 52f8004000000000 x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020 x11: 52f8000000000000 x10: 52f8e1c0ffff6000 x9 : ffffbeb9619a681c x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff00010b02ddb0 x5 : ffffbeb96395e378 x4 : 0000000000000000 x3 : 0000000000000cc0 x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000 Call trace: xas_split_alloc+0xf8/0x128 split_huge_page_to_list_to_order+0x1c4/0x720 truncate_inode_partial_folio+0xdc/0x160 shmem_undo_range+0x2bc/0x6a8 shmem_fallocate+0x134/0x430 vfs_fallocate+0x124/0x2e8 ksys_fallocate+0x4c/0xa0 __arm64_sys_fallocate+0x24/0x38 invoke_syscall.constprop.0+0x7c/0xd8 do_el0_svc+0xb4/0xd0 el0_svc+0x44/0x1d8 el0t_64_sync_handler+0x134/0x150 el0t_64_sync+0x17c/0x180 Fix it by disabling PMD-sized page cache when HPAGE_PMD_ORDER is larger than MAX_PAGECACHE_ORDER. As Matthew Wilcox pointed, the page cache in a shmem file isn't represented by a multi-index entry and doesn't have this limitation when the xarry entry is split until commit 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache"). Link: https://lkml.kernel.org/r/20240627003953.1262512-5-gshan@redhat.com Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Zhenyu Zhang <zhenyzha@redhat.com> Cc: <stable@vger.kernel.org> [5.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/filemap: skip to create PMD-sized page cache if neededGavin Shan
On ARM64, HPAGE_PMD_ORDER is 13 when the base page size is 64KB. The PMD-sized page cache can't be supported by xarray as the following error messages indicate. ------------[ cut here ]------------ WARNING: CPU: 35 PID: 7484 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128 Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \ nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \ nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \ ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm \ fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 \ sha1_ce virtio_net net_failover virtio_console virtio_blk failover \ dimlib virtio_mmio CPU: 35 PID: 7484 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024 pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : xas_split_alloc+0xf8/0x128 lr : split_huge_page_to_list_to_order+0x1c4/0x720 sp : ffff800087a4f6c0 x29: ffff800087a4f6c0 x28: ffff800087a4f720 x27: 000000001fffffff x26: 0000000000000c40 x25: 000000000000000d x24: ffff00010625b858 x23: ffff800087a4f720 x22: ffffffdfc0780000 x21: 0000000000000000 x20: 0000000000000000 x19: ffffffdfc0780000 x18: 000000001ff40000 x17: 00000000ffffffff x16: 0000018000000000 x15: 51ec004000000000 x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020 x11: 51ec000000000000 x10: 51ece1c0ffff8000 x9 : ffffbeb961a44d28 x8 : 0000000000000003 x7 : ffffffdfc0456420 x6 : ffff0000e1aa6eb8 x5 : 20bf08b4fe778fca x4 : ffffffdfc0456420 x3 : 0000000000000c40 x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000 Call trace: xas_split_alloc+0xf8/0x128 split_huge_page_to_list_to_order+0x1c4/0x720 truncate_inode_partial_folio+0xdc/0x160 truncate_inode_pages_range+0x1b4/0x4a8 truncate_pagecache_range+0x84/0xa0 xfs_flush_unmap_range+0x70/0x90 [xfs] xfs_file_fallocate+0xfc/0x4d8 [xfs] vfs_fallocate+0x124/0x2e8 ksys_fallocate+0x4c/0xa0 __arm64_sys_fallocate+0x24/0x38 invoke_syscall.constprop.0+0x7c/0xd8 do_el0_svc+0xb4/0xd0 el0_svc+0x44/0x1d8 el0t_64_sync_handler+0x134/0x150 el0t_64_sync+0x17c/0x180 Fix it by skipping to allocate PMD-sized page cache when its size is larger than MAX_PAGECACHE_ORDER. For this specific case, we will fall to regular path where the readahead window is determined by BDI's sysfs file (read_ahead_kb). Link: https://lkml.kernel.org/r/20240627003953.1262512-4-gshan@redhat.com Fixes: 4687fdbb805a ("mm/filemap: Support VM_HUGEPAGE for file mappings") Signed-off-by: Gavin Shan <gshan@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Zhenyu Zhang <zhenyzha@redhat.com> Cc: <stable@vger.kernel.org> [5.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/readahead: limit page cache size in page_cache_ra_order()Gavin Shan
In page_cache_ra_order(), the maximal order of the page cache to be allocated shouldn't be larger than MAX_PAGECACHE_ORDER. Otherwise, it's possible the large page cache can't be supported by xarray when the corresponding xarray entry is split. For example, HPAGE_PMD_ORDER is 13 on ARM64 when the base page size is 64KB. The PMD-sized page cache can't be supported by xarray. Link: https://lkml.kernel.org/r/20240627003953.1262512-3-gshan@redhat.com Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Zhenyu Zhang <zhenyzha@redhat.com> Cc: <stable@vger.kernel.org> [5.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarrayGavin Shan
Patch series "mm/filemap: Limit page cache size to that supported by xarray", v2. Currently, xarray can't support arbitrary page cache size. More details can be found from the WARN_ON() statement in xas_split_alloc(). In our test whose code is attached below, we hit the WARN_ON() on ARM64 system where the base page size is 64KB and huge page size is 512MB. The issue was reported long time ago and some discussions on it can be found here [1]. [1] https://www.spinics.net/lists/linux-xfs/msg75404.html In order to fix the issue, we need to adjust MAX_PAGECACHE_ORDER to one supported by xarray and avoid PMD-sized page cache if needed. The code changes are suggested by David Hildenbrand. PATCH[1] adjusts MAX_PAGECACHE_ORDER to that supported by xarray PATCH[2-3] avoids PMD-sized page cache in the synchronous readahead path PATCH[4] avoids PMD-sized page cache for shmem files if needed Test program ============ # cat test.c #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <string.h> #include <fcntl.h> #include <errno.h> #include <sys/syscall.h> #include <sys/mman.h> #define TEST_XFS_FILENAME "/tmp/data" #define TEST_SHMEM_FILENAME "/dev/shm/data" #define TEST_MEM_SIZE 0x20000000 int main(int argc, char **argv) { const char *filename; int fd = 0; void *buf = (void *)-1, *p; int pgsize = getpagesize(); int ret; if (pgsize != 0x10000) { fprintf(stderr, "64KB base page size is required\n"); return -EPERM; } system("echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled"); system("rm -fr /tmp/data"); system("rm -fr /dev/shm/data"); system("echo 1 > /proc/sys/vm/drop_caches"); /* Open xfs or shmem file */ filename = TEST_XFS_FILENAME; if (argc > 1 && !strcmp(argv[1], "shmem")) filename = TEST_SHMEM_FILENAME; fd = open(filename, O_CREAT | O_RDWR | O_TRUNC); if (fd < 0) { fprintf(stderr, "Unable to open <%s>\n", filename); return -EIO; } /* Extend file size */ ret = ftruncate(fd, TEST_MEM_SIZE); if (ret) { fprintf(stderr, "Error %d to ftruncate()\n", ret); goto cleanup; } /* Create VMA */ buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (buf == (void *)-1) { fprintf(stderr, "Unable to mmap <%s>\n", filename); goto cleanup; } fprintf(stdout, "mapped buffer at 0x%p\n", buf); ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE); if (ret) { fprintf(stderr, "Unable to madvise(MADV_HUGEPAGE)\n"); goto cleanup; } /* Populate VMA */ ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_WRITE); if (ret) { fprintf(stderr, "Error %d to madvise(MADV_POPULATE_WRITE)\n", ret); goto cleanup; } /* Punch the file to enforce xarray split */ ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, TEST_MEM_SIZE - pgsize, pgsize); if (ret) fprintf(stderr, "Error %d to fallocate()\n", ret); cleanup: if (buf != (void *)-1) munmap(buf, TEST_MEM_SIZE); if (fd > 0) close(fd); return 0; } # gcc test.c -o test # cat /proc/1/smaps | grep KernelPageSize | head -n 1 KernelPageSize: 64 kB # ./test shmem : ------------[ cut here ]------------ WARNING: CPU: 17 PID: 5253 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128 Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \ nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \ nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \ ip_set nf_tables rfkill nfnetlink vfat fat virtio_balloon \ drm fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 \ virtio_net sha1_ce net_failover failover virtio_console virtio_blk \ dimlib virtio_mmio CPU: 17 PID: 5253 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #12 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024 pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : xas_split_alloc+0xf8/0x128 lr : split_huge_page_to_list_to_order+0x1c4/0x720 sp : ffff80008a92f5b0 x29: ffff80008a92f5b0 x28: ffff80008a92f610 x27: ffff80008a92f728 x26: 0000000000000cc0 x25: 000000000000000d x24: ffff0000cf00c858 x23: ffff80008a92f610 x22: ffffffdfc0600000 x21: 0000000000000000 x20: 0000000000000000 x19: ffffffdfc0600000 x18: 0000000000000000 x17: 0000000000000000 x16: 0000018000000000 x15: 3374004000000000 x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020 x11: 3374000000000000 x10: 3374e1c0ffff6000 x9 : ffffb463a84c681c x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff00011c976ce0 x5 : ffffb463aa47e378 x4 : 0000000000000000 x3 : 0000000000000cc0 x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000 Call trace: xas_split_alloc+0xf8/0x128 split_huge_page_to_list_to_order+0x1c4/0x720 truncate_inode_partial_folio+0xdc/0x160 shmem_undo_range+0x2bc/0x6a8 shmem_fallocate+0x134/0x430 vfs_fallocate+0x124/0x2e8 ksys_fallocate+0x4c/0xa0 __arm64_sys_fallocate+0x24/0x38 invoke_syscall.constprop.0+0x7c/0xd8 do_el0_svc+0xb4/0xd0 el0_svc+0x44/0x1d8 el0t_64_sync_handler+0x134/0x150 el0t_64_sync+0x17c/0x180 This patch (of 4): The largest page cache order can be HPAGE_PMD_ORDER (13) on ARM64 with 64KB base page size. The xarray entry with this order can't be split as the following error messages indicate. ------------[ cut here ]------------ WARNING: CPU: 35 PID: 7484 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128 Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \ nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \ nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \ ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm \ fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 \ sha1_ce virtio_net net_failover virtio_console virtio_blk failover \ dimlib virtio_mmio CPU: 35 PID: 7484 Comm: test Kdump: loaded Tainted: G W 6.10.0-rc5-gavin+ #9 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024 pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : xas_split_alloc+0xf8/0x128 lr : split_huge_page_to_list_to_order+0x1c4/0x720 sp : ffff800087a4f6c0 x29: ffff800087a4f6c0 x28: ffff800087a4f720 x27: 000000001fffffff x26: 0000000000000c40 x25: 000000000000000d x24: ffff00010625b858 x23: ffff800087a4f720 x22: ffffffdfc0780000 x21: 0000000000000000 x20: 0000000000000000 x19: ffffffdfc0780000 x18: 000000001ff40000 x17: 00000000ffffffff x16: 0000018000000000 x15: 51ec004000000000 x14: 0000e00000000000 x13: 0000000000002000 x12: 0000000000000020 x11: 51ec000000000000 x10: 51ece1c0ffff8000 x9 : ffffbeb961a44d28 x8 : 0000000000000003 x7 : ffffffdfc0456420 x6 : ffff0000e1aa6eb8 x5 : 20bf08b4fe778fca x4 : ffffffdfc0456420 x3 : 0000000000000c40 x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000 Call trace: xas_split_alloc+0xf8/0x128 split_huge_page_to_list_to_order+0x1c4/0x720 truncate_inode_partial_folio+0xdc/0x160 truncate_inode_pages_range+0x1b4/0x4a8 truncate_pagecache_range+0x84/0xa0 xfs_flush_unmap_range+0x70/0x90 [xfs] xfs_file_fallocate+0xfc/0x4d8 [xfs] vfs_fallocate+0x124/0x2e8 ksys_fallocate+0x4c/0xa0 __arm64_sys_fallocate+0x24/0x38 invoke_syscall.constprop.0+0x7c/0xd8 do_el0_svc+0xb4/0xd0 el0_svc+0x44/0x1d8 el0t_64_sync_handler+0x134/0x150 el0t_64_sync+0x17c/0x180 Fix it by decreasing MAX_PAGECACHE_ORDER to the largest supported order by xarray. For this specific case, MAX_PAGECACHE_ORDER is dropped from 13 to 11 when CONFIG_BASE_SMALL is disabled. Link: https://lkml.kernel.org/r/20240627003953.1262512-1-gshan@redhat.com Link: https://lkml.kernel.org/r/20240627003953.1262512-2-gshan@redhat.com Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: Gavin Shan <gshan@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Don Dutile <ddutile@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Zhenyu Zhang <zhenyzha@redhat.com> Cc: <stable@vger.kernel.org> [5.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/damon/core: merge regions aggressively when max_nr_regions is unmetSeongJae Park
DAMON keeps the number of regions under max_nr_regions by skipping regions split operations when doing so can make the number higher than the limit. It works well for preventing violation of the limit. But, if somehow the violation happens, it cannot recovery well depending on the situation. In detail, if the real number of regions having different access pattern is higher than the limit, the mechanism cannot reduce the number below the limit. In such a case, the system could suffer from high monitoring overhead of DAMON. The violation can actually happen. For an example, the user could reduce max_nr_regions while DAMON is running, to be lower than the current number of regions. Fix the problem by repeating the merge operations with increasing aggressiveness in kdamond_merge_regions() for the case, until the limit is met. [sj@kernel.org: increase regions merge aggressiveness while respecting min_nr_regions] Link: https://lkml.kernel.org/r/20240626164753.46270-1-sj@kernel.org [sj@kernel.org: ensure max threshold attempt for max_nr_regions violation] Link: https://lkml.kernel.org/r/20240627163153.75969-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240624175814.89611-1-sj@kernel.org Fixes: b9a6ac4e4ede ("mm/damon: adaptively adjust regions") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03Fix userfaultfd_api to return EINVAL as expectedAudra Mitchell
Currently if we request a feature that is not set in the Kernel config we fail silently and return all the available features. However, the man page indicates we should return an EINVAL. We need to fix this issue since we can end up with a Kernel warning should a program request the feature UFFD_FEATURE_WP_UNPOPULATED on a kernel with the config not set with this feature. [ 200.812896] WARNING: CPU: 91 PID: 13634 at mm/memory.c:1660 zap_pte_range+0x43d/0x660 [ 200.820738] Modules linked in: [ 200.869387] CPU: 91 PID: 13634 Comm: userfaultfd Kdump: loaded Not tainted 6.9.0-rc5+ #8 [ 200.877477] Hardware name: Dell Inc. PowerEdge R6525/0N7YGH, BIOS 2.7.3 03/30/2022 [ 200.885052] RIP: 0010:zap_pte_range+0x43d/0x660 Link: https://lkml.kernel.org/r/20240626130513.120193-1-audra@redhat.com Fixes: e06f1e1dd499 ("userfaultfd: wp: enabled write protection in userfaultfd API") Signed-off-by: Audra Mitchell <audra@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: vmalloc: check if a hash-index is in cpu_possible_maskUladzislau Rezki (Sony)
The problem is that there are systems where cpu_possible_mask has gaps between set CPUs, for example SPARC. In this scenario addr_to_vb_xa() hash function can return an index which accesses to not-possible and not setup CPU area using per_cpu() macro. This results in an oops on SPARC. A per-cpu vmap_block_queue is also used as hash table, incorrectly assuming the cpu_possible_mask has no gaps. Fix it by adjusting an index to a next possible CPU. Link: https://lkml.kernel.org/r/20240626140330.89836-1-urezki@gmail.com Fixes: 062eacf57ad9 ("mm: vmalloc: remove a global vmap_blocks xarray") Reported-by: Nick Bowler <nbowler@draconx.ca> Closes: https://lore.kernel.org/linux-kernel/ZntjIE6msJbF8zTa@MiWiFi-R3L-srv/T/ Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hailong.Liu <hailong.liu@oppo.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: prevent derefencing NULL ptr in pfn_section_valid()Waiman Long
Commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage") changed pfn_section_valid() to add a READ_ONCE() call around "ms->usage" to fix a race with section_deactivate() where ms->usage can be cleared. The READ_ONCE() call, by itself, is not enough to prevent NULL pointer dereference. We need to check its value before dereferencing it. Link: https://lkml.kernel.org/r/20240626001639.1350646-1-longman@redhat.com Fixes: 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage") Signed-off-by: Waiman Long <longman@redhat.com> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm: page_ref: remove folio_try_get_rcu()Yang Shi
The below bug was reported on a non-SMP kernel: [ 275.267158][ T4335] ------------[ cut here ]------------ [ 275.267949][ T4335] kernel BUG at include/linux/page_ref.h:275! [ 275.268526][ T4335] invalid opcode: 0000 [#1] KASAN PTI [ 275.269001][ T4335] CPU: 0 PID: 4335 Comm: trinity-c3 Not tainted 6.7.0-rc4-00061-gefa7df3e3bb5 #1 [ 275.269787][ T4335] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 275.270679][ T4335] RIP: 0010:try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.272813][ T4335] RSP: 0018:ffffc90005dcf650 EFLAGS: 00010202 [ 275.273346][ T4335] RAX: 0000000000000246 RBX: ffffea00066e0000 RCX: 0000000000000000 [ 275.274032][ T4335] RDX: fffff94000cdc007 RSI: 0000000000000004 RDI: ffffea00066e0034 [ 275.274719][ T4335] RBP: ffffea00066e0000 R08: 0000000000000000 R09: fffff94000cdc006 [ 275.275404][ T4335] R10: ffffea00066e0037 R11: 0000000000000000 R12: 0000000000000136 [ 275.276106][ T4335] R13: ffffea00066e0034 R14: dffffc0000000000 R15: ffffea00066e0008 [ 275.276790][ T4335] FS: 00007fa2f9b61740(0000) GS:ffffffff89d0d000(0000) knlGS:0000000000000000 [ 275.277570][ T4335] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 275.278143][ T4335] CR2: 00007fa2f6c00000 CR3: 0000000134b04000 CR4: 00000000000406f0 [ 275.278833][ T4335] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 275.279521][ T4335] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 275.280201][ T4335] Call Trace: [ 275.280499][ T4335] <TASK> [ 275.280751][ T4335] ? die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434 arch/x86/kernel/dumpstack.c:447) [ 275.281087][ T4335] ? do_trap (arch/x86/kernel/traps.c:112 arch/x86/kernel/traps.c:153) [ 275.281463][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.281884][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.282300][ T4335] ? do_error_trap (arch/x86/kernel/traps.c:174) [ 275.282711][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.283129][ T4335] ? handle_invalid_op (arch/x86/kernel/traps.c:212) [ 275.283561][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.283990][ T4335] ? exc_invalid_op (arch/x86/kernel/traps.c:264) [ 275.284415][ T4335] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568) [ 275.284859][ T4335] ? try_get_folio (include/linux/page_ref.h:275 (discriminator 3) mm/gup.c:79 (discriminator 3)) [ 275.285278][ T4335] try_grab_folio (mm/gup.c:148) [ 275.285684][ T4335] __get_user_pages (mm/gup.c:1297 (discriminator 1)) [ 275.286111][ T4335] ? __pfx___get_user_pages (mm/gup.c:1188) [ 275.286579][ T4335] ? __pfx_validate_chain (kernel/locking/lockdep.c:3825) [ 275.287034][ T4335] ? mark_lock (kernel/locking/lockdep.c:4656 (discriminator 1)) [ 275.287416][ T4335] __gup_longterm_locked (mm/gup.c:1509 mm/gup.c:2209) [ 275.288192][ T4335] ? __pfx___gup_longterm_locked (mm/gup.c:2204) [ 275.288697][ T4335] ? __pfx_lock_acquire (kernel/locking/lockdep.c:5722) [ 275.289135][ T4335] ? __pfx___might_resched (kernel/sched/core.c:10106) [ 275.289595][ T4335] pin_user_pages_remote (mm/gup.c:3350) [ 275.290041][ T4335] ? __pfx_pin_user_pages_remote (mm/gup.c:3350) [ 275.290545][ T4335] ? find_held_lock (kernel/locking/lockdep.c:5244 (discriminator 1)) [ 275.290961][ T4335] ? mm_access (kernel/fork.c:1573) [ 275.291353][ T4335] process_vm_rw_single_vec+0x142/0x360 [ 275.291900][ T4335] ? __pfx_process_vm_rw_single_vec+0x10/0x10 [ 275.292471][ T4335] ? mm_access (kernel/fork.c:1573) [ 275.292859][ T4335] process_vm_rw_core+0x272/0x4e0 [ 275.293384][ T4335] ? hlock_class (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228) [ 275.293780][ T4335] ? __pfx_process_vm_rw_core+0x10/0x10 [ 275.294350][ T4335] process_vm_rw (mm/process_vm_access.c:284) [ 275.294748][ T4335] ? __pfx_process_vm_rw (mm/process_vm_access.c:259) [ 275.295197][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1)) [ 275.295634][ T4335] __x64_sys_process_vm_readv (mm/process_vm_access.c:291) [ 275.296139][ T4335] ? syscall_enter_from_user_mode (kernel/entry/common.c:94 kernel/entry/common.c:112) [ 275.296642][ T4335] do_syscall_64 (arch/x86/entry/common.c:51 (discriminator 1) arch/x86/entry/common.c:82 (discriminator 1)) [ 275.297032][ T4335] ? __task_pid_nr_ns (include/linux/rcupdate.h:306 (discriminator 1) include/linux/rcupdate.h:780 (discriminator 1) kernel/pid.c:504 (discriminator 1)) [ 275.297470][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359) [ 275.297988][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.298389][ T4335] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4300 kernel/locking/lockdep.c:4359) [ 275.298906][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.299304][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.299703][ T4335] ? do_syscall_64 (arch/x86/include/asm/cpufeature.h:171 arch/x86/entry/common.c:97) [ 275.300115][ T4335] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) This BUG is the VM_BUG_ON(!in_atomic() && !irqs_disabled()) assertion in folio_ref_try_add_rcu() for non-SMP kernel. The process_vm_readv() calls GUP to pin the THP. An optimization for pinning THP instroduced by commit 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") calls try_grab_folio() to pin the THP, but try_grab_folio() is supposed to be called in atomic context for non-SMP kernel, for example, irq disabled or preemption disabled, due to the optimization introduced by commit e286781d5f2e ("mm: speculative page references"). The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") is not actually the root cause although it was bisected to. It just makes the problem exposed more likely. The follow up discussion suggested the optimization for non-SMP kernel may be out-dated and not worth it anymore [1]. So removing the optimization to silence the BUG. However calling try_grab_folio() in GUP slow path actually is unnecessary, so the following patch will clean this up. [1] https://lore.kernel.org/linux-mm/821cf1d6-92b9-4ac4-bacc-d8f2364ac14f@paulmck-laptop/ Link: https://lkml.kernel.org/r/20240625205350.1777481-1-yang@os.amperecomputing.com Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Oliver Sang <oliver.sang@intel.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03selftests/damon/damon_nr_regions: test online-tuned max_nr_regionsSeongJae Park
User could update max_nr_regions parameter while DAMON is running to a value that smaller than the current number of regions that DAMON is seeing. Such update could be done for reducing the monitoring overhead. In the case, DAMON should merge regions aggressively more than normal situation to ensure the new limit is successfully applied. Implement a kselftest to ensure that. Link: https://lkml.kernel.org/r/20240625180538.73134-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03_damon_sysfs: implement commit() for online parameters updateSeongJae Park
Users can update DAMON parameters while it is running, using 'commit' DAMON sysfs interface command. For testing the feature in future tests, implement a function for doing that on the test-purpose DAMON sysfs interface wrapper Python module. Link: https://lkml.kernel.org/r/20240625180538.73134-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03selftests/damon: implement test for min/max_nr_regionsSeongJae Park
Implement a kselftest for DAMON's {min,max}_nr_regions' parameters. The test ensures both the minimum and the maximum number of regions limit is respected even if the workload's real number of regions is less than the minimum or larger than the maximum limits. Link: https://lkml.kernel.org/r/20240625180538.73134-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03selftests/damon/_damon_sysfs: implement kdamonds stop functionSeongJae Park
Implement DAMON stop function on the test-purpose DAMON sysfs interface wrapper Python module, _damon_sysfs.py. This feature will be used by future DAMON tests that need to start/stop DAMON multiple times. Link: https://lkml.kernel.org/r/20240625180538.73134-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03selftests/damon: implement DAMOS tried regions testSeongJae Park
Implement a test for DAMOS tried regions command of DAMON sysfs interface. It ensures the expected number of monitoring regions are created using an artificial memory access pattern generator program. Link: https://lkml.kernel.org/r/20240625180538.73134-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03selftests/damon: implement a program for even-numbered memory regions accessSeongJae Park
To test schemes_tried_regions feature, we need to have a program having specific number of regions that having different access pattern. Existing artificial access pattern generator, 'access_memory', cannot be used for the purpose, since it accesses only one region at a given time. Extending it could be an option, but since the purpose and the implementation are pretty simple, implementing another one from the scratch is better. Implement such another artificial memory access program that allocates user-defined number/size regions and accesses even-numbered regions. Link: https://lkml.kernel.org/r/20240625180538.73134-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>