summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2023-05-17mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlockTetsuo Handa
commit 1007843a91909a4995ee78a538f62d8665705b66 upstream. syzbot is reporting circular locking dependency which involves zonelist_update_seq seqlock [1], for this lock is checked by memory allocation requests which do not need to be retried. One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler. CPU0 ---- __build_all_zonelists() { write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd // e.g. timer interrupt handler runs at this moment some_timer_func() { kmalloc(GFP_ATOMIC) { __alloc_pages_slowpath() { read_seqbegin(&zonelist_update_seq) { // spins forever because zonelist_update_seq.seqcount is odd } } } } // e.g. timer interrupt handler finishes write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even } This deadlock scenario can be easily eliminated by not calling read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation requests. But Michal Hocko does not know whether we should go with this approach. Another deadlock scenario which syzbot is reporting is a race between kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with port->lock held and printk() from __build_all_zonelists() with zonelist_update_seq held. CPU0 CPU1 ---- ---- pty_write() { tty_insert_flip_string_and_push_buffer() { __build_all_zonelists() { write_seqlock(&zonelist_update_seq); build_zonelists() { printk() { vprintk() { vprintk_default() { vprintk_emit() { console_unlock() { console_flush_all() { console_emit_next_record() { con->write() = serial8250_console_write() { spin_lock_irqsave(&port->lock, flags); tty_insert_flip_string() { tty_insert_flip_string_fixed_flag() { __tty_buffer_request_room() { tty_buffer_alloc() { kmalloc(GFP_ATOMIC | __GFP_NOWARN) { __alloc_pages_slowpath() { zonelist_iter_begin() { read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held } } } } } } } } spin_unlock_irqrestore(&port->lock, flags); // message is printed to console spin_unlock_irqrestore(&port->lock, flags); } } } } } } } } } write_sequnlock(&zonelist_update_seq); } } } This deadlock scenario can be eliminated by preventing interrupt context from calling kmalloc(GFP_ATOMIC) and preventing printk() from calling console_flush_all() while zonelist_update_seq.seqcount is odd. Since Petr Mladek thinks that __build_all_zonelists() can become a candidate for deferring printk() [2], let's address this problem by disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC) and disabling synchronous printk() in order to avoid console_flush_all() . As a side effect of minimizing duration of zonelist_update_seq.seqcount being odd by disabling synchronous printk(), latency at read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and __GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e. do not record unnecessary locking dependency) from interrupt context is still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC) inside write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq) section... Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp Fixes: 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation") Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2] Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com> Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Petr Mladek <pmladek@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Patrick Daly <quic_pdaly@quicinc.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-17writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbsBaokun Li
[ Upstream commit 1ba1199ec5747f475538c0d25a32804e5ba1dfde ] KASAN report null-ptr-deref: ================================================================== BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0 Write of size 8 at addr 0000000000000000 by task sync/943 CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461 Call Trace: <TASK> dump_stack_lvl+0x7f/0xc0 print_report+0x2ba/0x340 kasan_report+0xc4/0x120 kasan_check_range+0x1b7/0x2e0 __kasan_check_write+0x24/0x40 bdi_split_work_to_wbs+0x5c5/0x7b0 sync_inodes_sb+0x195/0x630 sync_inodes_one_sb+0x3a/0x50 iterate_supers+0x106/0x1b0 ksys_sync+0x98/0x160 [...] ================================================================== The race that causes the above issue is as follows: cpu1 cpu2 -------------------------|------------------------- inode_switch_wbs INIT_WORK(&isw->work, inode_switch_wbs_work_fn) queue_rcu_work(isw_wq, &isw->work) // queue_work async inode_switch_wbs_work_fn wb_put_many(old_wb, nr_switched) percpu_ref_put_many ref->data->release(ref) cgwb_release queue_work(cgwb_release_wq, &wb->release_work) // queue_work async &wb->release_work cgwb_release_workfn ksys_sync iterate_supers sync_inodes_one_sb sync_inodes_sb bdi_split_work_to_wbs kmalloc(sizeof(*work), GFP_ATOMIC) // alloc memory failed percpu_ref_exit ref->data = NULL kfree(data) wb_get(wb) percpu_ref_get(&wb->refcnt) percpu_ref_get_many(ref, 1) atomic_long_add(nr, &ref->data->count) atomic64_add(i, v) // trigger null-ptr-deref bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs. If the allocation of new work fails, the on-stack fallback will be used and the reference count of the current wb is increased afterwards. If cgroup writeback membership switches occur before getting the reference count and the current wb is released as old_wd, then calling wb_get() or wb_put() will trigger the null pointer dereference above. This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb() and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger this issue. For scenarios called via sync_inodes_sb(), originally commit 7fc5854f8c6e ("writeback: synchronize sync(2) against cgroup writeback membership switches") reduced the possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes easily reproducible again. To solve this problem, percpu_ref_exit() is called under RCU protection to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), and skip the current wb if wb_tryget() fails because the wb has already been shutdown. Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Brauner <brauner@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hou Tao <houtao1@huawei.com> Cc: yangerkun <yangerkun@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-04-26mm/khugepaged: check again on anon uffd-wp during isolationPeter Xu
commit dd47ac428c3f5f3bcabe845f36be870fe6c20784 upstream. Khugepaged collapse an anonymous thp in two rounds of scans. The 2nd round done in __collapse_huge_page_isolate() after hpage_collapse_scan_pmd(), during which all the locks will be released temporarily. It means the pgtable can change during this phase before 2nd round starts. It's logically possible some ptes got wr-protected during this phase, and we can errornously collapse a thp without noticing some ptes are wr-protected by userfault. e1e267c7928f wanted to avoid it but it only did that for the 1st phase, not the 2nd phase. Since __collapse_huge_page_isolate() happens after a round of small page swapins, we don't need to worry on any !present ptes - if it existed khugepaged will already bail out. So we only need to check present ptes with uffd-wp bit set there. This is something I found only but never had a reproducer, I thought it was one caused a bug in Muhammad's recent pagemap new ioctl work, but it turns out it's not the cause of that but an userspace bug. However this seems to still be a real bug even with a very small race window, still worth to have it fixed and copy stable. Link: https://lkml.kernel.org/r/20230405155120.3608140-1-peterx@redhat.com Fixes: e1e267c7928f ("khugepaged: skip collapse if uffd-wp detected") Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-20mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()Rongwei Wang
commit 6fe7d6b992113719e96744d974212df3fcddc76c upstream. The si->lock must be held when deleting the si from the available list. Otherwise, another thread can re-add the si to the available list, which can lead to memory corruption. The only place we have found where this happens is in the swapoff path. This case can be described as below: core 0 core 1 swapoff del_from_avail_list(si) waiting try lock si->lock acquire swap_avail_lock and re-add si into swap_avail_head acquire si->lock but missing si already being added again, and continuing to clear SWP_WRITEOK, etc. It can be easily found that a massive warning messages can be triggered inside get_swap_pages() by some special cases, for example, we call madvise(MADV_PAGEOUT) on blocks of touched memory concurrently, meanwhile, run much swapon-swapoff operations (e.g. stress-ng-swap). However, in the worst case, panic can be caused by the above scene. In swapoff(), the memory used by si could be kept in swap_info[] after turning off a swap. This means memory corruption will not be caused immediately until allocated and reset for a new swap in the swapon path. A panic message caused: (with CONFIG_PLIST_DEBUG enabled) ------------[ cut here ]------------ top: 00000000e58a3003, n: 0000000013e75cda, p: 000000008cd4451a prev: 0000000035b1e58a, n: 000000008cd4451a, p: 000000002150ee8d next: 000000008cd4451a, n: 000000008cd4451a, p: 000000008cd4451a WARNING: CPU: 21 PID: 1843 at lib/plist.c:60 plist_check_prev_next_node+0x50/0x70 Modules linked in: rfkill(E) crct10dif_ce(E)... CPU: 21 PID: 1843 Comm: stress-ng Kdump: ... 5.10.134+ Hardware name: Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015 pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--) pc : plist_check_prev_next_node+0x50/0x70 lr : plist_check_prev_next_node+0x50/0x70 sp : ffff0018009d3c30 x29: ffff0018009d3c40 x28: ffff800011b32a98 x27: 0000000000000000 x26: ffff001803908000 x25: ffff8000128ea088 x24: ffff800011b32a48 x23: 0000000000000028 x22: ffff001800875c00 x21: ffff800010f9e520 x20: ffff001800875c00 x19: ffff001800fdc6e0 x18: 0000000000000030 x17: 0000000000000000 x16: 0000000000000000 x15: 0736076307640766 x14: 0730073007380731 x13: 0736076307640766 x12: 0730073007380731 x11: 000000000004058d x10: 0000000085a85b76 x9 : ffff8000101436e4 x8 : ffff800011c8ce08 x7 : 0000000000000000 x6 : 0000000000000001 x5 : ffff0017df9ed338 x4 : 0000000000000001 x3 : ffff8017ce62a000 x2 : ffff0017df9ed340 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: plist_check_prev_next_node+0x50/0x70 plist_check_head+0x80/0xf0 plist_add+0x28/0x140 add_to_avail_list+0x9c/0xf0 _enable_swap_info+0x78/0xb4 __do_sys_swapon+0x918/0xa10 __arm64_sys_swapon+0x20/0x30 el0_svc_common+0x8c/0x220 do_el0_svc+0x2c/0x90 el0_svc+0x1c/0x30 el0_sync_handler+0xa8/0xb0 el0_sync+0x148/0x180 irq event stamp: 2082270 Now, si->lock locked before calling 'del_from_avail_list()' to make sure other thread see the si had been deleted and SWP_WRITEOK cleared together, will not reinsert again. This problem exists in versions after stable 5.10.y. Link: https://lkml.kernel.org/r/20230404154716.23058-1-rongwei.wang@linux.alibaba.com Fixes: a2468cc9bfdff ("swap: choose swap device according to numa node") Tested-by: Yongchen Yin <wb-yyc939293@alibaba-inc.com> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Aaron Lu <aaron.lu@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-22mm/userfaultfd: propagate uffd-wp bit when PTE-mapping the huge zeropageDavid Hildenbrand
commit 42b2af2c9b7eede8ef21d0943f84d135e21a32a3 upstream. Currently, we'd lose the userfaultfd-wp marker when PTE-mapping a huge zeropage, resulting in the next write faults in the PMD range not triggering uffd-wp events. Various actions (partial MADV_DONTNEED, partial mremap, partial munmap, partial mprotect) could trigger this. However, most importantly, un-protecting a single sub-page from the userfaultfd-wp handler when processing a uffd-wp event will PTE-map the shared huge zeropage and lose the uffd-wp bit for the remainder of the PMD. Let's properly propagate the uffd-wp bit to the PMDs. #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <stdbool.h> #include <inttypes.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #include <poll.h> #include <pthread.h> #include <sys/mman.h> #include <sys/syscall.h> #include <sys/ioctl.h> #include <linux/userfaultfd.h> static size_t pagesize; static int uffd; static volatile bool uffd_triggered; #define barrier() __asm__ __volatile__("": : :"memory") static void uffd_wp_range(char *start, size_t size, bool wp) { struct uffdio_writeprotect uffd_writeprotect; uffd_writeprotect.range.start = (unsigned long) start; uffd_writeprotect.range.len = size; if (wp) { uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP; } else { uffd_writeprotect.mode = 0; } if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) { fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno); exit(1); } } static void *uffd_thread_fn(void *arg) { static struct uffd_msg msg; ssize_t nread; while (1) { struct pollfd pollfd; int nready; pollfd.fd = uffd; pollfd.events = POLLIN; nready = poll(&pollfd, 1, -1); if (nready == -1) { fprintf(stderr, "poll() failed: %d\n", errno); exit(1); } nread = read(uffd, &msg, sizeof(msg)); if (nread <= 0) continue; if (msg.event != UFFD_EVENT_PAGEFAULT || !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) { printf("FAIL: wrong uffd-wp event fired\n"); exit(1); } /* un-protect the single page. */ uffd_triggered = true; uffd_wp_range((char *)(uintptr_t)msg.arg.pagefault.address, pagesize, false); } return arg; } static int setup_uffd(char *map, size_t size) { struct uffdio_api uffdio_api; struct uffdio_register uffdio_register; pthread_t thread; uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY); if (uffd < 0) { fprintf(stderr, "syscall() failed: %d\n", errno); return -errno; } uffdio_api.api = UFFD_API; uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) { fprintf(stderr, "UFFDIO_API failed: %d\n", errno); return -errno; } if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) { fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n"); return -ENOSYS; } uffdio_register.range.start = (unsigned long) map; uffdio_register.range.len = size; uffdio_register.mode = UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) { fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno); return -errno; } pthread_create(&thread, NULL, uffd_thread_fn, NULL); return 0; } int main(void) { const size_t size = 4 * 1024 * 1024ull; char *map, *cur; pagesize = getpagesize(); map = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0); if (map == MAP_FAILED) { fprintf(stderr, "mmap() failed\n"); return -errno; } if (madvise(map, size, MADV_HUGEPAGE)) { fprintf(stderr, "MADV_HUGEPAGE failed\n"); return -errno; } if (setup_uffd(map, size)) return 1; /* Read the whole range, populating zeropages. */ madvise(map, size, MADV_POPULATE_READ); /* Write-protect the whole range. */ uffd_wp_range(map, size, true); /* Make sure uffd-wp triggers on each page. */ for (cur = map; cur < map + size; cur += pagesize) { uffd_triggered = false; barrier(); /* Trigger a write fault. */ *cur = 1; barrier(); if (!uffd_triggered) { printf("FAIL: uffd-wp did not trigger\n"); return 1; } } printf("PASS: uffd-wp triggered\n"); return 0; } Link: https://lkml.kernel.org/r/20230302175423.589164-1-david@redhat.com Fixes: e06f1e1dd499 ("userfaultfd: wp: enabled write protection in userfaultfd API") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11mm/thp: check and bail out if page in deferred queue alreadyYin Fengwei
commit 81e506bec9be1eceaf5a2c654e28ba5176ef48d8 upstream. Kernel build regression with LLVM was reported here: https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ with commit f35b5d7d676e ("mm: align larger anonymous mappings on THP boundaries"). And the commit f35b5d7d676e was reverted. It turned out the regression is related with madvise(MADV_DONTNEED) was used by ld.lld. But with none PMD_SIZE aligned parameter len. trace-bpfcc captured: 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4 If the underneath physical page is THP, the madvise(MADV_DONTNEED) can trigger split_queue_lock contention raised significantly. perf showed following data: 14.85% 0.00% ld.lld [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe 11.52% entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_madvise do_madvise.part.0 zap_page_range unmap_single_vma unmap_page_range page_remove_rmap deferred_split_huge_page __lock_text_start native_queued_spin_lock_slowpath If THP can't be removed from rmap as whole THP, partial THP will be removed from rmap by removing sub-pages from rmap. Even the THP head page is added to deferred queue already, the split_queue_lock will be acquired and check whether the THP head page is in the queue already. Thus, the contention of split_queue_lock is raised. Before acquire split_queue_lock, check and bail out early if the THP head page is in the queue already. The checking without holding split_queue_lock could race with deferred_split_scan, but it doesn't impact the correctness here. Test result of building kernel with ld.lld: commit 7b5a0b664ebe (parent commit of f35b5d7d676e): time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all 6:07.99 real, 26367.77 user, 5063.35 sys commit f35b5d7d676e: time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all 7:22.15 real, 26235.03 user, 12504.55 sys commit f35b5d7d676e with the fixing patch: time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all 6:08.49 real, 26520.15 user, 5047.91 sys Link: https://lkml.kernel.org/r/20221223135207.2275317-1-fengwei.yin@intel.com Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11mm: memcontrol: deprecate charge movingJohannes Weiner
commit da34a8484d162585e22ed8c1e4114aa2f60e3567 upstream. Charge moving mode in cgroup1 allows memory to follow tasks as they migrate between cgroups. This is, and always has been, a questionable thing to do - for several reasons. First, it's expensive. Pages need to be identified, locked and isolated from various MM operations, and reassigned, one by one. Second, it's unreliable. Once pages are charged to a cgroup, there isn't always a clear owner task anymore. Cache isn't moved at all, for example. Mapped memory is moved - but if trylocking or isolating a page fails, it's arbitrarily left behind. Frequent moving between domains may leave a task's memory scattered all over the place. Third, it isn't really needed. Launcher tasks can kick off workload tasks directly in their target cgroup. Using dedicated per-workload groups allows fine-grained policy adjustments - no need to move tasks and their physical pages between control domains. The feature was never forward-ported to cgroup2, and it hasn't been missed. Despite it being a niche usecase, the maintenance overhead of supporting it is enormous. Because pages are moved while they are live and subject to various MM operations, the synchronization rules are complicated. There are lock_page_memcg() in MM and FS code, which non-cgroup people don't understand. In some cases we've been able to shift code and cgroup API calls around such that we can rely on native locking as much as possible. But that's fragile, and sometimes we need to hold MM locks for longer than we otherwise would (pte lock e.g.). Mark the feature deprecated. Hopefully we can remove it soon. And backport into -stable kernels so that people who develop against earlier kernels are warned about this deprecation as early as possible. [akpm@linux-foundation.org: fix memory.rst underlining] Link: https://lkml.kernel.org/r/Y5COd+qXwk/S+n8N@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22Revert "mm: Always release pages to the buddy allocator in ↵Aaron Thompson
memblock_free_late()." commit 647037adcad00f2bab8828d3d41cd0553d41f3bd upstream. This reverts commit 115d9d77bb0f9152c60b6e8646369fa7f6167593. The pages being freed by memblock_free_late() have already been initialized, but if they are in the deferred init range, __free_one_page() might access nearby uninitialized pages when trying to coalesce buddies. This can, for example, trigger this BUG: BUG: unable to handle page fault for address: ffffe964c02580c8 RIP: 0010:__list_del_entry_valid+0x3f/0x70 <TASK> __free_one_page+0x139/0x410 __free_pages_ok+0x21d/0x450 memblock_free_late+0x8c/0xb9 efi_free_boot_services+0x16b/0x25c efi_enter_virtual_mode+0x403/0x446 start_kernel+0x678/0x714 secondary_startup_64_no_verify+0xd2/0xdb </TASK> A proper fix will be more involved so revert this change for the time being. Fixes: 115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().") Signed-off-by: Aaron Thompson <dev@aaront.org> Link: https://lore.kernel.org/r/20230207082151.1303-1-dev@aaront.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-15Fix page corruption caused by racy check in __free_pagesDavid Chen
commit 462a8e08e0e6287e5ce13187257edbf24213ed03 upstream. When we upgraded our kernel, we started seeing some page corruption like the following consistently: BUG: Bad page state in process ganesha.nfsd pfn:1304ca page:0000000022261c55 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x1304ca flags: 0x17ffffc0000000() raw: 0017ffffc0000000 ffff8a513ffd4c98 ffffeee24b35ec08 0000000000000000 raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000 page dumped because: nonzero mapcount CPU: 0 PID: 15567 Comm: ganesha.nfsd Kdump: loaded Tainted: P B O 5.10.158-1.nutanix.20221209.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016 Call Trace: dump_stack+0x74/0x96 bad_page.cold+0x63/0x94 check_new_page_bad+0x6d/0x80 rmqueue+0x46e/0x970 get_page_from_freelist+0xcb/0x3f0 ? _cond_resched+0x19/0x40 __alloc_pages_nodemask+0x164/0x300 alloc_pages_current+0x87/0xf0 skb_page_frag_refill+0x84/0x110 ... Sometimes, it would also show up as corruption in the free list pointer and cause crashes. After bisecting the issue, we found the issue started from commit e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages"): if (put_page_testzero(page)) free_the_page(page, order); else if (!PageHead(page)) while (order-- > 0) free_the_page(page + (1 << order), order); So the problem is the check PageHead is racy because at this point we already dropped our reference to the page. So even if we came in with compound page, the page can already be freed and PageHead can return false and we will end up freeing all the tail pages causing double free. Fixes: e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages") Link: https://lore.kernel.org/lkml/BYAPR02MB448855960A9656EEA81141FC94D99@BYAPR02MB4488.namprd02.prod.outlook.com/ Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-15migrate: hugetlb: check for hugetlb shared PMD in node migrationMike Kravetz
[ Upstream commit 73bdf65ea74857d7fb2ec3067a3cec0e261b1462 ] migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to move pages shared with another process to a different node. page_mapcount > 1 is being used to determine if a hugetlb page is shared. However, a hugetlb page will have a mapcount of 1 if mapped by multiple processes via a shared PMD. As a result, hugetlb pages shared by multiple processes and mapped with a shared PMD can be moved by a process without CAP_SYS_NICE. To fix, check for a shared PMD if mapcount is 1. If a shared PMD is found consider the page shared. Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com Fixes: e2d8cf405525 ("migrate: add hugepage migration code to migrate_pages()") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-15mm/migration: return errno when isolate_huge_page failedMiaohe Lin
[ Upstream commit 7ce82f4c3f3ead13a9d9498768e3b1a79975c4d8 ] We might fail to isolate huge page due to e.g. the page is under migration which cleared HPageMigratable. We should return errno in this case rather than always return 1 which could confuse the user, i.e. the caller might think all of the memory is migrated while the hugetlb page is left behind. We make the prototype of isolate_huge_page consistent with isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page to isolate_hugetlb as suggested by Muchun to improve the readability. Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com Fixes: e8db67eb0ded ("mm: migrate: move_pages() supports thp migration") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Huang Ying <ying.huang@intel.com> Reported-by: kernel test robot <lkp@intel.com> (build error) Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 73bdf65ea748 ("migrate: hugetlb: check for hugetlb shared PMD in node migration") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-15mm/swapfile: add cond_resched() in get_swap_pages()Longlong Xia
commit 7717fc1a12f88701573f9ed897cc4f6699c661e3 upstream. The softlockup still occurs in get_swap_pages() under memory pressure. 64 CPU cores, 64GB memory, and 28 zram devices, the disksize of each zram device is 50MB with same priority as si. Use the stress-ng tool to increase memory pressure, causing the system to oom frequently. The plist_for_each_entry_safe() loops in get_swap_pages() could reach tens of thousands of times to find available space (extreme case: cond_resched() is not called in scan_swap_map_slots()). Let's add cond_resched() into get_swap_pages() when failed to find available space to avoid softlockup. Link: https://lkml.kernel.org/r/20230128094757.1060525-1-xialonglong1@huawei.com Signed-off-by: Longlong Xia <xialonglong1@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Chen Wandun <chenwandun@huawei.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-01panic: Consolidate open-coded panic_on_warn checksKees Cook
commit 79cc1ba7badf9e7a12af99695a557e9ce27ee967 upstream. Several run-time checkers (KASAN, UBSAN, KFENCE, KCSAN, sched) roll their own warnings, and each check "panic_on_warn". Consolidate this into a single function so that future instrumentation can be added in a single location. Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Gow <davidgow@google.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: Jann Horn <jannh@google.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: kasan-dev@googlegroups.com Cc: linux-mm@kvack.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20221117234328.594699-4-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01kasan: no need to unset panic_on_warn in end_report()Tiezhu Yang
commit e7ce7500375a63348e1d3a703c8d5003cbe3fea6 upstream. panic_on_warn is unset inside panic(), so no need to unset it before calling panic() in end_report(). Link: https://lkml.kernel.org/r/1644324666-15947-6-git-send-email-yangtiezhu@loongson.cn Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Reviewed-by: Marco Elver <elver@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Xuefeng Li <lixuefeng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-24mm/khugepaged: fix collapse_pte_mapped_thp() to allow anon_vmaHugh Dickins
commit ab0c3f1251b4670978fde0bd54161795a139b060 upstream. uprobe_write_opcode() uses collapse_pte_mapped_thp() to restore huge pmd, when removing a breakpoint from hugepage text: vma->anon_vma is always set in that case, so undo the prohibition. And MADV_COLLAPSE ought to be able to collapse some page tables in a vma which happens to have anon_vma set from CoWing elsewhere. Is anon_vma lock required? Almost not: if any page other than expected subpage of the non-anon huge page is found in the page table, collapse is aborted without making any change. However, it is possible that an anon page was CoWed from this extent in another mm or vma, in which case a concurrent lookup might look here: so keep it away while clearing pmd (but perhaps we shall go back to using pmd_lock() there in future). Note that collapse_pte_mapped_thp() is exceptional in freeing a page table without having cleared its ptes: I'm uneasy about that, and had thought pte_clear()ing appropriate; but exclusive i_mmap lock does fix the problem, and we would have to move the mmu_notification if clearing those ptes. What this fixes is not a dangerous instability. But I suggest Cc stable because uprobes "healing" has regressed in that way, so this should follow 8d3c106e19e8 into those stable releases where it was backported (and may want adjustment there - I'll supply backports as needed). Link: https://lkml.kernel.org/r/b740c9fb-edba-92ba-59fb-7a5592e5dfc@google.com Fixes: 8d3c106e19e8 ("mm/khugepaged: take the right locks for page table retraction") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-18mm: Always release pages to the buddy allocator in memblock_free_late().Aaron Thompson
[ Upstream commit 115d9d77bb0f9152c60b6e8646369fa7f6167593 ] If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages() only releases pages to the buddy allocator if they are not in the deferred range. This is correct for free pages (as defined by for_each_free_mem_pfn_range_in_zone()) because free pages in the deferred range will be initialized and released as part of the deferred init process. memblock_free_pages() is called by memblock_free_late(), which is used to free reserved ranges after memblock_free_all() has run. All pages in reserved ranges have been initialized at that point, and accordingly, those pages are not touched by the deferred init process. This means that currently, if the pages that memblock_free_late() intends to release are in the deferred range, they will never be released to the buddy allocator. They will forever be reserved. In addition, memblock_free_pages() calls kmsan_memblock_free_pages(), which is also correct for free pages but is not correct for reserved pages. KMSAN metadata for reserved pages is initialized by kmsan_init_shadow(), which runs shortly before memblock_free_all(). For both of these reasons, memblock_free_pages() should only be called for free pages, and memblock_free_late() should call __free_pages_core() directly instead. One case where this issue can occur in the wild is EFI boot on x86_64. The x86 EFI code reserves all EFI boot services memory ranges via memblock_reserve() and frees them later via memblock_free_late() (efi_reserve_boot_services() and efi_free_boot_services(), respectively). If any of those ranges happens to fall within the deferred init range, the pages will not be released and that memory will be unavailable. For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI: v6.2-rc2: # grep -E 'Node|spanned|present|managed' /proc/zoneinfo Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 178867 v6.2-rc2 + patch: # grep -E 'Node|spanned|present|managed' /proc/zoneinfo Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 222816 # +43,949 pages Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set") Signed-off-by: Aaron Thompson <dev@aaront.org> Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1ba989-000000@us-west-2.amazonses.com Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-14mm, compaction: fix fast_isolate_around() to stay within boundariesNARIBAYASHI Akira
commit be21b32afe470c5ae98e27e49201158a47032942 upstream. Depending on the memory configuration, isolate_freepages_block() may scan pages out of the target range and causes panic. Panic can occur on systems with multiple zones in a single pageblock. The reason it is rare is that it only happens in special configurations. Depending on how many similar systems there are, it may be a good idea to fix this problem for older kernels as well. The problem is that pfn as argument of fast_isolate_around() could be out of the target range. Therefore we should consider the case where pfn < start_pfn, and also the case where end_pfn < pfn. This problem should have been addressd by the commit 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone") but there was an oversight. Case1: pfn < start_pfn <at memory compaction for node Y> | node X's zone | node Y's zone +-----------------+------------------------------... pageblock ^ ^ ^ +-----------+-----------+-----------+-----------+... ^ ^ ^ ^ ^ end_pfn ^ start_pfn = cc->zone->zone_start_pfn pfn <---------> scanned range by "Scan After" Case2: end_pfn < pfn <at memory compaction for node X> | node X's zone | node Y's zone +-----------------+------------------------------... pageblock ^ ^ ^ +-----------+-----------+-----------+-----------+... ^ ^ ^ ^ ^ pfn ^ end_pfn start_pfn <---------> scanned range by "Scan Before" It seems that there is no good reason to skip nr_isolated pages just after given pfn. So let perform simple scan from start to end instead of dividing the scan into "Before" and "After". Link: https://lkml.kernel.org/r/20221026112438.236336-1-a.naribayashi@fujitsu.com Fixes: 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone"). Signed-off-by: NARIBAYASHI Akira <a.naribayashi@fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-14mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb pageBaolin Wang
commit fac35ba763ed07ba93154c95ffc0c4a55023707f upstream. On some architectures (like ARM64), it can support CONT-PTE/PMD size hugetlb, which means it can support not only PMD/PUD size hugetlb (2M and 1G), but also CONT-PTE/PMD size(64K and 32M) if a 4K page size specified. So when looking up a CONT-PTE size hugetlb page by follow_page(), it will use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size hugetlb in follow_page_pte(). However this pte entry lock is incorrect for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get the correct lock, which is mm->page_table_lock. That means the pte entry of the CONT-PTE size hugetlb under current pte lock is unstable in follow_page_pte(), we can continue to migrate or poison the pte entry of the CONT-PTE size hugetlb, which can cause some potential race issues, even though they are under the 'pte lock'. For example, suppose thread A is trying to look up a CONT-PTE size hugetlb page by move_pages() syscall under the lock, however antoher thread B can migrate the CONT-PTE hugetlb page at the same time, which will cause thread A to get an incorrect page, if thread A also wants to do page migration, then data inconsistency error occurs. Moreover we have the same issue for CONT-PMD size hugetlb in follow_huge_pmd(). To fix above issues, rename the follow_huge_pmd() as follow_huge_pmd_pte() to handle PMD and PTE level size hugetlb, which uses huge_pte_lock() to get the correct pte entry lock to make the pte entry stable. Mike said: Support for CONT_PMD/_PTE was added with bb9dd3df8ee9 ("arm64: hugetlb: refactor find_num_contig()"). Patch series "Support for contiguous pte hugepages", v4. However, I do not believe these code paths were executed until migration support was added with 5480280d3f2d ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages") I would go with 5480280d3f2d for the Fixes: targe. Link: https://lkml.kernel.org/r/635f43bdd85ac2615a58405da82b4d33c6e5eb05.1662017562.git.baolin.wang@linux.alibaba.com Fixes: 5480280d3f2d ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-14mm/gup: fix gup_pud_range() for daxJohn Starks
commit fcd0ccd836ffad73d98a66f6fea7b16f735ea920 upstream. For dax pud, pud_huge() returns true on x86. So the function works as long as hugetlb is configured. However, dax doesn't depend on hugetlb. Commit 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax") fixed devmap-backed huge PMDs, but missed devmap-backed huge PUDs. Fix this as well. This fixes the below kernel panic: general protection fault, probably for non-canonical address 0x69e7c000cc478: 0000 [#1] SMP < snip > Call Trace: <TASK> get_user_pages_fast+0x1f/0x40 iov_iter_get_pages+0xc6/0x3b0 ? mempool_alloc+0x5d/0x170 bio_iov_iter_get_pages+0x82/0x4e0 ? bvec_alloc+0x91/0xc0 ? bio_alloc_bioset+0x19a/0x2a0 blkdev_direct_IO+0x282/0x480 ? __io_complete_rw_common+0xc0/0xc0 ? filemap_range_has_page+0x82/0xc0 generic_file_direct_write+0x9d/0x1a0 ? inode_update_time+0x24/0x30 __generic_file_write_iter+0xbd/0x1e0 blkdev_write_iter+0xb4/0x150 ? io_import_iovec+0x8d/0x340 io_write+0xf9/0x300 io_issue_sqe+0x3c3/0x1d30 ? sysvec_reschedule_ipi+0x6c/0x80 __io_queue_sqe+0x33/0x240 ? fget+0x76/0xa0 io_submit_sqes+0xe6a/0x18d0 ? __fget_light+0xd1/0x100 __x64_sys_io_uring_enter+0x199/0x880 ? __context_tracking_enter+0x1f/0x70 ? irqentry_exit_to_user_mode+0x24/0x30 ? irqentry_exit+0x1d/0x30 ? __context_tracking_exit+0xe/0x70 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7fc97c11a7be < snip > </TASK> ---[ end trace 48b2e0e67debcaeb ]--- RIP: 0010:internal_get_user_pages_fast+0x340/0x990 < snip > Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Link: https://lkml.kernel.org/r/1670392853-28252-1-git-send-email-ssengar@linux.microsoft.com Fixes: 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax") Signed-off-by: John Starks <jostarks@microsoft.com> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Cc: Jan Kara <jack@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-14memcg: fix possible use-after-free in memcg_write_event_control()Tejun Heo
commit 4a7ba45b1a435e7097ca0f79a847d0949d0eb088 upstream. memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it allowing any file to slip through. With the invarients broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> [3.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-14mm/khugepaged: invoke MMU notifiers in shmem/file collapse pathsJann Horn
commit f268f6cf875f3220afc77bdd0bf1bb136eb54db9 upstream. Any codepath that zaps page table entries must invoke MMU notifiers to ensure that secondary MMUs (like KVM) don't keep accessing pages which aren't mapped anymore. Secondary MMUs don't hold their own references to pages that are mirrored over, so failing to notify them can lead to page use-after-free. I'm marking this as addressing an issue introduced in commit f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of the security impact of this only came in commit 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP"), which actually omitted flushes for the removal of present PTEs, not just for the removal of empty page tables. Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [manual backport: this code was refactored from two copies into a common helper between 5.15 and 6.0] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-14mm/khugepaged: fix GUP-fast interaction by sending IPIJann Horn
commit 2ba99c5e08812494bc57f319fb562f527d9bacd8 upstream. Since commit 70cbc3cc78a99 ("mm: gup: fix the fast GUP race against THP collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to ensure that the page table was not removed by khugepaged in between. However, lockless_pages_from_mm() still requires that the page table is not concurrently freed. Fix it by sending IPIs (if the architecture uses semi-RCU-style page table freeing) before freeing/reusing page tables. Link: https://lkml.kernel.org/r/20221129154730.2274278-2-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-2-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-2-jannh@google.com Fixes: ba76149f47d8 ("thp: khugepaged") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [manual backport: two of the three places in khugepaged that can free ptes were refactored into a common helper between 5.15 and 6.0] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-14mm/khugepaged: take the right locks for page table retractionJann Horn
commit 8d3c106e19e8d251da31ff4cc7462e4565d65084 upstream. pagetable walks on address ranges mapped by VMAs can be done under the mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the VMA's address_space. Only one of these needs to be held, and it does not need to be held in exclusive mode. Under those circumstances, the rules for concurrent access to page table entries are: - Terminal page table entries (entries that don't point to another page table) can be arbitrarily changed under the page table lock, with the exception that they always need to be consistent for hardware page table walks and lockless_pages_from_mm(). This includes that they can be changed into non-terminal entries. - Non-terminal page table entries (which point to another page table) can not be modified; readers are allowed to READ_ONCE() an entry, verify that it is non-terminal, and then assume that its value will stay as-is. Retracting a page table involves modifying a non-terminal entry, so page-table-level locks are insufficient to protect against concurrent page table traversal; it requires taking all the higher-level locks under which it is possible to start a page walk in the relevant range in exclusive mode. The collapse_huge_page() path for anonymous THP already follows this rule, but the shmem/file THP path was getting it wrong, making it possible for concurrent rmap-based operations to cause corruption. Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [manual backport: this code was refactored from two copies into a common helper between 5.15 and 6.0] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-08v4l2: don't fall back to follow_pfn() if pin_user_pages_fast() failsLinus Torvalds
commit 6647e76ab623b2b3fb2efe03a86e9c9046c52c33 upstream. The V4L2_MEMORY_USERPTR interface is long deprecated and shouldn't be used (and is discouraged for any modern v4l drivers). And Seth Jenkins points out that the fallback to VM_PFNMAP/VM_IO is fundamentally racy and dangerous. Note that it's not even a case that should trigger, since any normal user pointer logic ends up just using the pin_user_pages_fast() call that does the proper page reference counting. That's not the problem case, only if you try to use special device mappings do you have any issues. Normally I'd just remove this during the merge window, but since Seth pointed out the problem cases, we really want to know as soon as possible if there are actually any users of this odd special case of a legacy interface. Neither Hans nor Mauro seem to think that such mis-uses of the old legacy interface should exist. As Mauro says: "See, V4L2 has actually 4 streaming APIs: - Kernel-allocated mmap (usually referred simply as just mmap); - USERPTR mmap; - read(); - dmabuf; The USERPTR is one of the oldest way to use it, coming from V4L version 1 times, and by far the least used one" And Hans chimed in on the USERPTR interface: "To be honest, I wouldn't mind if it goes away completely, but that's a bit of a pipe dream right now" but while removing this legacy interface entirely may be a pipe dream we can at least try to remove the unlikely (and actively broken) case of using special device mappings for USERPTR accesses. This replaces it with a WARN_ONCE() that we can remove once we've hopefully confirmed that no actual users exist. NOTE! Longer term, this means that a 'struct frame_vector' only ever contains proper page pointers, and all the games we have with converting them to pages can go away (grep for 'frame_vector_to_pages()' and the uses of 'vec->is_pfns'). But this is just the first step, to verify that this code really is all dead, and do so as quickly as possible. Reported-by: Seth Jenkins <sethjenkins@google.com> Acked-by: Hans Verkuil <hverkuil@xs4all.nl> Acked-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-02mm: vmscan: fix extreme overreclaim and swap floodsJohannes Weiner
commit f53af4285d775cd9a9a146fc438bd0a1bee1838a upstream. During proactive reclaim, we sometimes observe severe overreclaim, with several thousand times more pages reclaimed than requested. This trace was obtained from shrink_lruvec() during such an instance: prio:0 anon_cost:1141521 file_cost:7767 nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190) nr=[7161123 345 578 1111] While he reclaimer requested 4M, vmscan reclaimed close to 16G, most of it by swapping. These requests take over a minute, during which the write() to memory.reclaim is unkillably stuck inside the kernel. Digging into the source, this is caused by the proportional reclaim bailout logic. This code tries to resolve a fundamental conflict: to reclaim roughly what was requested, while also aging all LRUs fairly and in accordance to their size, swappiness, refault rates etc. The way it attempts fairness is that once the reclaim goal has been reached, it stops scanning the LRUs with the smaller remaining scan targets, and adjusts the remainder of the bigger LRUs according to how much of the smaller LRUs was scanned. It then finishes scanning that remainder regardless of the reclaim goal. This works fine if priority levels are low and the LRU lists are comparable in size. However, in this instance, the cgroup that is targeted by proactive reclaim has almost no files left - they've already been squeezed out by proactive reclaim earlier - and the remaining anon pages are hot. Anon rotations cause the priority level to drop to 0, which results in reclaim targeting all of anon (a lot) and all of file (almost nothing). By the time reclaim decides to bail, it has scanned most or all of the file target, and therefor must also scan most or all of the enormous anon target. This target is thousands of times larger than the reclaim goal, thus causing the overreclaim. The bailout code hasn't changed in years, why is this failing now? The most likely explanations are two other recent changes in anon reclaim: 1. Before the series starting with commit 5df741963d52 ("mm: fix LRU balancing effect of new transparent huge pages"), the VM was overall relatively reluctant to swap at all, even if swap was configured. This means the LRU balancing code didn't come into play as often as it does now, and mostly in high pressure situations where pronounced swap activity wouldn't be as surprising. 2. For historic reasons, shrink_lruvec() loops on the scan targets of all LRU lists except the active anon one, meaning it would bail if the only remaining pages to scan were active anon - even if there were a lot of them. Before the series starting with commit ccc5dc67340c ("mm/vmscan: make active/inactive ratio as 1:1 for anon lru"), most anon pages would live on the active LRU; the inactive one would contain only a handful of preselected reclaim candidates. After the series, anon gets aged similarly to file, and the inactive list is the default for new anon pages as well, making it often the much bigger list. As a result, the VM is now more likely to actually finish large anon targets than before. Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the larger LRU lists is made before bailing out on a met reclaim goal. This fixes the extreme overreclaim problem. Fairness is more subtle and harder to evaluate. No obvious misbehavior was observed on the test workload, in any case. Conceptually, fairness should primarily be a cumulative effect from regular, lower priority scans. Once the VM is in trouble and needs to escalate scan targets to make forward progress, fairness needs to take a backseat. This is also acknowledged by the myriad exceptions in get_scan_count(). This patch makes fairness decrease gradually, as it keeps fairness work static over increasing priority levels with growing scan targets. This should make more sense - although we may have to re-visit the exact values. Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-25mm: fs: initialize fsdata passed to write_begin/write_end interfaceAlexander Potapenko
commit 1468c6f4558b1bcd92aa0400f2920f9dc7588402 upstream. Functions implementing the a_ops->write_end() interface accept the `void *fsdata` parameter that is supposed to be initialized by the corresponding a_ops->write_begin() (which accepts `void **fsdata`). However not all a_ops->write_begin() implementations initialize `fsdata` unconditionally, so it may get passed uninitialized to a_ops->write_end(), resulting in undefined behavior. Fix this by initializing fsdata with NULL before the call to write_begin(), rather than doing so in all possible a_ops implementations. This patch covers only the following cases found by running x86 KMSAN under syzkaller: - generic_perform_write() - cont_expand_zero() and generic_cont_expand_simple() - page_symlink() Other cases of passing uninitialized fsdata may persist in the codebase. Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-25maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault()Alban Crequy
commit 8678ea06852cd1f819b870c773d43df888d15d46 upstream. If a page fault occurs while copying the first byte, this function resets one byte before dst. As a consequence, an address could be modified and leaded to kernel crashes if case the modified address was accessed later. Fixes: b58294ead14c ("maccess: allow architectures to provide kernel probing directly") Signed-off-by: Alban Crequy <albancrequy@linux.microsoft.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Francis Laniel <flaniel@linux.microsoft.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> [5.8] Link: https://lore.kernel.org/bpf/20221110085614.111213-2-albancrequy@linux.microsoft.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-16mm/memremap.c: map FS_DAX device memory as decryptedPankaj Gupta
commit 867400af90f1f953ff9e10b1b87ecaf9369a7eb8 upstream. virtio_pmem use devm_memremap_pages() to map the device memory. By default this memory is mapped as encrypted with SEV. Guest reboot changes the current encryption key and guest no longer properly decrypts the FSDAX device meta data. Mark the corresponding device memory region for FSDAX devices (mapped with memremap_pages) as decrypted to retain the persistent memory property. Link: https://lkml.kernel.org/r/20221102160728.3184016-1-pankaj.gupta@amd.com Fixes: b7b3c01b19159 ("mm/memremap_pages: support multiple ranges per invocation") Signed-off-by: Pankaj Gupta <pankaj.gupta@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-03mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pagesRik van Riel
commit 12df140f0bdfae5dcfc81800970dd7f6f632e00c upstream. The h->*_huge_pages counters are protected by the hugetlb_lock, but alloc_huge_page has a corner case where it can decrement the counter outside of the lock. This could lead to a corrupted value of h->resv_huge_pages, which we have observed on our systems. Take the hugetlb_lock before decrementing h->resv_huge_pages to avoid a potential race. Link: https://lkml.kernel.org/r/20221017202505.0e6a4fcd@imladris.surriel.com Fixes: a88c76954804 ("mm: hugetlb: fix hugepage memory leak caused by wrong reserve count") Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Glen McCready <gkmccready@meta.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-11-03mm/memory: add non-anonymous page check in the copy_present_page()Yuanzheng Song
The vma->anon_vma of the child process may be NULL because the entire vma does not contain anonymous pages. In this case, a BUG will occur when the copy_present_page() passes a copy of a non-anonymous page of that vma to the page_add_new_anon_rmap() to set up new anonymous rmap. ------------[ cut here ]------------ kernel BUG at mm/rmap.c:1044! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: CPU: 2 PID: 3617 Comm: test Not tainted 5.10.149 #1 Hardware name: linux,dummy-virt (DT) pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--) pc : __page_set_anon_rmap+0xbc/0xf8 lr : __page_set_anon_rmap+0xbc/0xf8 sp : ffff800014c1b870 x29: ffff800014c1b870 x28: 0000000000000001 x27: 0000000010100073 x26: ffff1d65c517baa8 x25: ffff1d65cab0f000 x24: ffff1d65c416d800 x23: ffff1d65cab5f248 x22: 0000000020000000 x21: 0000000000000001 x20: 0000000000000000 x19: fffffe75970023c0 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc3096d5fb858 x8 : 0000000000000000 x7 : 0000000000000011 x6 : ffff5a5c9089c000 x5 : 0000000000020000 x4 : ffff5a5c9089c000 x3 : ffffc3096d200000 x2 : ffffc3096e8d0000 x1 : ffff1d65ca3da740 x0 : 0000000000000000 Call trace: __page_set_anon_rmap+0xbc/0xf8 page_add_new_anon_rmap+0x1e0/0x390 copy_pte_range+0xd00/0x1248 copy_page_range+0x39c/0x620 dup_mmap+0x2e0/0x5a8 dup_mm+0x78/0x140 copy_process+0x918/0x1a20 kernel_clone+0xac/0x638 __do_sys_clone+0x78/0xb0 __arm64_sys_clone+0x30/0x40 el0_svc_common.constprop.0+0xb0/0x308 do_el0_svc+0x48/0xb8 el0_svc+0x24/0x38 el0_sync_handler+0x160/0x168 el0_sync+0x180/0x1c0 Code: 97f8ff85 f9400294 17ffffeb 97f8ff82 (d4210000) ---[ end trace a972347688dc9bd4 ]--- Kernel panic - not syncing: Oops - BUG: Fatal exception SMP: stopping secondary CPUs Kernel Offset: 0x43095d200000 from 0xffff800010000000 PHYS_OFFSET: 0xffffe29a80000000 CPU features: 0x08200022,61806082 Memory Limit: none ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]--- This problem has been fixed by the commit <fb3d824d1a46> ("mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()"), but still exists in the linux-5.10.y branch. This patch is not applicable to this version because of the large version differences. Therefore, fix it by adding non-anonymous page check in the copy_present_page(). Cc: stable@vger.kernel.org Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes") Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com> Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26mm: hugetlb: fix UAF in hugetlb_handle_userfaultLiu Shixin
commit 958f32ce832ba781ac20e11bb2d12a9352ea28fc upstream. The vma_lock and hugetlb_fault_mutex are dropped before handling userfault and reacquire them again after handle_userfault(), but reacquire the vma_lock could lead to UAF[1,2] due to the following race, hugetlb_fault hugetlb_no_page /*unlock vma_lock */ hugetlb_handle_userfault handle_userfault /* unlock mm->mmap_lock*/ vm_mmap_pgoff do_mmap mmap_region munmap_vma_range /* clean old vma */ /* lock vma_lock again <--- UAF */ /* unlock vma_lock */ Since the vma_lock will unlock immediately after hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in hugetlb_handle_userfault() to fix the issue. [1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/ [2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/ Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reported-by: syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com Reported-by: Liu Zixian <liuzixian4@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-26mm/mmap: undo ->mmap() when arch_validate_flags() failsCarlos Llamas
commit deb0f6562884b5b4beb883d73e66a7d3a1b96d99 upstream. Commit c462ac288f2c ("mm: Introduce arch_validate_flags()") added a late check in mmap_region() to let architectures validate vm_flags. The check needs to happen after calling ->mmap() as the flags can potentially be modified during this callback. If arch_validate_flags() check fails we unmap and free the vma. However, the error path fails to undo the ->mmap() call that previously succeeded and depending on the specific ->mmap() implementation this translates to reference increments, memory allocations and other operations what will not be cleaned up. There are several places (mainly device drivers) where this is an issue. However, one specific example is bpf_map_mmap() which keeps count of the mappings in map->writecnt. The count is incremented on ->mmap() and then decremented on vm_ops->close(). When arch_validate_flags() fails this count is off since bpf_map_mmap_close() is never called. One can reproduce this issue in arm64 devices with MTE support. Here the vm_flags are checked to only allow VM_MTE if VM_MTE_ALLOWED has been set previously. From userspace then is enough to pass the PROT_MTE flag to mmap() syscall to trigger the arch_validate_flags() failure. The following program reproduces this issue: #include <stdio.h> #include <unistd.h> #include <linux/unistd.h> #include <linux/bpf.h> #include <sys/mman.h> int main(void) { union bpf_attr attr = { .map_type = BPF_MAP_TYPE_ARRAY, .key_size = sizeof(int), .value_size = sizeof(long long), .max_entries = 256, .map_flags = BPF_F_MMAPABLE, }; int fd; fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)); mmap(NULL, 4096, PROT_WRITE | PROT_MTE, MAP_SHARED, fd, 0); return 0; } By manually adding some log statements to the vm_ops callbacks we can confirm that when passing PROT_MTE to mmap() the map->writecnt is off upon ->release(): With PROT_MTE flag: root@debian:~# ./bpf-test [ 111.263874] bpf_map_write_active_inc: map=9 writecnt=1 [ 111.288763] bpf_map_release: map=9 writecnt=1 Without PROT_MTE flag: root@debian:~# ./bpf-test [ 157.816912] bpf_map_write_active_inc: map=10 writecnt=1 [ 157.830442] bpf_map_write_active_dec: map=10 writecnt=0 [ 157.832396] bpf_map_release: map=10 writecnt=0 This patch fixes the above issue by calling vm_ops->close() when the arch_validate_flags() check fails, after this we can proceed to unmap and free the vma on the error path. Link: https://lkml.kernel.org/r/20220930003844.1210987-1-cmllamas@google.com Fixes: c462ac288f2c ("mm: Introduce arch_validate_flags()") Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Liam Howlett <liam.howlett@oracle.com> Cc: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> [5.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-15mm: gup: fix the fast GUP race against THP collapseYang Shi
commit 70cbc3cc78a997d8247b50389d37c4e1736019da upstream. Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer sufficient to handle concurrent GUP-fast in all cases, it only handles traditional IPI-based GUP-fast correctly. On architectures that send an IPI broadcast on TLB flush, it works as expected. But on the architectures that do not use IPI to broadcast TLB flush, it may have the below race: CPU A CPU B THP collapse fast GUP gup_pmd_range() <-- see valid pmd gup_pte_range() <-- work on pte pmdp_collapse_flush() <-- clear pmd and flush __collapse_huge_page_isolate() check page pinned <-- before GUP bump refcount pin the page check PTE <-- no change __collapse_huge_page_copy() copy data to huge page ptep_clear() install huge pmd for the huge page return the stale page discard the stale page The race can be fixed by checking whether PMD is changed or not after taking the page pin in fast GUP, just like what it does for PTE. If the PMD is changed it means there may be parallel THP collapse, so GUP should back off. Also update the stale comment about serializing against fast GUP in khugepaged. Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-05mm: fix madivse_pageout mishandling on non-LRU pageMinchan Kim
commit 58d426a7ba92870d489686dfdb9d06b66815a2ab upstream. MADV_PAGEOUT tries to isolate non-LRU pages and gets a warning from isolate_lru_page below. Fix it by checking PageLRU in advance. ------------[ cut here ]------------ trying to isolate tail page WARNING: CPU: 0 PID: 6175 at mm/folio-compat.c:158 isolate_lru_page+0x130/0x140 Modules linked in: CPU: 0 PID: 6175 Comm: syz-executor.0 Not tainted 5.18.12 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:isolate_lru_page+0x130/0x140 Link: https://lore.kernel.org/linux-mm/485f8c33.2471b.182d5726afb.Coremail.hantianshuo@iie.ac.cn/ Link: https://lkml.kernel.org/r/20220908151204.762596-1-minchan@kernel.org Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: 韩天ç`• <hantianshuo@iie.ac.cn> Suggested-by: Yang Shi <shy828301@gmail.com> Acked-by: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-05mm/migrate_device.c: flush TLB while holding PTLAlistair Popple
commit 60bae73708963de4a17231077285bd9ff2f41c44 upstream. When clearing a PTE the TLB should be flushed whilst still holding the PTL to avoid a potential race with madvise/munmap/etc. For example consider the following sequence: CPU0 CPU1 ---- ---- migrate_vma_collect_pmd() pte_unmap_unlock() madvise(MADV_DONTNEED) -> zap_pte_range() pte_offset_map_lock() [ PTE not present, TLB not flushed ] pte_unmap_unlock() [ page is still accessible via stale TLB ] flush_tlb_range() In this case the page may still be accessed via the stale TLB entry after madvise returns. Fix this by flushing the TLB while holding the PTL. Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages") Link: https://lkml.kernel.org/r/9f801e9d8d830408f2ca27821f606e09aa856899.1662078528.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reported-by: Nadav Amit <nadav.amit@gmail.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: huang ying <huang.ying.caritas@gmail.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-05mm: prevent page_frag_alloc() from corrupting the memoryMaurizio Lombardi
commit dac22531bbd4af2426c4e29e05594415ccfa365d upstream. A number of drivers call page_frag_alloc() with a fragment's size > PAGE_SIZE. In low memory conditions, __page_frag_cache_refill() may fail the order 3 cache allocation and fall back to order 0; In this case, the cache will be smaller than the fragment, causing memory corruptions. Prevent this from happening by checking if the newly allocated cache is large enough for the fragment; if not, the allocation will fail and page_frag_alloc() will return NULL. Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com Fixes: b63ae8ca096d ("mm/net: Rename and move page fragment handling from net/ to mm/") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Cc: Chen Lin <chen45464546@163.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-10-05mm/page_alloc: fix race condition between build_all_zonelists and page ↵Mel Gorman
allocation commit 3d36424b3b5850bd92f3e89b953a430d7cfc88ef upstream. Patrick Daly reported the following problem; NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation [0] - ZONE_MOVABLE [1] - ZONE_NORMAL [2] - NULL For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent memory_offline operation removes the last page from ZONE_MOVABLE, build_all_zonelists() & build_zonerefs_node() will update node_zonelists as shown below. Only populated zones are added. NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation [0] - ZONE_NORMAL [1] - NULL [2] - NULL The race is simple -- page allocation could be in progress when a memory hot-remove operation triggers a zonelist rebuild that removes zones. The allocation request will still have a valid ac->preferred_zoneref that is now pointing to NULL and triggers an OOM kill. This problem probably always existed but may be slightly easier to trigger due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") which distinguishes between zones that are completely unpopulated versus zones that have valid pages not managed by the buddy allocator (e.g. reserved, memblock, ballooning etc). Memory hotplug had multiple stages with timing considerations around managed/present page updates, the zonelist rebuild and the zone span updates. As David Hildenbrand puts it memory offlining adjusts managed+present pages of the zone essentially in one go. If after the adjustments, the zone is no longer populated (present==0), we rebuild the zone lists. Once that's done, we try shrinking the zone (start+spanned pages) -- which results in zone_start_pfn == 0 if there are no more pages. That happens *after* rebuilding the zonelists via remove_pfn_range_from_zone(). The only requirement to fix the race is that a page allocation request identifies when a zonelist rebuild has happened since the allocation request started and no page has yet been allocated. Use a seqlock_t to track zonelist updates with a lockless read-side of the zonelist and protecting the rebuild and update of the counter with a spinlock. [akpm@linux-foundation.org: make zonelist_update_seq static] Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.net Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Patrick Daly <quic_pdaly@quicinc.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> [4.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-28mm/slub: fix to return errno if kmalloc() failsChao Yu
commit 7e9c323c52b379d261a72dc7bd38120a761a93cd upstream. In create_unique_id(), kmalloc(, GFP_KERNEL) can fail due to out-of-memory, if it fails, return errno correctly rather than triggering panic via BUG_ON(); kernel BUG at mm/slub.c:5893! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Call trace: sysfs_slab_add+0x258/0x260 mm/slub.c:5973 __kmem_cache_create+0x60/0x118 mm/slub.c:4899 create_cache mm/slab_common.c:229 [inline] kmem_cache_create_usercopy+0x19c/0x31c mm/slab_common.c:335 kmem_cache_create+0x1c/0x28 mm/slab_common.c:390 f2fs_kmem_cache_create fs/f2fs/f2fs.h:2766 [inline] f2fs_init_xattr_caches+0x78/0xb4 fs/f2fs/xattr.c:808 f2fs_fill_super+0x1050/0x1e0c fs/f2fs/super.c:4149 mount_bdev+0x1b8/0x210 fs/super.c:1400 f2fs_mount+0x44/0x58 fs/f2fs/super.c:4512 legacy_get_tree+0x30/0x74 fs/fs_context.c:610 vfs_get_tree+0x40/0x140 fs/super.c:1530 do_new_mount+0x1dc/0x4e4 fs/namespace.c:3040 path_mount+0x358/0x914 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __arm64_sys_mount+0x2f8/0x408 fs/namespace.c:3568 Cc: <stable@kernel.org> Fixes: 81819f0fc8285 ("SLUB core") Reported-by: syzbot+81684812ea68216e08c5@syzkaller.appspotmail.com Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Chao Yu <chao.yu@oppo.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-20mm: Fix TLB flush for not-first PFNMAP mappings in unmap_region()Jann Horn
This is a stable-specific patch. I botched the stable-specific rewrite of commit b67fbebd4cf98 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"): As Hugh pointed out, unmap_region() actually operates on a list of VMAs, and the variable "vma" merely points to the first VMA in that list. So if we want to check whether any of the VMAs we're operating on is PFNMAP or MIXEDMAP, we have to iterate through the list and check each VMA. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-15Revert "mm: kmemleak: take a full lowmem check in kmemleak_*_phys()"Yee Lee
This reverts commit 23c2d497de21f25898fbea70aeb292ab8acc8c94. Commit 23c2d497de21 ("mm: kmemleak: take a full lowmem check in kmemleak_*_phys()") brought false leak alarms on some archs like arm64 that does not init pfn boundary in early booting. The final solution lands on linux-6.0: commit 0c24e061196c ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA"). Revert this commit before linux-6.0. The original issue of invalid PA can be mitigated by additional check in devicetree. The false alarm report is as following: Kmemleak output: (Qemu/arm64) unreferenced object 0xffff0000c0170a00 (size 128): comm "swapper/0", pid 1, jiffies 4294892404 (age 126.208s) hex dump (first 32 bytes): 62 61 73 65 00 00 00 00 00 00 00 00 00 00 00 00 base............ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<(____ptrval____)>] __kmalloc_track_caller+0x1b0/0x2e4 [<(____ptrval____)>] kstrdup_const+0x8c/0xc4 [<(____ptrval____)>] kvasprintf_const+0xbc/0xec [<(____ptrval____)>] kobject_set_name_vargs+0x58/0xe4 [<(____ptrval____)>] kobject_add+0x84/0x100 [<(____ptrval____)>] __of_attach_node_sysfs+0x78/0xec [<(____ptrval____)>] of_core_init+0x68/0x104 [<(____ptrval____)>] driver_init+0x28/0x48 [<(____ptrval____)>] do_basic_setup+0x14/0x28 [<(____ptrval____)>] kernel_init_freeable+0x110/0x178 [<(____ptrval____)>] kernel_init+0x20/0x1a0 [<(____ptrval____)>] ret_from_fork+0x10/0x20 This pacth is also applicable to linux-5.17.y/linux-5.18.y/linux-5.19.y Cc: <stable@vger.kernel.org> Signed-off-by: Yee Lee <yee.lee@mediatek.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-08mm: pagewalk: Fix race between unmap and page walkerSteven Price
[ Upstream commit 8782fb61cc848364e1e1599d76d3c9dd58a1cc06 ] The mmap lock protects the page walker from changes to the page tables during the walk. However a read lock is insufficient to protect those areas which don't have a VMA as munmap() detaches the VMAs before downgrading to a read lock and actually tearing down PTEs/page tables. For users of walk_page_range() the solution is to simply call pte_hole() immediately without checking the actual page tables when a VMA is not present. We now never call __walk_page_range() without a valid vma. For walk_page_range_novma() the locking requirements are tightened to require the mmap write lock to be taken, and then walking the pgd directly with 'no_vma' set. This in turn means that all page walkers either have a valid vma, or it's that special 'novma' case for page table debugging. As a result, all the odd '(!walk->vma && !walk->no_vma)' tests can be removed. Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Steven Price <steven.price@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-05mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuseJann Horn
commit 2555283eb40df89945557273121e9393ef9b542b upstream. anon_vma->degree tracks the combined number of child anon_vmas and VMAs that use the anon_vma as their ->anon_vma. anon_vma_clone() then assumes that for any anon_vma attached to src->anon_vma_chain other than src->anon_vma, it is impossible for it to be a leaf node of the VMA tree, meaning that for such VMAs ->degree is elevated by 1 because of a child anon_vma, meaning that if ->degree equals 1 there are no VMAs that use the anon_vma as their ->anon_vma. This assumption is wrong because the ->degree optimization leads to leaf nodes being abandoned on anon_vma_clone() - an existing anon_vma is reused and no new parent-child relationship is created. So it is possible to reuse an anon_vma for one VMA while it is still tied to another VMA. This is an issue because is_mergeable_anon_vma() and its callers assume that if two VMAs have the same ->anon_vma, the list of anon_vmas attached to the VMAs is guaranteed to be the same. When this assumption is violated, vma_merge() can merge pages into a VMA that is not attached to the corresponding anon_vma, leading to dangling page->mapping pointers that will be dereferenced during rmap walks. Fix it by separately tracking the number of child anon_vmas and the number of VMAs using the anon_vma as their ->anon_vma. Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Cc: stable@kernel.org Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-05mm: Force TLB flush for PFNMAP mappings before unlink_file_vma()Jann Horn
commit b67fbebd4cf980aecbcc750e1462128bffe8ae15 upstream. Some drivers rely on having all VMAs through which a PFN might be accessible listed in the rmap for correctness. However, on X86, it was possible for a VMA with stale TLB entries to not be listed in the rmap. This was fixed in mainline with commit b67fbebd4cf9 ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"), but that commit relies on preceding refactoring in commit 18ba064e42df3 ("mmu_gather: Let there be one tlb_{start,end}_vma() implementation") and commit 1e9fdf21a4339 ("mmu_gather: Remove per arch tlb_{start,end}_vma()"). This patch provides equivalent protection without needing that refactoring, by forcing a TLB flush between removing PTEs in unmap_vmas() and the call to unlink_file_vma() in free_pgtables(). [This is a stable-specific rewrite of the upstream commit!] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31mm/hugetlb: fix hugetlb not supporting softdirty trackingDavid Hildenbrand
commit f96f7a40874d7c746680c0b9f57cef2262ae551f upstream. Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2. I observed that hugetlb does not support/expect write-faults in shared mappings that would have to map the R/O-mapped page writable -- and I found two case where we could currently get such faults and would erroneously map an anon page into a shared mapping. Reproducers part of the patches. I propose to backport both fixes to stable trees. The first fix needs a small adjustment. This patch (of 2): Staring at hugetlb_wp(), one might wonder where all the logic for shared mappings is when stumbling over a write-protected page in a shared mapping. In fact, there is none, and so far we thought we could get away with that because e.g., mprotect() should always do the right thing and map all pages directly writable. Looks like we were wrong: -------------------------------------------------------------------------- #include <stdio.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #include <sys/mman.h> #define HUGETLB_SIZE (2 * 1024 * 1024u) static void clear_softdirty(void) { int fd = open("/proc/self/clear_refs", O_WRONLY); const char *ctrl = "4"; int ret; if (fd < 0) { fprintf(stderr, "open(clear_refs) failed\n"); exit(1); } ret = write(fd, ctrl, strlen(ctrl)); if (ret != strlen(ctrl)) { fprintf(stderr, "write(clear_refs) failed\n"); exit(1); } close(fd); } int main(int argc, char **argv) { char *map; int fd; fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT); if (!fd) { fprintf(stderr, "open() failed\n"); return -errno; } if (ftruncate(fd, HUGETLB_SIZE)) { fprintf(stderr, "ftruncate() failed\n"); return -errno; } map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (map == MAP_FAILED) { fprintf(stderr, "mmap() failed\n"); return -errno; } *map = 0; if (mprotect(map, HUGETLB_SIZE, PROT_READ)) { fprintf(stderr, "mmprotect() failed\n"); return -errno; } clear_softdirty(); if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) { fprintf(stderr, "mmprotect() failed\n"); return -errno; } *map = 0; return 0; } -------------------------------------------------------------------------- Above test fails with SIGBUS when there is only a single free hugetlb page. # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test Bus error (core dumped) And worse, with sufficient free hugetlb pages it will map an anonymous page into a shared mapping, for example, messing up accounting during unmap and breaking MAP_SHARED semantics: # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test # cat /proc/meminfo | grep HugePages_ HugePages_Total: 2 HugePages_Free: 1 HugePages_Rsvd: 18446744073709551615 HugePages_Surp: 0 Reason in this particular case is that vma_wants_writenotify() will return "true", removing VM_SHARED in vma_set_page_prot() to map pages write-protected. Let's teach vma_wants_writenotify() that hugetlb does not support softdirty tracking. Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> [3.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31mm/huge_memory.c: use helper function migration_entry_to_page()Miaohe Lin
[ Upstream commit a44f89dc6c5f8ba70240b81a570260d29d04bcb0 ] It's more recommended to use helper function migration_entry_to_page() to get the page via migration entry. We can also enjoy the PageLocked() check there. Link: https://lkml.kernel.org/r/20210318122722.13135-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Thomas Hellstrm (Intel) <thomas_os@shipmail.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: yuleixzhang <yulei.kernel@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-21mm/mmap.c: fix missing call to vm_unacct_memory in mmap_regionMiaohe Lin
[ Upstream commit 7f82f922319ede486540e8746769865b9508d2c2 ] Since the beginning, charged is set to 0 to avoid calling vm_unacct_memory twice because vm_unacct_memory will be called by above unmap_region. But since commit 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces"), unmap_region doesn't call vm_unacct_memory anymore. So charged shouldn't be set to 0 now otherwise the calling to paired vm_unacct_memory will be missed and leads to imbalanced account. Link: https://lkml.kernel.org/r/20220618082027.43391-1-linmiaohe@huawei.com Fixes: 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-21mm/mremap: hold the rmap lock in write mode when moving page table entries.Aneesh Kumar K.V
commit 97113eb39fa7972722ff490b947d8af023e1f6a2 upstream. To avoid a race between rmap walk and mremap, mremap does take_rmap_locks(). The lock was taken to ensure that rmap walk don't miss a page table entry due to PTE moves via move_pagetables(). The kernel does further optimization of this lock such that if we are going to find the newly added vma after the old vma, the rmap lock is not taken. This is because rmap walk would find the vmas in the same order and if we don't find the page table attached to older vma we would find it with the new vma which we would iterate later. As explained in commit eb66ae030829 ("mremap: properly flush TLB before releasing the page") mremap is special in that it doesn't take ownership of the page. The optimized version for PUD/PMD aligned mremap also doesn't hold the ptl lock. This can result in stale TLB entries as show below. This patch updates the rmap locking requirement in mremap to handle the race condition explained below with optimized mremap:: Optmized PMD move CPU 1 CPU 2 CPU 3 mremap(old_addr, new_addr) page_shrinker/try_to_unmap_one mmap_write_lock_killable() addr = old_addr lock(pte_ptl) lock(pmd_ptl) pmd = *old_pmd pmd_clear(old_pmd) flush_tlb_range(old_addr) *new_pmd = pmd *new_addr = 10; and fills TLB with new addr and old pfn unlock(pmd_ptl) ptep_clear_flush() old pfn is free. Stale TLB entry Optimized PUD move also suffers from a similar race. Both the above race condition can be fixed if we force mremap path to take rmap lock. Link: https://lkml.kernel.org/r/20210616045239.370802-7-aneesh.kumar@linux.ibm.com Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions") Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions") Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-huwbPw@mail.gmail.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [patch rewritten for backport since the code was refactored since] Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-21mm: Add kvrealloc()Dave Chinner
commit de2860f4636256836450c6543be744a50118fc66 upstream. During log recovery of an XFS filesystem with 64kB directory buffers, rebuilding a buffer split across two log records results in a memory allocation warning from krealloc like this: xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff) XFS (dm-0): Unmounting Filesystem XFS (dm-0): Mounting V5 Filesystem XFS (dm-0): Starting recovery (logdev: internal) ------------[ cut here ]------------ WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40 ..... RIP: 0010:get_page_from_freelist+0xdee/0xe40 Call Trace: ? complete+0x3f/0x50 __alloc_pages+0x16f/0x300 alloc_pages+0x87/0x110 kmalloc_order+0x2c/0x90 kmalloc_order_trace+0x1d/0x90 __kmalloc_track_caller+0x215/0x270 ? xlog_recover_add_to_cont_trans+0x63/0x1f0 krealloc+0x54/0xb0 xlog_recover_add_to_cont_trans+0x63/0x1f0 xlog_recovery_process_trans+0xc1/0xd0 xlog_recover_process_ophdr+0x86/0x130 xlog_recover_process_data+0x9f/0x160 xlog_recover_process+0xa2/0x120 xlog_do_recovery_pass+0x40b/0x7d0 ? __irq_work_queue_local+0x4f/0x60 ? irq_work_queue+0x3a/0x50 xlog_do_log_recovery+0x70/0x150 xlog_do_recover+0x38/0x1d0 xlog_recover+0xd8/0x170 xfs_log_mount+0x181/0x300 xfs_mountfs+0x4a1/0x9b0 xfs_fs_fill_super+0x3c0/0x7b0 get_tree_bdev+0x171/0x270 ? suffix_kstrtoint.constprop.0+0xf0/0xf0 xfs_fs_get_tree+0x15/0x20 vfs_get_tree+0x24/0xc0 path_mount+0x2f5/0xaf0 __x64_sys_mount+0x108/0x140 do_syscall_64+0x3a/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae Essentially, we are taking a multi-order allocation from kmem_alloc() (which has an open coded no fail, no warn loop) and then reallocating it out to 64kB using krealloc(__GFP_NOFAIL) and that is then triggering the above warning. This is a regression caused by converting this code from an open coded no fail/no warn reallocation loop to using __GFP_NOFAIL. What we actually need here is kvrealloc(), so that if contiguous page allocation fails we fall back to vmalloc() and we don't get nasty warnings happening in XFS. Fixes: 771915c4f688 ("xfs: remove kmem_realloc()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-03page_alloc: fix invalid watermark check on a negative valueJaewon Kim
commit 9282012fc0aa248b77a69f5eb802b67c5a16bb13 upstream. There was a report that a task is waiting at the throttle_direct_reclaim. The pgscan_direct_throttle in vmstat was increasing. This is a bug where zone_watermark_fast returns true even when the free is very low. The commit f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast") changed the watermark fast to consider highatomic reserve. But it did not handle a negative value case which can be happened when reserved_highatomic pageblock is bigger than the actual free. If watermark is considered as ok for the negative value, allocating contexts for order-0 will consume all free pages without direct reclaim, and finally free page may become depleted except highatomic free. Then allocating contexts may fall into throttle_direct_reclaim. This symptom may easily happen in a system where wmark min is low and other reclaimers like kswapd does not make free pages quickly. Handle the negative case by using MIN. Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com Fixes: f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast") Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Reported-by: GyeongHwan Hong <gh21.hong@samsung.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: <stable@vger.kerenl.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-29mm/mempolicy: fix uninit-value in mpol_rebind_policy()Wang Cheng
commit 018160ad314d75b1409129b2247b614a9f35894c upstream. mpol_set_nodemask()(mm/mempolicy.c) does not set up nodemask when pol->mode is MPOL_LOCAL. Check pol->mode before access pol->w.cpuset_mems_allowed in mpol_rebind_policy()(mm/mempolicy.c). BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:352 [inline] BUG: KMSAN: uninit-value in mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368 mpol_rebind_policy mm/mempolicy.c:352 [inline] mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368 cpuset_change_task_nodemask kernel/cgroup/cpuset.c:1711 [inline] cpuset_attach+0x787/0x15e0 kernel/cgroup/cpuset.c:2278 cgroup_migrate_execute+0x1023/0x1d20 kernel/cgroup/cgroup.c:2515 cgroup_migrate kernel/cgroup/cgroup.c:2771 [inline] cgroup_attach_task+0x540/0x8b0 kernel/cgroup/cgroup.c:2804 __cgroup1_procs_write+0x5cc/0x7a0 kernel/cgroup/cgroup-v1.c:520 cgroup1_tasks_write+0x94/0xb0 kernel/cgroup/cgroup-v1.c:539 cgroup_file_write+0x4c2/0x9e0 kernel/cgroup/cgroup.c:3852 kernfs_fop_write_iter+0x66a/0x9f0 fs/kernfs/file.c:296 call_write_iter include/linux/fs.h:2162 [inline] new_sync_write fs/read_write.c:503 [inline] vfs_write+0x1318/0x2030 fs/read_write.c:590 ksys_write+0x28b/0x510 fs/read_write.c:643 __do_sys_write fs/read_write.c:655 [inline] __se_sys_write fs/read_write.c:652 [inline] __x64_sys_write+0xdb/0x120 fs/read_write.c:652 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was created at: slab_post_alloc_hook mm/slab.h:524 [inline] slab_alloc_node mm/slub.c:3251 [inline] slab_alloc mm/slub.c:3259 [inline] kmem_cache_alloc+0x902/0x11c0 mm/slub.c:3264 mpol_new mm/mempolicy.c:293 [inline] do_set_mempolicy+0x421/0xb70 mm/mempolicy.c:853 kernel_set_mempolicy mm/mempolicy.c:1504 [inline] __do_sys_set_mempolicy mm/mempolicy.c:1510 [inline] __se_sys_set_mempolicy+0x44c/0xb60 mm/mempolicy.c:1507 __x64_sys_set_mempolicy+0xd8/0x110 mm/mempolicy.c:1507 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae KMSAN: uninit-value in mpol_rebind_task (2) https://syzkaller.appspot.com/bug?id=d6eb90f952c2a5de9ea718a1b873c55cb13b59dc This patch seems to fix below bug too. KMSAN: uninit-value in mpol_rebind_mm (2) https://syzkaller.appspot.com/bug?id=f2fecd0d7013f54ec4162f60743a2b28df40926b The uninit-value is pol->w.cpuset_mems_allowed in mpol_rebind_policy(). When syzkaller reproducer runs to the beginning of mpol_new(), mpol_new() mm/mempolicy.c do_mbind() mm/mempolicy.c kernel_mbind() mm/mempolicy.c `mode` is 1(MPOL_PREFERRED), nodes_empty(*nodes) is `true` and `flags` is 0. Then mode = MPOL_LOCAL; ... policy->mode = mode; policy->flags = flags; will be executed. So in mpol_set_nodemask(), mpol_set_nodemask() mm/mempolicy.c do_mbind() kernel_mbind() pol->mode is 4 (MPOL_LOCAL), that `nodemask` in `pol` is not initialized, which will be accessed in mpol_rebind_policy(). Link: https://lkml.kernel.org/r/20220512123428.fq3wofedp6oiotd4@ppc.localdomain Signed-off-by: Wang Cheng <wanngchenng@gmail.com> Reported-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com> Tested-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>