summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2020-03-27mm/hmm: remove pgmap checking for devmap pagesJason Gunthorpe
The checking boils down to some racy check if the pagemap is still available or not. Instead of checking this, rely entirely on the notifiers, if a pagemap is destroyed then all pages that belong to it must be removed from the tables and the notifiers triggered. Link: https://lore.kernel.org/r/20200327200021.29372-2-jgg@ziepe.ca Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: check the device private page owner in hmm_range_fault()Christoph Hellwig
hmm_range_fault() will succeed for any kind of device private memory, even if it doesn't belong to the calling entity. While nouveau has some crude checks for that, they are broken because they assume nouveau is the only user of device private memory. Fix this by passing in an expected pgmap owner in the hmm_range_fault structure. If a device_private page is found and doesn't match the owner then it is treated as an non-present and non-faultable page. This prevents a bug in amdgpu, where it doesn't know how to handle device_private pages, but hmm_range_fault would return them anyhow. Fixes: 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using ZONE_DEVICE") Link: https://lore.kernel.org/r/20200316193216.920734-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm: simplify device private page handling in hmm_range_faultChristoph Hellwig
Remove the HMM_PFN_DEVICE_PRIVATE flag, no driver has ever set this flag on input, and the only place that uses it on output can be trivially changed to use is_device_private_page(). This removes the ability to request that device_private pages are faulted back into system memory. Link: https://lore.kernel.org/r/20200316193216.920734-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm: handle multiple owners of device private pages in migrate_vmaChristoph Hellwig
Add a new src_owner field to struct migrate_vma. If the field is set, only device private pages with page->pgmap->owner equal to that field are migrated. If the field is not set only "normal" pages are migrated. Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU") Link: https://lore.kernel.org/r/20200316193216.920734-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26memremap: add an owner field to struct dev_pagemapChristoph Hellwig
Add a new opaque owner field to struct dev_pagemap, which will allow the hmm and migrate_vma code to identify who owns ZONE_DEVICE memory, and refuse to work on mappings not owned by the calling entity. Link: https://lore.kernel.org/r/20200316193216.920734-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm: merge hmm_vma_do_fault into into hmm_vma_walk_hole_Christoph Hellwig
There is no good reason for this split, as it just obsfucates the flow. Link: https://lore.kernel.org/r/20200316135310.899364-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: don't handle the non-fault case in hmm_vma_walk_hole_()Christoph Hellwig
Setting a pfns entry to NONE before returning -EBUSY is a bug that will cause corruption of the input flags on the next loop. There is just a single caller using hmm_vma_walk_hole_() for the non-fault case. Use hmm_pfns_fill() to fill the whole pfn array with zeroes in the only caller for the non-fault case and remove the non-fault path from hmm_vma_walk_hole_(). This avoids setting NONE before returning -EBUSY. Also rename the function to hmm_vma_fault() to better describe what it does. Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Link: https://lore.kernel.org/r/20200316135310.899364-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: simplify hmm_vma_walk_hugetlb_entry()Christoph Hellwig
Remove the rather confusing goto label and just handle the fault case directly in the branch checking for it. Link: https://lore.kernel.org/r/20200316135310.899364-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: remove the unused HMM_FAULT_ALLOW_RETRY flagChristoph Hellwig
The HMM_FAULT_ALLOW_RETRY isn't used anywhere in the tree. Remove it and the weird -EAGAIN handling where handle_mm_fault() drops the mmap_sem. Link: https://lore.kernel.org/r/20200316135310.899364-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: do not check pmd_protnone twice in hmm_vma_handle_pmd()Jason Gunthorpe
pmd_to_hmm_pfn_flags() already checks it and makes the cpu flags 0. If no fault is requested then the pfns should be returned with the not valid flags. It should not unconditionally fault if faulting is not requested. Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handlingJason Gunthorpe
Currently if a special PTE is encountered hmm_range_fault() immediately returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing uses). EFAULT should only be returned after testing with hmm_pte_need_fault(). Also pte_devmap() and pte_special() are exclusive, and there is no need to check IS_ENABLED, pte_special() is stubbed out to return false on unsupported architectures. Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pagesJason Gunthorpe
hmm_range_fault() should never return 0 if the caller requested a valid page, but the pfns output for that page would be HMM_PFN_ERROR. hmm_pte_need_fault() must always be called before setting HMM_PFN_ERROR to detect if the page is in faulting mode or not. Fix two cases in hmm_vma_walk_pmd() and reorganize some of the duplicated code. Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd") Fixes: da4c3c735ea4 ("mm/hmm/mirror: helper to snapshot CPU page table") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: reorganize how !pte_present is handled in hmm_vma_handle_pte()Jason Gunthorpe
The intention with this code is to determine if the caller required the pages to be valid, and if so, then take some action to make them valid. The action varies depending on the page type. In all cases, if the caller doesn't ask for the page, then hmm_range_fault() should not return an error. Revise the implementation to be clearer, and fix some bugs: - hmm_pte_need_fault() must always be called before testing fault or write_fault otherwise the defaults of false apply and the if()'s don't work. This was missed on the is_migration_entry() branch - -EFAULT should not be returned unless hmm_pte_need_fault() indicates fault is required - ie snapshotting should not fail. - For !pte_present() the cpu_flags are always 0, except in the special case of is_device_private_entry(), calling pte_to_hmm_pfn_flags() is confusing. Reorganize the flow so that it always follows the pattern of calling hmm_pte_need_fault() and then checking fault || write_fault. Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: add missing call to hmm_range_need_fault() before returning EFAULTJason Gunthorpe
All return paths that do EFAULT must call hmm_range_need_fault() to determine if the user requires this page to be valid. If the page cannot be made valid if the user later requires it, due to vma flags in this case, then the return should be HMM_PFN_ERROR. Fixes: a3e0d41c2b1f ("mm/hmm: improve driver API to work and wait over a range") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: add missing pfns set to hmm_vma_walk_pmd()Jason Gunthorpe
All success exit paths from the walker functions must set the pfns array. A migration entry with no required fault is a HMM_PFN_NONE return, just like the pte case. Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: do not call hmm_vma_walk_hole() while holding a spinlockJason Gunthorpe
This eventually calls into handle_mm_fault() which is a sleeping function. Release the lock first. hmm_vma_walk_hole() does not touch the contents of the PUD, so it does not need the lock. Fixes: 3afc423632a1 ("mm: pagewalk: add p4d_entry() and pgd_entry()") Cc: Steven Price <steven.price@arm.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/hmm: add missing unmaps of the ptep during hmm_vma_handle_pte()Jason Gunthorpe
Many of the direct returns of error skipped doing the pte_unmap(). All non zero exit paths must unmap the pte. The pte_unmap() is split unnaturally like this because some of the error exit paths trigger a sleep and must release the lock before sleeping. Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Fixes: 53f5c3f489ec ("mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()") Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-03-26mm/mremap: Add comment explaining the untagging behaviour of mremap()Will Deacon
Commit dcde237319e6 ("mm: Avoid creating virtual address aliases in brk()/mmap()/mremap()") changed mremap() so that only the 'old' address is untagged, leaving the 'new' address in the form it was passed from userspace. This prevents the unexpected creation of aliasing virtual mappings in userspace, but looks a bit odd when you read the code. Add a comment justifying the untagging behaviour in mremap(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-03-25mm: docs: Fix a comment in process_vm_rw_coreBernd Edlinger
This removes a duplicate "a" in the comment in process_vm_rw_core. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-24mm: Add vmf_insert_pfn_xxx_prot() for huge page-table entriesThomas Hellstrom (VMware)
For graphics drivers needing to modify the page-protection, add huge page-table entries counterparts to vmf_insert_pfn_prot(). Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2020-03-24mm: Split huge pages on write-notify or COWThomas Hellstrom (VMware)
The functions wp_huge_pmd() and wp_huge_pud() currently relies on the huge_fault() callback to split huge page table entries if needed. However for module users that requires export of the split_huge_xxx() functionality which may be undesired. Instead split pre-existing huge page-table entries on VM_FAULT_FALLBACK return. We currently only do COW and write-notify on the PTE level, so if the huge_fault() handler returns VM_FAULT_FALLBACK on wp faults, split the huge pages and page-table entries. Also do this for huge PUDs if there is no huge_fault() handler and the vma is not anonymous, similar to how it's done for PMDs. Note that fs/dax.c still does the splitting in the huge_fault() handler, but as huge_fault() A follow-up patch can remove the dax.c split_huge_pmd() if needed. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2020-03-24mm: Introduce vma_is_special_hugeThomas Hellstrom (VMware)
For VM_PFNMAP and VM_MIXEDMAP vmas that want to support transhuge pages and -page table entries, introduce vma_is_special_huge() that takes the same codepaths as vma_is_dax(). The use of "special" follows the definition in memory.c, vm_normal_page(): "Special" mappings do not wish to be associated with a "struct page" (either it doesn't exist, or it exists but they don't want to touch it) For PAGE_SIZE pages, "special" is determined per page table entry to be able to deal with COW pages. But since we don't have huge COW pages, we can classify a vma as either "special huge" or "normal huge". Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>
2020-03-21x86/mm: split vmalloc_sync_all()Joerg Roedel
Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the vunmap() code-path. While this change was necessary to maintain correctness on x86-32-pae kernels, it also adds additional cycles for architectures that don't need it. Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported severe performance regressions in micro-benchmarks because it now also calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But the vmalloc_sync_all() implementation on x86-64 is only needed for newly created mappings. To avoid the unnecessary work on x86-64 and to gain the performance back, split up vmalloc_sync_all() into two functions: * vmalloc_sync_mappings(), and * vmalloc_sync_unmappings() Most call-sites to vmalloc_sync_all() only care about new mappings being synchronized. The only exception is the new call-site added in the above mentioned commit. Shile Zhang directed us to a report of an 80% regression in reaim throughput. Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES] Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/ Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm, slub: prevent kmalloc_node crashes and memory leaksVlastimil Babka
Sachin reports [1] a crash in SLUB __slab_alloc(): BUG: Kernel NULL pointer dereference on read at 0x000073b0 Faulting instruction address: 0xc0000000003d55f4 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1 NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000 REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004844 XER: 00000000 CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1 GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500 GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620 GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000 GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000 GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002 GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122 GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8 GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180 NIP ___slab_alloc+0x1f4/0x760 LR __slab_alloc+0x34/0x60 Call Trace: ___slab_alloc+0x334/0x760 (unreliable) __slab_alloc+0x34/0x60 __kmalloc_node+0x110/0x490 kvmalloc_node+0x58/0x110 mem_cgroup_css_online+0x108/0x270 online_css+0x48/0xd0 cgroup_apply_control_enable+0x2ec/0x4d0 cgroup_mkdir+0x228/0x5f0 kernfs_iop_mkdir+0x90/0xf0 vfs_mkdir+0x110/0x230 do_mkdirat+0xb0/0x1a0 system_call+0x5c/0x68 This is a PowerPC platform with following NUMA topology: available: 2 nodes (0-1) node 0 cpus: node 0 size: 0 MB node 0 free: 0 MB node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 1 size: 35247 MB node 1 free: 30907 MB node distances: node 0 1 0: 10 40 1: 40 10 possible numa nodes: 0-31 This only happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node" [2] which effectively calls kmalloc_node for each possible node. SLUB however only allocates kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on node_to_mem_node to return such valid node for other nodes since commit a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node"). This is however not true in this configuration where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up accessing non-allocated kmem_cache_node. A related issue was reported by Bharata (originally by Ramachandran) [3] where a similar PowerPC configuration, but with mainline kernel without patch [2] ends up allocating large amounts of pages by kmalloc-1k kmalloc-512. This seems to have the same underlying issue with node_to_mem_node() not behaving as expected, and might probably also lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4]. This patch should fix both issues by not relying on node_to_mem_node() anymore and instead simply falling back to NUMA_NO_NODE, when kmalloc_node(node) is attempted for a node that's not online, or has no usable memory. The "usable memory" condition is also changed from node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly the condition that SLUB uses to allocate kmem_cache_node structures. The check in get_partial() is removed completely, as the checks in ___slab_alloc() are now sufficient to prevent get_partial() being reached with an invalid node. [1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/ [2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/ [3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/ [4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/ Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christopher Lameter <cl@linux.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm/mmu_notifier: silence PROVE_RCU_LIST warningsQian Cai
It is safe to traverse mm->notifier_subscriptions->list either under SRCU read lock or mm->notifier_subscriptions->lock using hlist_for_each_entry_rcu(). Silence the PROVE_RCU_LIST false positives, for example, WARNING: suspicious RCU usage ----------------------------- mm/mmu_notifier.c:484 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by libvirtd/802: #0: ffff9321e3f58148 (&mm->mmap_sem#2){++++}, at: do_mprotect_pkey+0xe1/0x3e0 #1: ffffffff91ae6160 (mmu_notifier_invalidate_range_start){+.+.}, at: change_p4d_range+0x5fa/0x800 #2: ffffffff91ae6e08 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x178/0x460 stack backtrace: CPU: 7 PID: 802 Comm: libvirtd Tainted: G I 5.6.0-rc6-next-20200317+ #2 Hardware name: HP ProLiant BL460c Gen8, BIOS I31 11/02/2014 Call Trace: dump_stack+0xa4/0xfe lockdep_rcu_suspicious+0xeb/0xf5 __mmu_notifier_invalidate_range_start+0x3ff/0x460 change_p4d_range+0x746/0x800 change_protection+0x1df/0x300 mprotect_fixup+0x245/0x3e0 do_mprotect_pkey+0x23b/0x3e0 __x64_sys_mprotect+0x51/0x70 do_syscall_64+0x91/0xae8 entry_SYSCALL_64_after_hwframe+0x49/0xb3 Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Link: http://lkml.kernel.org/r/20200317175640.2047-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm: do not allow MADV_PAGEOUT for CoW pagesMichal Hocko
Jann has brought up a very interesting point [1]. While shared pages are excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed that way. This can lead to all sorts of hard to debug problems. E.g. performance problems outlined by Daniel [2]. There are runtime environments where there is a substantial memory shared among security domains via CoW memory and a easy to reclaim way of that memory, which MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation in for the parent process which might be more privileged or even open side channel attacks. The feasibility of the latter is not really clear to me TBH but there is no real reason for exposure at this stage. It seems there is no real use case to depend on reclaiming CoW memory via madvise at this stage so it is much easier to simply disallow it and this is what this patch does. Put it simply MADV_{PAGEOUT,COLD} can operate only on the exclusively owned memory which is a straightforward semantic. [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm, memcg: throttle allocators based on ancestral memory.highChris Down
Prior to this commit, we only directly check the affected cgroup's memory.high against its usage. However, it's possible that we are being reclaimed as a result of hitting an ancestor memory.high and should be penalised based on that, instead. This patch changes memory.high overage throttling to use the largest overage in its ancestors when considering how many penalty jiffies to charge. This makes sure that we penalise poorly behaving cgroups in the same way regardless of at what level of the hierarchy memory.high was breached. Fixes: 0e4b01df8659 ("mm, memcg: throttle allocators when failing reclaim over memory.high") Reported-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> [5.4.x+] Link: http://lkml.kernel.org/r/8cd132f84bd7e16cdb8fde3378cdbf05ba00d387.1584036142.git.chris@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm, memcg: fix corruption on 64-bit divisor in memory.high throttlingChris Down
Commit 0e4b01df8659 had a bunch of fixups to use the right division method. However, it seems that after all that it still wasn't right -- div_u64 takes a 32-bit divisor. The headroom is still large (2^32 pages), so on mundane systems you won't hit this, but this should definitely be fixed. Fixes: 0e4b01df8659 ("mm, memcg: throttle allocators when failing reclaim over memory.high") Reported-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: <stable@vger.kernel.org> [5.4.x+] Link: http://lkml.kernel.org/r/80780887060514967d414b3cd91f9a316a16ab98.1584036142.git.chris@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP caseBaoquan He
In section_deactivate(), pfn_to_page() doesn't work any more after ms->section_mem_map is resetting to NULL in SPARSEMEM|!VMEMMAP case. It causes a hot remove failure: kernel BUG at mm/page_alloc.c:4806! invalid opcode: 0000 [#1] SMP PTI CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G W 5.5.0-next-20200205+ #340 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Workqueue: kacpi_hotplug acpi_hotplug_work_fn RIP: 0010:free_pages+0x85/0xa0 Call Trace: __remove_pages+0x99/0xc0 arch_remove_memory+0x23/0x4d try_remove_memory+0xc8/0x130 __remove_memory+0xa/0x11 acpi_memory_device_remove+0x72/0x100 acpi_bus_trim+0x55/0x90 acpi_device_hotplug+0x2eb/0x3d0 acpi_hotplug_work_fn+0x1a/0x30 process_one_work+0x1a7/0x370 worker_thread+0x30/0x380 kthread+0x112/0x130 ret_from_fork+0x35/0x40 Let's move the ->section_mem_map resetting after depopulate_section_memmap() to fix it. [akpm@linux-foundation.org: remove unneeded initialization, per David] Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_eventChunguang Xu
An eventfd monitors multiple memory thresholds of the cgroup, closes them, the kernel deletes all events related to this eventfd. Before all events are deleted, another eventfd monitors the memory threshold of this cgroup, leading to a crash: BUG: kernel NULL pointer dereference, address: 0000000000000004 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 800000033058e067 P4D 800000033058e067 PUD 3355ce067 PMD 0 Oops: 0002 [#1] SMP PTI CPU: 2 PID: 14012 Comm: kworker/2:6 Kdump: loaded Not tainted 5.6.0-rc4 #3 Hardware name: LENOVO 20AWS01K00/20AWS01K00, BIOS GLET70WW (2.24 ) 05/21/2014 Workqueue: events memcg_event_remove RIP: 0010:__mem_cgroup_usage_unregister_event+0xb3/0x190 RSP: 0018:ffffb47e01c4fe18 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff8bb223a8a000 RCX: 0000000000000001 RDX: 0000000000000001 RSI: ffff8bb22fb83540 RDI: 0000000000000001 RBP: ffffb47e01c4fe48 R08: 0000000000000000 R09: 0000000000000010 R10: 000000000000000c R11: 071c71c71c71c71c R12: ffff8bb226aba880 R13: ffff8bb223a8a480 R14: 0000000000000000 R15: 0000000000000000 FS:  0000000000000000(0000) GS:ffff8bb242680000(0000) knlGS:0000000000000000 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000004 CR3: 000000032c29c003 CR4: 00000000001606e0 Call Trace: memcg_event_remove+0x32/0x90 process_one_work+0x172/0x380 worker_thread+0x49/0x3f0 kthread+0xf8/0x130 ret_from_fork+0x35/0x40 CR2: 0000000000000004 We can reproduce this problem in the following ways: 1. We create a new cgroup subdirectory and a new eventfd, and then we monitor multiple memory thresholds of the cgroup through this eventfd. 2. closing this eventfd, and __mem_cgroup_usage_unregister_event () will be called multiple times to delete all events related to this eventfd. The first time __mem_cgroup_usage_unregister_event() is called, the kernel will clear all items related to this eventfd in thresholds-> primary. Since there is currently only one eventfd, thresholds-> primary becomes empty, so the kernel will set thresholds-> primary and hresholds-> spare to NULL. If at this time, the user creates a new eventfd and monitor the memory threshold of this cgroup, kernel will re-initialize thresholds-> primary. Then when __mem_cgroup_usage_unregister_event () is called for the second time, because thresholds-> primary is not empty, the system will access thresholds-> spare, but thresholds-> spare is NULL, which will trigger a crash. In general, the longer it takes to delete all events related to this eventfd, the easier it is to trigger this problem. The solution is to check whether the thresholds associated with the eventfd has been cleared when deleting the event. If so, we do nothing. [akpm@linux-foundation.org: fix comment, per Kirill] Fixes: 907860ed381a ("cgroups: make cftype.unregister_event() void-returning") Signed-off-by: Chunguang Xu <brookxu@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/077a6f67-aefa-4591-efec-f2f3af2b0b02@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-18mm: slub: be more careful about the double cmpxchg of freelistLinus Torvalds
This is just a cleanup addition to Jann's fix to properly update the transaction ID for the slub slowpath in commit fd4d9c7d0c71 ("mm: slub: add missing TID bump.."). The transaction ID is what protects us against any concurrent accesses, but we should really also make sure to make the 'freelist' comparison itself always use the same freelist value that we then used as the new next free pointer. Jann points out that if we do all of this carefully, we could skip the transaction ID update for all the paths that only remove entries from the lists, and only update the TID when adding entries (to avoid the ABA issue with cmpxchg and list handling re-adding a previously seen value). But this patch just does the "make sure to cmpxchg the same value we used" rather than then try to be clever. Acked-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-18mm: slub: add missing TID bump in kmem_cache_alloc_bulk()Jann Horn
When kmem_cache_alloc_bulk() attempts to allocate N objects from a percpu freelist of length M, and N > M > 0, it will first remove the M elements from the percpu freelist, then call ___slab_alloc() to allocate the next element and repopulate the percpu freelist. ___slab_alloc() can re-enable IRQs via allocate_slab(), so the TID must be bumped before ___slab_alloc() to properly commit the freelist head change. Fix it by unconditionally bumping c->tid when entering the slowpath. Cc: stable@vger.kernel.org Fixes: ebe909e0fdb3 ("slub: improve bulk alloc strategy") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-16kernfs: Add removed_size out param for simple_xattr_setDaniel Xu
This helps set up size accounting in the next commit. Without this out param, it's difficult to find out the removed xattr size without taking a lock for longer and walking the xattr linked list twice. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: "It looks like a decent sized set of fixes, but a lot of these are one liner off-by-one and similar type changes: 1) Fix netlink header pointer to calcular bad attribute offset reported to user. From Pablo Neira Ayuso. 2) Don't double clear PHY interrupts when ->did_interrupt is set, from Heiner Kallweit. 3) Add missing validation of various (devlink, nl802154, fib, etc.) attributes, from Jakub Kicinski. 4) Missing *pos increments in various netfilter seq_next ops, from Vasily Averin. 5) Missing break in of_mdiobus_register() loop, from Dajun Jin. 6) Don't double bump tx_dropped in veth driver, from Jiang Lidong. 7) Work around FMAN erratum A050385, from Madalin Bucur. 8) Make sure ARP header is pulled early enough in bonding driver, from Eric Dumazet. 9) Do a cond_resched() during multicast processing of ipvlan and macvlan, from Mahesh Bandewar. 10) Don't attach cgroups to unrelated sockets when in interrupt context, from Shakeel Butt. 11) Fix tpacket ring state management when encountering unknown GSO types. From Willem de Bruijn. 12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend() only in the suspend context. From Heiner Kallweit" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits) net: systemport: fix index check to avoid an array out of bounds access tc-testing: add ETS scheduler to tdc build configuration net: phy: fix MDIO bus PM PHY resuming net: hns3: clear port base VLAN when unload PF net: hns3: fix RMW issue for VLAN filter switch net: hns3: fix VF VLAN table entries inconsistent issue net: hns3: fix "tc qdisc del" failed issue taprio: Fix sending packets without dequeueing them net: mvmdio: avoid error message for optional IRQ net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register net: memcg: fix lockdep splat in inet_csk_accept() s390/qeth: implement smarter resizing of the RX buffer pool s390/qeth: refactor buffer pool code s390/qeth: use page pointers to manage RX buffer pool seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed net/packet: tpacket_rcv: do not increment ring index on drop sxgbe: Fix off by one in samsung driver strncpy size arg net: caif: Add lockdep expression to RCU traversal primitive MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer ...
2020-03-10net: memcg: late association of sock to memcgShakeel Butt
If a TCP socket is allocated in IRQ context or cloned from unassociated (i.e. not associated to a memcg) in IRQ context then it will remain unassociated for its whole life. Almost half of the TCPs created on the system are created in IRQ context, so, memory used by such sockets will not be accounted by the memcg. This issue is more widespread in cgroup v1 where network memory accounting is opt-in but it can happen in cgroup v2 if the source socket for the cloning was created in root memcg. To fix the issue, just do the association of the sockets at the accept() time in the process context and then force charge the memory buffer already used and reserved by the socket. Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10cgroup: memcg: net: do not associate sock with unrelated cgroupShakeel Butt
We are testing network memory accounting in our setup and noticed inconsistent network memory usage and often unrelated cgroups network usage correlates with testing workload. On further inspection, it seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in irq context specially for cgroup v1. mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context and kind of assumes that this can only happen from sk_clone_lock() and the source sock object has already associated cgroup. However in cgroup v1, where network memory accounting is opt-in, the source sock can be unassociated with any cgroup and the new cloned sock can get associated with unrelated interrupted cgroup. Cgroup v2 can also suffer if the source sock object was created by process in the root cgroup or if sk_alloc() is called in irq context. The fix is to just do nothing in interrupt. WARNING: Please note that about half of the TCP sockets are allocated from the IRQ context, so, memory used by such sockets will not be accouted by the memcg. The stack trace of mem_cgroup_sk_alloc() from IRQ-context: CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1 Hardware name: ... Call Trace: <IRQ> dump_stack+0x57/0x75 mem_cgroup_sk_alloc+0xe9/0xf0 sk_clone_lock+0x2a7/0x420 inet_csk_clone_lock+0x1b/0x110 tcp_create_openreq_child+0x23/0x3b0 tcp_v6_syn_recv_sock+0x88/0x730 tcp_check_req+0x429/0x560 tcp_v6_rcv+0x72d/0xa40 ip6_protocol_deliver_rcu+0xc9/0x400 ip6_input+0x44/0xd0 ? ip6_protocol_deliver_rcu+0x400/0x400 ip6_rcv_finish+0x71/0x80 ipv6_rcv+0x5b/0xe0 ? ip6_sublist_rcv+0x2e0/0x2e0 process_backlog+0x108/0x1e0 net_rx_action+0x26b/0x460 __do_softirq+0x104/0x2a6 do_softirq_own_stack+0x2a/0x40 </IRQ> do_softirq.part.19+0x40/0x50 __local_bh_enable_ip+0x51/0x60 ip6_finish_output2+0x23d/0x520 ? ip6table_mangle_hook+0x55/0x160 __ip6_finish_output+0xa1/0x100 ip6_finish_output+0x30/0xd0 ip6_output+0x73/0x120 ? __ip6_finish_output+0x100/0x100 ip6_xmit+0x2e3/0x600 ? ipv6_anycast_cleanup+0x50/0x50 ? inet6_csk_route_socket+0x136/0x1e0 ? skb_free_head+0x1e/0x30 inet6_csk_xmit+0x95/0xf0 __tcp_transmit_skb+0x5b4/0xb20 __tcp_send_ack.part.60+0xa3/0x110 tcp_send_ack+0x1d/0x20 tcp_rcv_state_process+0xe64/0xe80 ? tcp_v6_connect+0x5d1/0x5f0 tcp_v6_do_rcv+0x1b1/0x3f0 ? tcp_v6_do_rcv+0x1b1/0x3f0 __release_sock+0x7f/0xd0 release_sock+0x30/0xa0 __inet_stream_connect+0x1c3/0x3b0 ? prepare_to_wait+0xb0/0xb0 inet_stream_connect+0x3b/0x60 __sys_connect+0x101/0x120 ? __sys_getsockopt+0x11b/0x140 __x64_sys_connect+0x1a/0x20 do_syscall_64+0x51/0x200 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The stack trace of mem_cgroup_sk_alloc() from IRQ-context: Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking") Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-06mm, hotplug: fix page online with DEBUG_PAGEALLOC compiled but not enabledVlastimil Babka
Commit cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC") fixed memory hotplug with debug_pagealloc enabled, where onlining a page goes through page freeing, which removes the direct mapping. Some arches don't like when the page is not mapped in the first place, so generic_online_page() maps it first. This is somewhat wasteful, but better than special casing page freeing fast paths. The commit however missed that DEBUG_PAGEALLOC configured doesn't mean it's actually enabled. One has to test debug_pagealloc_enabled() since 031bc5743f15 ("mm/debug-pagealloc: make debug-pagealloc boottime configurable"), or alternatively debug_pagealloc_enabled_static() since 8e57f8acbbd1 ("mm, debug_pagealloc: don't rely on static keys too early"), but this is not done. As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled will crash: Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 0000000000000000 TEID: 0000000000000483 Fault in home space mode while using kernel ASCE. AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d Oops: 0004 ilc:2 [#1] SMP CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased) Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100 0000000000000001 0000000000000000 0000000000000002 0000000000000100 0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000 000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20 Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1 0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3 >0000001ecd281b9e: 94fb5006 ni 6(%r5),251 0000001ecd281ba2: 41505008 la %r5,8(%r5) 0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e 0000001ecd281bac: 1a07 ar %r0,%r7 0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e Call Trace: [<0000001ecd281b9e>] __kernel_map_pages+0x166/0x188 [<0000001ecd4d9516>] online_pages_range+0xf6/0x128 [<0000001ecd2a8186>] walk_system_ram_range+0x7e/0xd8 [<0000001ecda28aae>] online_pages+0x2fe/0x3f0 [<0000001ecd7d02a6>] memory_subsys_online+0x8e/0xc0 [<0000001ecd7add42>] device_online+0x5a/0xc8 [<0000001ecd7d0430>] state_store+0x88/0x118 [<0000001ecd5b9f62>] kernfs_fop_write+0xc2/0x200 [<0000001ecd5064b6>] vfs_write+0x176/0x1e0 [<0000001ecd50676a>] ksys_write+0xa2/0x100 [<0000001ecda315d4>] system_call+0xd8/0x2c8 Fix this by checking debug_pagealloc_enabled_static() before calling kernel_map_pages(). Backports for kernel before 5.5 should use debug_pagealloc_enabled() instead. Also add comments. Fixes: cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC") Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-06mm/z3fold.c: do not include rwlock.h directlySebastian Andrzej Siewior
rwlock.h should not be included directly. Instead linux/splinlock.h should be included. One thing it does is to break the RT build. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200224133631.1510569-1-bigeasy@linutronix.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-06mm: avoid data corruption on CoW fault into PFN-mapped VMAKirill A. Shutemov
Jeff Moyer has reported that one of xfstests triggers a warning when run on DAX-enabled filesystem: WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50 ... wp_page_copy+0x98c/0xd50 (unreliable) do_wp_page+0xd8/0xad0 __handle_mm_fault+0x748/0x1b90 handle_mm_fault+0x120/0x1f0 __do_page_fault+0x240/0xd70 do_page_fault+0x38/0xd0 handle_page_fault+0x10/0x30 The warning happens on failed __copy_from_user_inatomic() which tries to copy data into a CoW page. This happens because of race between MADV_DONTNEED and CoW page fault: CPU0 CPU1 handle_mm_fault() do_wp_page() wp_page_copy() do_wp_page() madvise(MADV_DONTNEED) zap_page_range() zap_pte_range() ptep_get_and_clear_full() <TLB flush> __copy_from_user_inatomic() sees empty PTE and fails WARN_ON_ONCE(1) clear_page() The solution is to re-try __copy_from_user_inatomic() under PTL after checking that PTE is matches the orig_pte. The second copy attempt can still fail, like due to non-readable PTE, but there's nothing reasonable we can do about, except clearing the CoW page. Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Jeff Moyer <jmoyer@redhat.com> Cc: <stable@vger.kernel.org> Cc: Justin He <Justin.He@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-06mm: fix possible PMD dirty bit lost in set_pmd_migration_entry()Huang Ying
In set_pmd_migration_entry(), pmdp_invalidate() is used to change PMD atomically. But the PMD is read before that with an ordinary memory reading. If the THP (transparent huge page) is written between the PMD reading and pmdp_invalidate(), the PMD dirty bit may be lost, and cause data corruption. The race window is quite small, but still possible in theory, so need to be fixed. The race is fixed via using the return value of pmdp_invalidate() to get the original content of PMD, which is a read/modify/write atomic operation. So no THP writing can occur in between. The race has been introduced when the THP migration support is added in the commit 616b8371539a ("mm: thp: enable thp migration in generic path"). But this fix depends on the commit d52605d7cb30 ("mm: do not lose dirty and accessed bits in pmdp_invalidate()"). So it's easy to be backported after v4.16. But the race window is really small, so it may be fine not to backport the fix at all. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Link: http://lkml.kernel.org/r/20200220075220.2327056-1-ying.huang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-06mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking ↵Mel Gorman
page tables prot_numa : A user reported a bug against a distribution kernel while running a : proprietary workload described as "memory intensive that is not swapping" : that is expected to apply to mainline kernels. The workload is : read/write/modifying ranges of memory and checking the contents. They : reported that within a few hours that a bad PMD would be reported followed : by a memory corruption where expected data was all zeros. A partial : report of the bad PMD looked like : : [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2) : [ 5195.341184] ------------[ cut here ]------------ : [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35! : .... : [ 5195.410033] Call Trace: : [ 5195.410471] [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930 : [ 5195.410716] [<ffffffff811d4be8>] change_prot_numa+0x18/0x30 : [ 5195.410918] [<ffffffff810adefe>] task_numa_work+0x1fe/0x310 : [ 5195.411200] [<ffffffff81098322>] task_work_run+0x72/0x90 : [ 5195.411246] [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2 : [ 5195.411494] [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40 : [ 5195.411739] [<ffffffff815e56af>] retint_user+0x8/0x10 : : Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD : was a false detection. The bug does not trigger if automatic NUMA : balancing or transparent huge pages is disabled. : : The bug is due a race in change_pmd_range between a pmd_trans_huge and : pmd_nond_or_clear_bad check without any locks held. During the : pmd_trans_huge check, a parallel protection update under lock can have : cleared the PMD and filled it with a prot_numa entry between the transhuge : check and the pmd_none_or_clear_bad check. : : While this could be fixed with heavy locking, it's only necessary to make : a copy of the PMD on the stack during change_pmd_range and avoid races. A : new helper is created for this as the check if quite subtle and the : existing similar helpful is not suitable. This passed 154 hours of : testing (usually triggers between 20 minutes and 24 hours) without : detecting bad PMDs or corruption. A basic test of an autonuma-intensive : workload showed no significant change in behaviour. Although Mel withdrew the patch on the face of LKML comment https://lkml.org/lkml/2017/4/10/922 the race window aforementioned is still open, and we have reports of Linpack test reporting bad residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD. host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7) .... host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512 host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096 The issue is observed on a v4.18-based distribution kernel, but the race window is expected to be applicable to mainline kernels, as well. [akpm@linux-foundation.org: fix comment typo, per Rafael] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-24Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull tmpfs fix from Al Viro: "Regression from fs_parse series this cycle..." * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: tmpfs: deny and force are not huge mount options
2020-02-21Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "It's all straightforward apart from the changes to mmap()/mremap() in relation to their handling of address arguments from userspace with non-zero tag bits in the upper byte. The change to brk() is necessary to fix a nasty user-visible regression in malloc(), but we tightened up mmap() and mremap() at the same time because they also allow the user to create virtual aliases by accident. It's much less likely than brk() to matter in practice, but enforcing the principle of "don't permit the creation of mappings using tagged addresses" leads to a straightforward ABI without having to worry about the "but what if a crazy program did foo?" aspect of things. Summary: - Fix regression in malloc() caused by ignored address tags in brk() - Add missing brackets around argument to untagged_addr() macro - Fix clang build when using binutils assembler - Fix silly typo in virtual memory map documentation" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: mm: Avoid creating virtual address aliases in brk()/mmap()/mremap() docs: arm64: fix trivial spelling enought to enough in memory.rst arm64: memory: Add missing brackets to untagged_addr() macro arm64: lse: Fix LSE atomics with LLVM
2020-02-21mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEMWei Yang
When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page() doesn't work before sparse_init_one_section() is called. This leads to a crash when hotplug memory: BUG: unable to handle page fault for address: 0000000006400000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G W 5.5.0-next-20200205+ #343 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Workqueue: kacpi_hotplug acpi_hotplug_work_fn RIP: 0010:__memset+0x24/0x30 Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3 RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87 RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000 RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000 RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000 R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000 R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280 FS: 0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: sparse_add_section+0x1c9/0x26a __add_pages+0xbf/0x150 add_pages+0x12/0x60 add_memory_resource+0xc8/0x210 __add_memory+0x62/0xb0 acpi_memory_device_add+0x13f/0x300 acpi_bus_attach+0xf6/0x200 acpi_bus_scan+0x43/0x90 acpi_device_hotplug+0x275/0x3d0 acpi_hotplug_work_fn+0x1a/0x30 process_one_work+0x1a7/0x370 worker_thread+0x30/0x380 kthread+0x112/0x130 ret_from_fork+0x35/0x40 We should use memmap as it did. On x86 the impact is limited to x86_32 builds, or x86_64 configurations that override the default setting for SPARSEMEM_VMEMMAP. Other memory hotplug archs (arm64, ia64, and ppc) also default to SPARSEMEM_VMEMMAP=y. [dan.j.williams@intel.com: changelog update] {rppt@linux.ibm.com: changelog update] Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-21mm/vmscan.c: don't round up scan size for online memory cgroupGavin Shan
Commit 68600f623d69 ("mm: don't miss the last page because of round-off error") makes the scan size round up to @denominator regardless of the memory cgroup's state, online or offline. This affects the overall reclaiming behavior: the corresponding LRU list is eligible for reclaiming only when its size logically right shifted by @sc->priority is bigger than zero in the former formula. For example, the inactive anonymous LRU list should have at least 0x4000 pages to be eligible for reclaiming when we have 60/12 for swappiness/priority and without taking scan/rotation ratio into account. After the roundup is applied, the inactive anonymous LRU list becomes eligible for reclaiming when its size is bigger than or equal to 0x1000 in the same condition. (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1 ((0x1000 >> 12) * 60) + 200) / (60 + 140 + 1) = 1 aarch64 has 512MB huge page size when the base page size is 64KB. The memory cgroup that has a huge page is always eligible for reclaiming in that case. The reclaiming is likely to stop after the huge page is reclaimed, meaing the further iteration on @sc->priority and the silbing and child memory cgroups will be skipped. The overall behaviour has been changed. This fixes the issue by applying the roundup to offlined memory cgroups only, to give more preference to reclaim memory from offlined memory cgroup. It sounds reasonable as those memory is unlikedly to be used by anyone. The issue was found by starting up 8 VMs on a Ampere Mustang machine, which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and 2GB memory. It took 264 seconds for all VMs to be completely up and 784MB swap is consumed after that. With this patch applied, it took 236 seconds and 60MB swap to do same thing. So there is 10% performance improvement for my case. Note that KSM is disable while THP is enabled in the testing. total used free shared buff/cache available Mem: 16196 10065 2049 16 4081 3749 Swap: 8175 784 7391 total used free shared buff/cache available Mem: 16196 11324 3656 24 1215 2936 Swap: 8175 60 8115 Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error") Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> [4.20+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-21mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps()Vasily Averin
for_each_mem_cgroup() increases css reference counter for memory cgroup and requires to use mem_cgroup_iter_break() if the walk is cancelled. Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com Fixes: 0a4465d34028 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-21mm/swapfile.c: fix a comment in sys_swapon()Christoph Hellwig
claim_swapfile now always takes i_rwsem. Link: http://lkml.kernel.org/r/20200114161225.309792-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-20mm/memremap_pages: Introduce memremap_compat_align()Dan Williams
The "sub-section memory hotplug" facility allows memremap_pages() users like libnvdimm to compensate for hardware platforms like x86 that have a section size larger than their hardware memory mapping granularity. The compensation that sub-section support affords is being tolerant of physical memory resources shifting by units smaller (64MiB on x86) than the memory-hotplug section size (128 MiB). Where the platform physical-memory mapping granularity is limited by the number and capability of address-decode-registers in the memory controller. While the sub-section support allows memremap_pages() to operate on sub-section (2MiB) granularity, the Power architecture may still require 16MiB alignment on "!radix_enabled()" platforms. In order for libnvdimm to be able to detect and manage this per-arch limitation, introduce memremap_compat_align() as a common minimum alignment across all driver-facing memory-mapping interfaces, and let Power override it to 16MiB in the "!radix_enabled()" case. The assumption / requirement for 16MiB to be a viable memremap_compat_align() value is that Power does not have platforms where its equivalent of address-decode-registers never hardware remaps a persistent memory resource on smaller than 16MiB boundaries. Note that I tried my best to not add a new Kconfig symbol, but header include entanglements defeated the #ifndef memremap_compat_align design pattern and the need to export it defeats the __weak design pattern for arch overrides. Based on an initial patch by Aneesh. Link: http://lore.kernel.org/r/CAPcyv4gBGNP95APYaBcsocEa50tQj9b5h__83vgngjq3ouGX_Q@mail.gmail.com Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2020-02-20mm: Avoid creating virtual address aliases in brk()/mmap()/mremap()Catalin Marinas
Currently the arm64 kernel ignores the top address byte passed to brk(), mmap() and mremap(). When the user is not aware of the 56-bit address limit or relies on the kernel to return an error, untagging such pointers has the potential to create address aliases in user-space. Passing a tagged address to munmap(), madvise() is permitted since the tagged pointer is expected to be inside an existing mapping. The current behaviour breaks the existing glibc malloc() implementation which relies on brk() with an address beyond 56-bit to be rejected by the kernel. Remove untagging in the above functions by partially reverting commit ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk"). In addition, update the arm64 tagged-address-abi.rst document accordingly. Link: https://bugzilla.redhat.com/1797052 Fixes: ce18d171cb73 ("mm: untag user pointers in mmap/munmap/mremap/brk") Cc: <stable@vger.kernel.org> # 5.4.x- Cc: Florian Weimer <fweimer@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reported-by: Victor Stinner <vstinner@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-02-18tmpfs: deny and force are not huge mount optionsHugh Dickins
5.6-rc1 commit 2710c957a8ef ("fs_parse: get rid of ->enums") regressed the huge tmpfs mount options to an earlier state: "deny" and "force" are not valid there, and can crash the kernel. Delete those lines. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>