Age  Commit message  Author
2017-02-24  mm,fs,dax: change ->pmd_fault to ->huge_fault  (Dave Jiang)
Patch series "1G transparent hugepage support for device dax", v2. The following series implements support for 1G trasparent hugepage on x86 for device dax. The bulk of the code was written by Mathew Wilcox a while back supporting transparent 1G hugepage for fs DAX. I have forward ported the relevant bits to 4.10-rc. The current submission has only the necessary code to support device DAX. Comments from Dan Williams: So the motivation and intended user of this functionality mirrors the motivation and users of 1GB page support in hugetlbfs. Given expected capacities of persistent memory devices an in-memory database may want to reduce tlb pressure beyond what they can already achieve with 2MB mappings of a device-dax file. We have customer feedback to that effect as Willy mentioned in his previous version of these patches [1]. [1]: https://lkml.org/lkml/2016/1/31/52 Comments from Nilesh @ Oracle: There are applications which have a process model; and if you assume 10,000 processes attempting to mmap all the 6TB memory available on a server; we are looking at the following: processes : 10,000 memory : 6TB pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB pmd @ 2M page size: 120,000 / 512 = ~240GB pud @ 1G page size: 240GB / 512 = ~480MB As you can see with 2M pages, this system will use up an exorbitant amount of DRAM to hold the page tables; but the 1G pages finally brings it down to a reasonable level. Memory sizes will keep increasing; so this number will keep increasing. An argument can be made to convert the applications from process model to thread model, but in the real world that may not be always practical. Hopefully this helps explain the use case where this is valuable. This patch (of 3): In preparation for adding the ability to handle PUD pages, convert vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The vm_fault structure is extended to include a union of the different page table pointers that may be needed, and three flag bits are reserved to indicate which type of pointer is in the union. [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()] Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path] Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm, page_alloc: use static global work_struct for draining per-cpu pages  (Mel Gorman)
As suggested by Vlastimil Babka and Tejun Heo, this patch uses a static work_struct to co-ordinate the draining of per-cpu pages on the workqueue. Only one task can drain at a time, but this is better than the previous scheme that allowed multiple tasks to send IPIs at a time. One consideration is whether parallel requests should synchronise against each other. This patch does not synchronise for a global drain, as the common case for such callers is expected to be multiple parallel direct reclaimers competing for pages when the watermark is close to min. Draining the per-cpu list is unlikely to make much progress and serialising the drain is of dubious merit. Drains are synchronised for callers such as memory hotplug and CMA that care about the drain being complete when the function returns. Link: http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Suggested-by: Tejun Heo <tj@kernel.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
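Editor's note: a hedged sketch of the pattern described here (a mutex so only one drainer runs at a time, statically allocated per-cpu work items queued to each CPU and flushed before returning). The names and exact wiring are illustrative, not the upstream mm/page_alloc.c symbols:

  #include <linux/workqueue.h>
  #include <linux/mutex.h>
  #include <linux/percpu.h>
  #include <linux/cpu.h>

  static DEFINE_MUTEX(pcpu_drain_mutex);
  static DEFINE_PER_CPU(struct work_struct, pcpu_drain_work);

  static void drain_local_pages_wq(struct work_struct *work)
  {
      /* drain this CPU's per-cpu page lists (stand-in for drain_local_pages()) */
  }

  static void drain_all_pages_sketch(void)
  {
      int cpu;

      /* Serialise drainers; callers that need the drain to be complete
       * (memory hotplug, CMA) block here until it is their turn. */
      mutex_lock(&pcpu_drain_mutex);
      for_each_online_cpu(cpu) {
          struct work_struct *work = &per_cpu(pcpu_drain_work, cpu);

          INIT_WORK(work, drain_local_pages_wq);
          schedule_work_on(cpu, work);      /* no IPIs: run on the workqueue */
      }
      for_each_online_cpu(cpu)
          flush_work(&per_cpu(pcpu_drain_work, cpu));
      mutex_unlock(&pcpu_drain_mutex);
  }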
2017-02-24  mm, page_alloc: don't check cpuset allowed twice in fast-path  (Vlastimil Babka)
Since commit 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator") we replace a NULL nodemask with cpuset_current_mems_allowed in the fast path, so that get_page_from_freelist() filters nodes allowed by the cpuset via for_next_zone_zonelist_nodemask(). In that case it's pointless to additionally check __cpuset_zone_allowed() in each iteration, which we can avoid by not adding ALLOC_CPUSET to alloc_flags in that scenario. This saves some cycles in the allocator fast path on systems with one or more non-root cpusets configured. In the slow path, ALLOC_CPUSET is reset according to __alloc_pages_slowpath(). Without configured cpusets, this code is disabled by a static key. Link: http://lkml.kernel.org/r/20170124150511.5710-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm, page_alloc: remove redundant checks from alloc fastpath  (Vlastimil Babka)
The allocation fast path contains two similar checks for zoneref->zone being NULL, where zoneref points either to the first zone in the zonelist, or to the preferred zone. These can be NULL either due to an empty zonelist, or no zone being compatible with the given nodemask or the task's cpuset. These checks are unnecessary, because the zonelist walks in first_zones_zonelist() and get_page_from_freelist() handle a NULL starting zoneref->zone or preferred_zoneref->zone safely. It's safe to fall back to __alloc_pages_slowpath(), where we also have the check early enough. Link: http://lkml.kernel.org/r/20170124150511.5710-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  zram: remove waitqueue for IO done  (Minchan Kim)
zram_reset_device() waits for ongoing writepage pages to be completed by zram->refcount logic. However, it's pointless because, before the reset, we prevent further opening of zram via zram->claim and flush all pending IO with fsync_bdev, so there should be no pending IO by the time zram_reset_device() runs. So let's remove that code, which is even broken due to the lack of a wake_up elsewhere. Link: http://lkml.kernel.org/r/1485145031-11661-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm: fix comments for mmap_init()  (seokhoon.yoon)
mmap_init() is no longer associated with the VMA slab, so fix the comment. Link: http://lkml.kernel.org/r/1485182601-9294-1-git-send-email-iamyooon@gmail.com Signed-off-by: seokhoon.yoon <iamyooon@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf  (Dave Jiang)
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
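Editor's note: an illustrative before/after of the signature change, as a sketch. It assumes the vma pointer had already been added to struct vm_fault by a preceding series, so handlers can recover it from vmf:

  /* before: handlers took the vma explicitly */
  int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);

  /* after: the vma travels inside vmf */
  int (*fault)(struct vm_fault *vmf);

  static int example_fault(struct vm_fault *vmf)
  {
      struct vm_area_struct *vma = vmf->vma;  /* recover the vma from vmf */

      /* ... handle the fault against vma ... */
      return VM_FAULT_SIGBUS;                 /* placeholder return for the sketch */
  }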
2017-02-24  mm, page_alloc: only use per-cpu allocator for irq-safe requests  (Mel Gorman)
Many workloads that allocate pages are not handling an interrupt at a time. As allocation requests may be from IRQ context, it's necessary to disable/enable IRQs for every page allocation. This cost is the bulk of the free path but also a significant percentage of the allocation path.

This patch alters the locking and checks such that only irq-safe allocation requests use the per-cpu allocator. All others acquire the irq-safe zone->lock and allocate from the buddy allocator. It relies on disabling preemption to safely access the per-cpu structures. It could be slightly modified to avoid soft IRQs using it but it's not clear it's worthwhile.

This modification may slow allocations from IRQ context slightly, but the main gain from the per-cpu allocator is that it scales better for allocations from multiple contexts. There is an implicit assumption that intensive allocations from IRQ contexts on multiple CPUs from a single NUMA node are rare and that the vast majority of scaling issues are encountered in !IRQ contexts such as page faulting. It's worth noting that this patch is not required for a bulk page allocator but it significantly reduces the overhead.

The following are results from a page allocator micro-benchmark. Only order-0 is interesting as higher orders do not use the per-cpu allocator:

                               4.10.0-rc2            4.10.0-rc2
                                  vanilla          irqsafe-v1r5
  Amean alloc-odr0-1      287.15 (  0.00%)     219.00 ( 23.73%)
  Amean alloc-odr0-2      221.23 (  0.00%)     183.23 ( 17.18%)
  Amean alloc-odr0-4      187.00 (  0.00%)     151.38 ( 19.05%)
  Amean alloc-odr0-8      167.54 (  0.00%)     132.77 ( 20.75%)
  Amean alloc-odr0-16     156.00 (  0.00%)     123.00 ( 21.15%)
  Amean alloc-odr0-32     149.00 (  0.00%)     118.31 ( 20.60%)
  Amean alloc-odr0-64     138.77 (  0.00%)     116.00 ( 16.41%)
  Amean alloc-odr0-128    145.00 (  0.00%)     118.00 ( 18.62%)
  Amean alloc-odr0-256    136.15 (  0.00%)     125.00 (  8.19%)
  Amean alloc-odr0-512    147.92 (  0.00%)     121.77 ( 17.68%)
  Amean alloc-odr0-1024   147.23 (  0.00%)     126.15 ( 14.32%)
  Amean alloc-odr0-2048   155.15 (  0.00%)     129.92 ( 16.26%)
  Amean alloc-odr0-4096   164.00 (  0.00%)     136.77 ( 16.60%)
  Amean alloc-odr0-8192   166.92 (  0.00%)     138.08 ( 17.28%)
  Amean alloc-odr0-16384  159.00 (  0.00%)     138.00 ( 13.21%)
  Amean free-odr0-1       165.00 (  0.00%)      89.00 ( 46.06%)
  Amean free-odr0-2       113.00 (  0.00%)      63.00 ( 44.25%)
  Amean free-odr0-4        99.00 (  0.00%)      54.00 ( 45.45%)
  Amean free-odr0-8        88.00 (  0.00%)      47.38 ( 46.15%)
  Amean free-odr0-16       83.00 (  0.00%)      46.00 ( 44.58%)
  Amean free-odr0-32       80.00 (  0.00%)      44.38 ( 44.52%)
  Amean free-odr0-64       72.62 (  0.00%)      43.00 ( 40.78%)
  Amean free-odr0-128      78.00 (  0.00%)      42.00 ( 46.15%)
  Amean free-odr0-256      80.46 (  0.00%)      57.00 ( 29.16%)
  Amean free-odr0-512      96.38 (  0.00%)      64.69 ( 32.88%)
  Amean free-odr0-1024    107.31 (  0.00%)      72.54 ( 32.40%)
  Amean free-odr0-2048    108.92 (  0.00%)      78.08 ( 28.32%)
  Amean free-odr0-4096    113.38 (  0.00%)      82.23 ( 27.48%)
  Amean free-odr0-8192    112.08 (  0.00%)      82.85 ( 26.08%)
  Amean free-odr0-16384   110.38 (  0.00%)      81.92 ( 25.78%)
  Amean total-odr0-1      452.15 (  0.00%)     308.00 ( 31.88%)
  Amean total-odr0-2      334.23 (  0.00%)     246.23 ( 26.33%)
  Amean total-odr0-4      286.00 (  0.00%)     205.38 ( 28.19%)
  Amean total-odr0-8      255.54 (  0.00%)     180.15 ( 29.50%)
  Amean total-odr0-16     239.00 (  0.00%)     169.00 ( 29.29%)
  Amean total-odr0-32     229.00 (  0.00%)     162.69 ( 28.96%)
  Amean total-odr0-64     211.38 (  0.00%)     159.00 ( 24.78%)
  Amean total-odr0-128    223.00 (  0.00%)     160.00 ( 28.25%)
  Amean total-odr0-256    216.62 (  0.00%)     182.00 ( 15.98%)
  Amean total-odr0-512    244.31 (  0.00%)     186.46 ( 23.68%)
  Amean total-odr0-1024   254.54 (  0.00%)     198.69 ( 21.94%)
  Amean total-odr0-2048   264.08 (  0.00%)     208.00 ( 21.24%)
  Amean total-odr0-4096   277.38 (  0.00%)     219.00 ( 21.05%)
  Amean total-odr0-8192   279.00 (  0.00%)     220.92 ( 20.82%)
  Amean total-odr0-16384  269.38 (  0.00%)     219.92 ( 18.36%)

This is the alloc, free and total overhead of allocating order-0 pages in batches of 1 page up to 16384 pages. Avoiding the disabling/enabling overhead massively reduces the overhead. Alloc overhead is roughly reduced by 14-20% in most cases. The free path is reduced by 26-46% and the total reduction is significant.

Many users require zeroing of pages from the page allocator, which is the vast cost of allocation. Hence, the impact on a basic page faulting benchmark is not that significant:

                               4.10.0-rc2            4.10.0-rc2
                                  vanilla          irqsafe-v1r5
  Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
  Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
  Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
  Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
  CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
  CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
  Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
  Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)

This is from aim9 and the most notable outcome is that fault variability is reduced by the patch. The headline improvement is small as the overall fault cost, zeroing, page table insertion etc. dominate relative to disabling/enabling IRQs in the per-cpu allocator. Similarly, little benefit was seen on networking benchmarks, both localhost and between physical server/clients, where other costs dominate. It's possible that this will only be noticeable on very high speed networks.

Jesper Dangaard Brouer independently tested this with a separate microbenchmark from https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

Micro-benchmarked with [1] page_bench02:
  modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
  rmmod page_bench02 ; dmesg --notime | tail -n 4

  Compared to baseline: 213 cycles(tsc) 53.417 ns
  - against this      : 184 cycles(tsc) 46.056 ns
  - Saving            : -29 cycles
  - Very close to expected 27 cycles saving [see below [2]]

Micro benchmarking via time_bench_sample[3], we get the cost of these operations:

  time_bench: Type:for_loop                       Per elem:   0 cycles(tsc)  0.232 ns (step:0)
  time_bench: Type:spin_lock_unlock               Per elem:  33 cycles(tsc)  8.334 ns (step:0)
  time_bench: Type:spin_lock_unlock_irqsave       Per elem:  62 cycles(tsc) 15.607 ns (step:0)
  time_bench: Type:irqsave_before_lock            Per elem:  57 cycles(tsc) 14.344 ns (step:0)
  time_bench: Type:spin_lock_unlock_irq           Per elem:  34 cycles(tsc)  8.560 ns (step:0)
  time_bench: Type:simple_irq_disable_before_lock Per elem:  37 cycles(tsc)  9.289 ns (step:0)
  time_bench: Type:local_BH_disable_enable        Per elem:  19 cycles(tsc)  4.920 ns (step:0)
  time_bench: Type:local_IRQ_disable_enable       Per elem:   7 cycles(tsc)  1.864 ns (step:0)
  time_bench: Type:local_irq_save_restore         Per elem:  38 cycles(tsc)  9.665 ns (step:0)
  [Mel's patch removes a ^^^^^^^^^^^^^^^^]                   ^^^^^^^^^ expected saving - preempt cost
  time_bench: Type:preempt_disable_enable         Per elem:  11 cycles(tsc)  2.794 ns (step:0)
  [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]                   ^^^^^^^^^ adds this cost
  time_bench: Type:funcion_call_cost              Per elem:   6 cycles(tsc)  1.689 ns (step:0)
  time_bench: Type:func_ptr_call_cost             Per elem:  11 cycles(tsc)  2.767 ns (step:0)
  time_bench: Type:page_alloc_put                 Per elem: 211 cycles(tsc) 52.803 ns (step:0)

Thus, expected improvement is: 38-11 = 27 cycles.
[mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/] Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
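Editor's note: to make the gating concrete, a hedged sketch of the structure described above. The helper names (pcplist_alloc, buddy_alloc) are hypothetical stand-ins, not the functions touched by the patch:

  /* Sketch: order-0 requests from !IRQ context use the per-cpu lists under
   * preempt_disable(); everything else takes the irq-safe zone->lock and
   * allocates from the buddy lists. */
  #include <linux/mmzone.h>
  #include <linux/gfp.h>
  #include <linux/interrupt.h>

  struct page *pcplist_alloc(struct zone *zone, gfp_t gfp, int migratetype);           /* hypothetical */
  struct page *buddy_alloc(struct zone *zone, unsigned int order, int migratetype);    /* hypothetical */

  static struct page *rmqueue_sketch(struct zone *zone, unsigned int order,
                                     gfp_t gfp_flags, int migratetype)
  {
      struct page *page;
      unsigned long flags;

      if (order == 0 && !in_interrupt()) {
          preempt_disable();              /* pin to this CPU's lists; IRQs stay enabled */
          page = pcplist_alloc(zone, gfp_flags, migratetype);
          preempt_enable();
          return page;
      }

      spin_lock_irqsave(&zone->lock, flags);  /* irq-safe buddy path for everyone else */
      page = buddy_alloc(zone, order, migratetype);
      spin_unlock_irqrestore(&zone->lock, flags);
      return page;
  }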
2017-02-24  mm, page_alloc: do not depend on cpu hotplug locks inside the allocator  (Michal Hocko)
Dmitry has reported the following lockdep splat:

  lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
  __mutex_lock_common kernel/locking/mutex.c:521 [inline]
  mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
  pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
  __alloc_percpu+0x24/0x30 mm/percpu.c:1075
  smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44
  cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136
  cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493
  _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057
  do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
  cpu_up+0x18/0x20 kernel/cpu.c:1095
  smp_init+0xe9/0xee kernel/smp.c:564
  kernel_init_freeable+0x439/0x690 init/main.c:1010
  kernel_init+0x13/0x180 init/main.c:941
  ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

  cpu_hotplug_begin
    cpu_hotplug.lock
  pcpu_alloc
    pcpu_alloc_mutex

  get_online_cpus+0x62/0x90 kernel/cpu.c:248
  drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385
  __alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline]
  __alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778
  __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980
  __alloc_pages include/linux/gfp.h:426 [inline]
  __alloc_pages_node include/linux/gfp.h:439 [inline]
  alloc_pages_node include/linux/gfp.h:453 [inline]
  pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
  pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
  pcpu_alloc+0xe01/0x1280 mm/percpu.c:998
  __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062
  bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline]
  array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99
  find_and_alloc_map kernel/bpf/syscall.c:34 [inline]
  map_create kernel/bpf/syscall.c:188 [inline]
  SYSC_bpf kernel/bpf/syscall.c:870 [inline]
  SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827
  entry_SYSCALL_64_fastpath+0x1f/0xc2

  pcpu_alloc
    pcpu_alloc_mutex
  drain_all_pages
    get_online_cpus
      cpu_hotplug.lock

  cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304
  _cpu_up+0xca/0x2a0 kernel/cpu.c:1011
  do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
  cpu_up+0x18/0x20 kernel/cpu.c:1095
  smp_init+0xe9/0xee kernel/smp.c:564
  kernel_init_freeable+0x439/0x690 init/main.c:1010
  kernel_init+0x13/0x180 init/main.c:941
  ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

  cpu_hotplug_begin
    cpu_hotplug.lock

Pulling cpu hotplug locks inside the page allocator is just too dangerous. Let's remove the dependency by dropping get_online_cpus() from drain_all_pages. This is not so simple though, because now we do not have a protection against cpu hotplug, which means two things:

  - the work item might be executed on a different cpu in a worker from an unbound pool, so it doesn't run pinned on the cpu

  - we have to make sure that we do not race with page_alloc_cpu_dead calling drain_pages_zone

Disabling preemption in drain_local_pages_wq will solve the first problem: drain_local_pages will determine its local CPU from the WQ context, which will be stable after that point; page_alloc_cpu_dead is pinned to the CPU already. The latter condition is achieved by disabling IRQs in drain_pages_zone.

Fixes: mm, page_alloc: drain per-cpu pages from workqueue context Link: http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm, page_alloc: drain per-cpu pages from workqueue context  (Mel Gorman)
The per-cpu page allocator can be drained immediately via drain_all_pages(), which sends IPIs to every CPU. In the next patch, the per-cpu allocator will only be used for interrupt-safe allocations, which prevents draining it from IPI context. This patch uses workqueues to drain the per-cpu lists instead. This is slower, but no slowdown during intensive reclaim was measured and the paths that use drain_all_pages() are not that sensitive to performance. This is particularly true as the path would only be triggered when reclaim is failing. It also makes some sense to avoid storming a machine with IPIs when it's under memory pressure. Arguably, it should be further adjusted so that only one caller at a time is draining pages, but it's beyond the scope of the current patch. Link: http://lkml.kernel.org/r/20170123153906.3122-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm, page_alloc: split alloc_pages_nodemask()  (Mel Gorman)
__alloc_pages_nodemask() does a number of preparation steps that determine what zones can be used for the allocation depending on a variety of factors. This is fine, but a hypothetical caller that wanted multiple order-0 pages has to do the preparation steps multiple times. This patch structures __alloc_pages_nodemask such that it's relatively easy to build a bulk order-0 page allocator. There is no functional change. Link: http://lkml.kernel.org/r/20170123153906.3122-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm, page_alloc: split buffered_rmqueue()  (Mel Gorman)
Patch series "Use per-cpu allocator for !irq requests and prepare for a bulk allocator", v5. This series is motivated by a conversation led by Jesper Dangaard Brouer at the last LSF/MM proposing a generic page pool for DMA-coherent pages. Part of his motivation was due to the overhead of allocating multiple order-0 that led some drivers to use high-order allocations and splitting them. This is very slow in some cases. The first two patches in this series restructure the page allocator such that it is relatively easy to introduce an order-0 bulk page allocator. A patch exists to do that and has been handed over to Jesper until an in-kernel users is created. The third patch prevents the per-cpu allocator being drained from IPI context as that can potentially corrupt the list after patch four is merged. The final patch alters the per-cpu alloctor to make it exclusive to !irq requests. This cuts allocation/free overhead by roughly 30%. Performance tests from both Jesper and me are included in the patch. This patch (of 4): buffered_rmqueue removes a page from a given zone and uses the per-cpu list for order-0. This is fine but a hypothetical caller that wanted multiple order-0 pages has to disable/reenable interrupts multiple times. This patch structures buffere_rmqueue such that it's relatively easy to build a bulk order-0 page allocator. There is no functional change. [mgorman@techsingularity.net: failed per-cpu refill may blow up] Link: http://lkml.kernel.org/r/20170124112723.mshmgwq2ihxku2um@techsingularity.net Link: http://lkml.kernel.org/r/20170123153906.3122-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm: vmscan: move dirty pages out of the way until they're flushed  (Johannes Weiner)
We noticed a performance regression when moving hadoop workloads from 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout activity initiated by kswapd as well as frequent bursts of allocation stalls and direct reclaim scans. Even lowering the dirty ratios to the equivalent of less than 1% of memory would not eliminate the issue, suggesting that dirty pages concentrate where the scanner is looking. This can be traced back to recent efforts of thrash avoidance. Where 3.10 would not detect refaulting pages and continuously supply clean cache to the inactive list, a thrashing workload on 4.0+ will detect and activate refaulting pages right away, distilling used-once pages on the inactive list much more effectively. This is by design, and it makes sense for clean cache. But for the most part our workload's cache faults are refaults and its use-once cache is from streaming writes. We end up with most of the inactive list dirty, and we don't go after the active cache as long as we have use-once pages around. But waiting for writes to avoid reclaiming clean cache that *might* refault is a bad trade-off. Even if the refaults happen, reads are faster than writes. Before getting bogged down on writeback, reclaim should first look at *all* cache in the system, even active cache. To accomplish this, activate pages that are dirty or under writeback when they reach the end of the inactive LRU. The pages are marked for immediate reclaim, meaning they'll get moved back to the inactive LRU tail as soon as they're written back and become reclaimable. But in the meantime, by reducing the inactive list to only immediately reclaimable pages, we allow the scanner to deactivate and refill the inactive list with clean cache from the active list tail to guarantee forward progress. [hannes@cmpxchg.org: update comment] Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm: vmscan: only write dirty pages that the scanner has seen twice  (Johannes Weiner)
Dirty pages can easily reach the end of the LRU while there are still clean pages to reclaim around. Don't let kswapd write them back just because there are a lot of them. It costs more CPU to find the clean pages, but that's almost certainly better than to disrupt writeback from the flushers with LRU-order single-page writes from reclaim. And the flushers have been woken up by that point, so we spend IO capacity on flushing and CPU capacity on finding the clean cache. Only start writing dirty pages if they have cycled around the LRU twice now and STILL haven't been queued on the IO device. It's possible that the dirty pages are so sparsely distributed across different bdis, inodes, memory cgroups, that the flushers take forever to get to the ones we want reclaimed. Once we see them twice on the LRU, we know that's the quicker way to find them, so do LRU writeback. Link: http://lkml.kernel.org/r/20170123181641.23938-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm: vmscan: remove old flusher wakeup from direct reclaim path  (Johannes Weiner)
Direct reclaim has been replaced by kswapd reclaim in pretty much all common memory pressure situations, so this code most likely doesn't accomplish the described effect anymore. The previous patch wakes up flushers for all reclaimers when we encounter dirty pages at the tail end of the LRU. Remove the crufty old direct reclaim invocation. Link: http://lkml.kernel.org/r/20170123181641.23938-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm: vmscan: kick flushers when we encounter dirty pages on the LRU  (Johannes Weiner)
Memory pressure can put dirty pages at the end of the LRU without anybody running into dirty limits. Don't start writing individual pages from kswapd while the flushers might be asleep. Unlike the old direct reclaim flusher wakeup (removed in the next patch) that flushes the number of pages just scanned, this patch wakes the flushers for all outstanding dirty pages. That seemed to perform better in a synthetic test that pushes dirty pages to the end of the LRU and into reclaim, because we know LRU aging outstrips writeback already, and this way we give younger dirty pages a headstart rather than wait until reclaim runs into them as well. It also means less plugging and risk of exhausting the struct request pool from reclaim. There is a concern that this will cause temporary files that used to get dirtied and truncated before writeback to now get written to disk under memory pressure. If this turns out to be a real problem, we'll have to revisit this and tame the reclaim flusher wakeups. [hannes@cmpxchg.org: mention dirty expiration as a condition] Link: http://lkml.kernel.org/r/20170126174739.GA30636@cmpxchg.org Link: http://lkml.kernel.org/r/20170123181641.23938-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm: vmscan: scan dirty pages even in laptop mode  (Johannes Weiner)
Patch series "mm: vmscan: fix kswapd writeback regression". We noticed a regression on multiple hadoop workloads when moving from 3.10 to 4.0 and 4.6, which involves kswapd getting tangled up in page writeout, causing direct reclaim herds that also don't make progress. I tracked it down to the thrash avoidance efforts after 3.10 that make the kernel better at keeping use-once cache and use-many cache sorted on the inactive and active list, with more aggressive protection of the active list as long as there is inactive cache. Unfortunately, our workload's use-once cache is mostly from streaming writes. Waiting for writes to avoid potential reloads in the future is not a good tradeoff. These patches do the following: 1. Wake the flushers when kswapd sees a lump of dirty pages. It's possible to be below the dirty background limit and still have cache velocity push them through the LRU. So start a-flushin'. 2. Let kswapd only write pages that have been rotated twice. This makes sure we really tried to get all the clean pages on the inactive list before resorting to horrible LRU-order writeback. 3. Move rotating dirty pages off the inactive list. Instead of churning or waiting on page writeback, we'll go after clean active cache. This might lead to thrashing, but in this state memory demand outstrips IO speed anyway, and reads are faster than writes. Mel backported the series to 4.10-rc5 with one minor conflict and ran a couple of tests on it. Mix of read/write random workload didn't show anything interesting. Write-only database didn't show much difference in performance but there were slight reductions in IO -- probably in the noise. simoop did show big differences although not as big as Mel expected. This is Chris Mason's workload that similate the VM activity of hadoop. Mel won't go through the full details but over the samples measured during an hour it reported 4.10.0-rc5 4.10.0-rc5 vanilla johannes-v1r1 Amean p50-Read 21346531.56 ( 0.00%) 21697513.24 ( -1.64%) Amean p95-Read 24700518.40 ( 0.00%) 25743268.98 ( -4.22%) Amean p99-Read 27959842.13 ( 0.00%) 28963271.11 ( -3.59%) Amean p50-Write 1138.04 ( 0.00%) 989.82 ( 13.02%) Amean p95-Write 1106643.48 ( 0.00%) 12104.00 ( 98.91%) Amean p99-Write 1569213.22 ( 0.00%) 36343.38 ( 97.68%) Amean p50-Allocation 85159.82 ( 0.00%) 79120.70 ( 7.09%) Amean p95-Allocation 204222.58 ( 0.00%) 129018.43 ( 36.82%) Amean p99-Allocation 278070.04 ( 0.00%) 183354.43 ( 34.06%) Amean final-p50-Read 21266432.00 ( 0.00%) 21921792.00 ( -3.08%) Amean final-p95-Read 24870912.00 ( 0.00%) 26116096.00 ( -5.01%) Amean final-p99-Read 28147712.00 ( 0.00%) 29523968.00 ( -4.89%) Amean final-p50-Write 1130.00 ( 0.00%) 977.00 ( 13.54%) Amean final-p95-Write 1033216.00 ( 0.00%) 2980.00 ( 99.71%) Amean final-p99-Write 1517568.00 ( 0.00%) 32672.00 ( 97.85%) Amean final-p50-Allocation 86656.00 ( 0.00%) 78464.00 ( 9.45%) Amean final-p95-Allocation 211712.00 ( 0.00%) 116608.00 ( 44.92%) Amean final-p99-Allocation 287232.00 ( 0.00%) 168704.00 ( 41.27%) The latencies are actually completely horrific in comparison to 4.4 (and 4.10-rc5 is worse than 4.9 according to historical data for reasons Mel hasn't analysed yet). Still, 95% of write latency (p95-write) is halved by the series and allocation latency is way down. Direct reclaim activity is one fifth of what it was according to vmstats. Kswapd activity is higher but this is not necessarily surprising. 
Kswapd efficiency is unchanged at 99% (99% of pages scanned were reclaimed) but direct reclaim efficiency went from 77% to 99% In the vanilla kernel, 627MB of data was written back from reclaim context. With the series, no data was written back. With or without the patch, pages are being immediately reclaimed after writeback completes. However, with the patch, only 1/8th of the pages are reclaimed like this. This patch (of 5): We have an elaborate dirty/writeback throttling mechanism inside the reclaim scanner, but for that to work the pages have to go through shrink_page_list() and get counted for what they are. Otherwise, we mess up the LRU order and don't match reclaim speed to writeback. Especially during deactivation, there is never a reason to skip dirty pages; nothing is even trying to write them out from there. Don't mess up the LRU order for nothing, shuffle these pages along. Link: http://lkml.kernel.org/r/20170123181641.23938-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  userfaultfd: non-cooperative: selftest: enable REMOVE event test for shmem  (Mike Rapoport)
Now that madvise(MADV_REMOVE) notifies the uffd reader, we should verify that the application actually sees zeros at the removed range. Link: http://lkml.kernel.org/r/1484814154-1557-4-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  userfaultfd: non-cooperative: add madvise() event for MADV_REMOVE request  (Mike Rapoport)
When a page is removed from a shared mapping, the uffd reader should be notified, so that it won't attempt to handle #PF events for the removed pages. We can reuse UFFD_EVENT_REMOVE because, from the uffd monitor's point of view, the semantics of madvise(MADV_DONTNEED) and madvise(MADV_REMOVE) are exactly the same. Link: http://lkml.kernel.org/r/1484814154-1557-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  userfaultfd: non-cooperative: rename *EVENT_MADVDONTNEED to *EVENT_REMOVE  (Mike Rapoport)
Patch series "userfaultfd: non-cooperative: add madvise() event for MADV_REMOVE request". These patches add notification of madvise(MADV_REMOVE) event to non-cooperative userfaultfd monitor. The first pacth renames EVENT_MADVDONTNEED to EVENT_REMOVE along with relevant functions and structures. Using _REMOVE instead of _MADVDONTNEED describes the event semantics more clearly and I hope it's not too late for such change in the ABI. This patch (of 3): The UFFD_EVENT_MADVDONTNEED purpose is to notify uffd monitor about removal of certain range from address space tracked by userfaultfd. Hence, UFFD_EVENT_REMOVE seems to better reflect the operation semantics. Respectively, 'madv_dn' field of uffd_msg is renamed to 'remove' and the madvise_userfault_dontneed callback is renamed to userfaultfd_remove. Link: http://lkml.kernel.org/r/1484814154-1557-2-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  memblock: embed memblock type name within struct memblock_type  (Heiko Carstens)
Provide the name of each memblock type with struct memblock_type. This allows us to get rid of the function memblock_type_name() and of duplicating the type names in __memblock_dump_all(). The only memblock_type usage outside of mm/memblock.c seems to be arch/s390/kernel/crash_dump.c. While at it, give it a name. Link: http://lkml.kernel.org/r/20170120123456.46508-4-heiko.carstens@de.ibm.com Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
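Editor's note: a sketch of what embedding the name looks like; the surrounding fields are shown from memory and may not match the exact struct layout of that kernel version:

  /* Carry the human-readable name in the type itself so dump code can
   * simply print type->name instead of mapping pointers to strings. */
  struct memblock_type {
      unsigned long cnt;              /* number of regions */
      unsigned long max;              /* size of the allocated array */
      phys_addr_t total_size;         /* size of all regions */
      struct memblock_region *regions;
      char *name;                     /* the addition described above */
  };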
2017-02-24  memblock: also dump physmem list within __memblock_dump_all  (Heiko Carstens)
Since commit 70210ed950b5 ("mm/memblock: add physical memory list") the memblock structure knows about a physical memory list. The physical memory list should also be dumped when memblock_dump_all() is called and memblock_debug is switched on. This makes debugging a bit easier. Link: http://lkml.kernel.org/r/20170120123456.46508-3-heiko.carstens@de.ibm.com Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  memblock: let memblock_type_name know about physmem type  (Heiko Carstens)
Since commit 70210ed950b5 ("mm/memblock: add physical memory list") the memblock structure knows about a physical memory list. memblock_type_name() should return "physmem" instead of "unknown" if the name of the physmem memblock_type is being asked for. Link: http://lkml.kernel.org/r/20170120123456.46508-2-heiko.carstens@de.ibm.com Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm/memory_hotplug.c: unexport __remove_pages()  (Andrew Morton)
It has no modular callers. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm: validate device_hotplug is held for memory hotplug  (Dan Williams)
mem_hotplug_begin() assumes that it can set mem_hotplug.active_writer and run the hotplug process without racing another thread. Validate this assumption with a lockdep assertion. Link: http://lkml.kernel.org/r/148693886229.16345.1770484669403334689.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
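Editor's note: in outline, the validation amounts to a lockdep assertion at the top of mem_hotplug_begin(); the real device_hotplug lock is private to drivers/base/core.c, so upstream wraps the assertion in a helper. A hedged sketch of the idea:

  void mem_hotplug_begin(void)
  {
      /* Sketch only: assert the caller took the device_hotplug lock before
       * starting a memory hotplug operation (illustrative symbol name). */
      lockdep_assert_held(&device_hotplug_lock);

      /* ... then set mem_hotplug.active_writer and take mem_hotplug.lock ... */
  }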
2017-02-24  mm, devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin, done}  (Dan Williams)
The mem_hotplug_{begin,done} lock coordinates with {get,put}_online_mems() to hold off "readers" of the current state of memory from new hotplug actions. mem_hotplug_begin() expects exclusive access, via the device_hotplug lock, to set mem_hotplug.active_writer. Calling mem_hotplug_begin() without locking device_hotplug can lead to corrupting mem_hotplug.refcount and missed wakeups / soft lockups. [dan.j.williams@intel.com: v2] Link: http://lkml.kernel.org/r/148728203365.38457.17804568297887708345.stgit@dwillia2-desk3.amr.corp.intel.com Link: http://lkml.kernel.org/r/148693885680.16345.17802627926777862337.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: f931ab479dd2 ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm, oom: header nodemask is NULL when cpusets are disabled  (David Rientjes)
Commit 82e7d3abec86 ("oom: print nodemask in the oom report") implicitly sets the allocation nodemask to cpuset_current_mems_allowed when there is no effective mempolicy. cpuset_current_mems_allowed is only effective when cpusets are enabled, which is also printed by dump_header(), so setting the nodemask to cpuset_current_mems_allowed is redundant and prevents debugging issues where ac->nodemask is not set properly in the page allocator. This provides better debugging output since cpuset_print_current_mems_allowed() is already provided. [rientjes@google.com: newline per Hillf] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701200158300.88321@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701191454470.2381@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  mm/ksm: improve deduplication of zero pages with colouring  (Claudio Imbrenda)
Some architectures have a set of zero pages (coloured zero pages) instead of only one zero page, in order to improve the cache performance. In those cases, the kernel samepage merger (KSM) would merge all the allocated pages that happen to be filled with zeroes to the same deduplicated page, thus losing all the advantages of coloured zero pages. This behaviour is noticeable when a process accesses large arrays of allocated pages containing zeroes. A test I conducted on s390 shows that there is a speed penalty when KSM merges such pages, compared to not merging them or using actual zero pages from the start without breaking the COW. This patch fixes this behaviour. When coloured zero pages are present, the checksum of a zero page is calculated during initialisation, and compared with the checksum of the current candidate during merging. In case of a match, the normal merging routine is used to merge the page with the correct coloured zero page, which ensures the candidate page is checked to be equal to the target zero page. A sysfs entry is also added to toggle this behaviour, since it can potentially introduce performance regressions, especially on architectures without coloured zero pages. The default value is disabled, for backwards compatibility. With this patch, the performance with KSM is the same as with non COW-broken actual zero pages, which is also the same as without KSM. [akpm@linux-foundation.org: make zero_checksum and ksm_use_zero_pages __read_mostly, per Andrea] [imbrenda@linux.vnet.ibm.com: documentation for coloured zero pages deduplication] Link: http://lkml.kernel.org/r/1484927522-1964-1-git-send-email-imbrenda@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1484850953-23941-1-git-send-email-imbrenda@linux.vnet.ibm.com Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
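Editor's note: a hedged sketch of the mechanism (inside mm/ksm.c context): remember the checksum of the empty page once, and, when the sysfs knob is on, treat matching candidates as zero pages so they get merged against the correct coloured zero page rather than with each other. calc_checksum() is ksm's internal page checksum; the variable names come from the changelog:

  static u32 zero_checksum __read_mostly;
  static bool ksm_use_zero_pages __read_mostly;   /* sysfs-controlled, default off */

  static void ksm_init_zero_checksum(void)
  {
      /* checksum of an all-zero page, computed once at initialisation */
      zero_checksum = calc_checksum(ZERO_PAGE(0));
  }

  static bool is_zero_candidate(struct page *page)
  {
      /* candidates matching the zero checksum take the zero-page merge path */
      return ksm_use_zero_pages && calc_checksum(page) == zero_checksum;
  }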
2017-02-24  cris: use generic current.h  (Davidlohr Bueso)
Given that the arch does not add its own implementations, simply use the asm-generic/current.h (generic-y) header instead of duplicating code. Link: http://lkml.kernel.org/r/1485992878-4780-3-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Mikael Starvik <starvik@axis.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc  (Linus Torvalds)
Pull sparc updates from David Miller:

  1) Support multiple huge page sizes, from Nitin Gupta.

  2) Improve boot time on large memory configurations, from Pavel Tatashin.

  3) Make BRK handling more consistent and documented, from Vijay Kumar.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Fix build error in flush_tsb_user_page
  sparc64: memblock resizes are not handled properly
  sparc64: use latency groups to improve add_node_ranges speed
  sparc64: Add 64K page size support
  sparc64: Multi-page size support
  Documentation/sparc: Steps for sending break on sunhv console
  sparc64: Send break twice from console to return to boot prom
  sparc64: Migrate hvcons irq to panicked cpu
  sparc64: Set cpu state to offline when stopped
  sunvdc: Add support for setting physical sector size
  sparc64: fix for user probes in high memory
  sparc: topology_64.h: Fix condition for including cpudata.h
  sparc32: mm: srmmu: add __ro_after_init to sparc32_cachetlb_ops structures
2017-02-24  Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md  (Linus Torvalds)
Pull md updates from Shaohua Li:
 "Mainly fixes bugs and improves performance:

  - Improve scalability for raid1 from Coly

  - Improve raid5-cache read performance, disk efficiency and IO pattern from Song and me

  - Fix a race condition of disk hotplug for linear from Coly

  - A few cleanup patches from Ming and Byungchul

  - Fix a memory leak from Neil

  - Fix WRITE SAME IO failure from me

  - Add doc for raid5-cache from me"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits)
  md/raid1: fix write behind issues introduced by bio_clone_bioset_partial
  md/raid1: handle flush request correctly
  md/linear: shutup lockdep warnning
  md/raid1: fix a use-after-free bug
  RAID1: avoid unnecessary spin locks in I/O barrier code
  RAID1: a new I/O barrier implementation to remove resync window
  md/raid5: Don't reinvent the wheel but use existing llist API
  md: fast clone bio in bio_clone_mddev()
  md: remove unnecessary check on mddev
  md/raid1: use bio_clone_bioset_partial() in case of write behind
  md: fail if mddev->bio_set can't be created
  block: introduce bio_clone_bioset_partial()
  md: disable WRITE SAME if it fails in underlayer disks
  md/raid5-cache: exclude reclaiming stripes in reclaim check
  md/raid5-cache: stripe reclaim only counts valid stripes
  MD: add doc for raid5-cache
  Documentation: move MD related doc into a separate dir
  md: ensure md devices are freed before module is unloaded.
  md/r5cache: improve journal device efficiency
  md/r5cache: enable chunk_aligned_read with write back cache
  ...
2017-02-24  Merge branch 'for-linus' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull block updates and fixes from Jens Axboe:

  - NVMe updates and fixes that missed the first pull request. This includes bug fixes, and support for autonomous power management.

  - Fix from Christoph for missing clear of the request payload, causing a problem with (at least) the storvsc driver.

  - Further fixes for the queue/bdi life time issues from Jan.

  - The Kconfig mq scheduler update from me.

  - Fixing a use-after-free in dm-rq, spotted by Bart, introduced in this merge window.

  - Three fixes for nbd from Josef.

  - Fix from Omar for a bug in the sas transport code that oopses when bsg ioctls are used.

  - Improvements to the queue restart and tag wait, from Omar.

  - Set of fixes for the sed/opal code from Scott.

  - Three trivial patches to cciss from Tobin

* 'for-linus' of git://git.kernel.dk/linux-block: (41 commits)
  dm-rq: don't dereference request payload after ending request
  blk-mq-sched: separate mark hctx and queue restart operations
  blk-mq: use sbq wait queues instead of restart for driver tags
  block/sed-opal: Propagate original error message to userland.
  nvme/pci: re-check security protocol support after reset
  block/sed-opal: Introduce free_opal_dev to free the structure and clean up state
  nvme: detect NVMe controller in recent MacBooks
  nvme-rdma: add support for host_traddr
  nvmet-rdma: Fix error handling
  nvmet-rdma: use nvme cm status helper
  nvme-rdma: move nvme cm status helper to .h file
  nvme-fc: don't bother to validate ioccsz and iorcsz
  nvme/pci: No special case for queue busy on IO
  nvme/core: Fix race kicking freed request_queue
  nvme/pci: Disable on removal when disconnected
  nvme: Enable autonomous power state transitions
  nvme: Add a quirk mechanism that uses identify_ctrl
  nvme: make nvmf_register_transport require a create_ctrl callback
  nvme: Use CNS as 8-bit field and avoid endianness conversion
  nvme: add semicolon in nvme_command setting
  ...
2017-02-24  dm-rq: don't dereference request payload after ending request  (Jens Axboe)
Bart reported a case where dm would crash with use-after-free poison. This is due to dm_softirq_done() accessing memory associated with a request after calling end_request on it. This is most visible on !blk-mq, since we free the memory immediately for that case. Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: eb8db831be80 ("dm: always defer request allocation to the owner of the request_queue") Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-25  openrisc: head: Init r0 to 0 on start  (Stafford Horne)
Originally openrisc spec 0 specified that r0 would be wired to ground. This is no longer the case. r0 is not guaranteed to be 0 at init, so we need to initialize it to 0 before using it. Also, if we are clearing r0 we can't use r0 to clear itself. Change the CLEAR_GPR macro to use movhi for clearing. Reported-by: Jakob Viketoft <jakob.viketoft@aacmicrotec.com> Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: Export ioremap symbols used by modules  (Stafford Horne)
Noticed this when building with allyesconfig. Got build failures due to iounmap and __ioremap symbols missing. This patch exports them so modules can use them. This is inline with other architectures. Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  arch/openrisc/lib/memcpy.c: use correct OR1200 option  (Valentin Rothberg)
The Kconfig option for the OR1200 is OR1K_1200. Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com> Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: head: Remove unused strings  (Stafford Horne)
These string definitions are no longer used, so remove them. Noticed this while working on a CONFIG_DEBUG_INFO build issue. Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: head: Move init strings to rodata section  (Stafford Horne)
The strings used during the head/init phase of openrisc bootup were stored in the executable section of the binary. This causes compilation to fail when using CONFIG_DEBUG_INFO with error: Error: unaligned opcodes detected in executable segment Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: entry: Fix delay slot detection  (Stafford Horne)
Use the exception SR stored in pt_regs for detection; the current SR is not correct, as the handler is running after the return from exception. Also, the code that checks for a delay slot uses a flag bitmask and then wants to check if the result is not zero. The test it implemented was wrong. Correct it by changing the test to check the result against non-zero. Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: entry: Whitespace and comment cleanups  (Stafford Horne)
Clean up whitespace and add some comments. Reading through the delay slot logic I noticed some things:

  - Delay slot instructions were not indented
  - Some comments are not lined up
  - Use tabs and spaces consistent with other code

No functional change. Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  scripts/checkstack.pl: Add openrisc support  (Stafford Horne)
The openrisc stack pointer is managed by decrementing r1. Add regexes to recognize this. Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  MAINTAINERS: Add the openrisc official repository  (Stafford Horne)
The official openrisc repository and patch work currently live on GitHub. Add the repo for reference. Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: Add .gitignore  (Stafford Horne)
This helps suppress the generated vmlinux.lds file in git status output. Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: Add optimized memcpy routine  (Stafford Horne)
The generic memcpy routine provided in the kernel does only byte copies. Using word copies we can lower boot time and the cycles spent in memcpy quite significantly. Booting on my de0 nano I see boot times go from 7.2 to 5.6 seconds. The avg cycles in memcpy during boot go from 6467 to 1887.

I tested several algorithms (see code in previous patch mails). The implementations I tested and avg cycles:

  - Word Copies + Loop Unrolls + Non Aligned   1882
  - Word Copies + Loop Unrolls                 1887
  - Word Copies                                2441
  - Byte Copies + Loop Unrolls                 6467
  - Byte Copies                                7600

In the end I ended up going with Word Copies + Loop Unrolls as it provides the best tradeoff between simplicity and boot speedups. Signed-off-by: Stafford Horne <shorne@gmail.com>
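Editor's note: a minimal, standalone C sketch of the word-copy + loop-unroll idea described above (not the actual arch/openrisc/lib implementation). It assumes both pointers are word-aligned and falls back to byte copies for the tail:

  #include <stddef.h>
  #include <stdint.h>

  void *memcpy_words(void *dest, const void *src, size_t n)
  {
      uint32_t *d = dest;
      const uint32_t *s = src;
      unsigned char *db;
      const unsigned char *sb;

      /* Unrolled word copies: 4 words (16 bytes) per iteration. */
      while (n >= 16) {
          d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
          d += 4;
          s += 4;
          n -= 16;
      }
      /* Remaining full words. */
      while (n >= 4) {
          *d++ = *s++;
          n -= 4;
      }
      /* Byte-copy whatever remains. */
      db = (unsigned char *)d;
      sb = (const unsigned char *)s;
      while (n--)
          *db++ = *sb++;
      return dest;
  }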
2017-02-25  openrisc: Add optimized memset  (Olof Kindgren)
This adds a hand-optimized assembler version of memset and sets __HAVE_ARCH_MEMSET to use this version instead of the generic C routine. Signed-off-by: Olof Kindgren <olof.kindgren@gmail.com> Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: Initial support for the idle state  (Sebastian Macke)
This patch adds basic support for the idle state of the cpu. The patch overrides the regular idle function, enables interrupts, checks for the power management unit and enables the cpu doze mode if available. Signed-off-by: Sebastian Macke <sebastian@macke.de> [shorne@gmail.com: Fixed checkpatch, blankline after declarations] Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: Fix the bitmask for the unit present register  (Sebastian Macke)
The bits were swapped: per the spec and the processor implementation, the power management present bit is 9 and the PIC present bit is 8. This patch brings the definitions into line with the spec. Signed-off-by: Sebastian Macke <sebastian@macke.de> [shorne@gmail.com: Added commit body] Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: remove unnecessary stddef.h include  (Stefan Kristiansson)
The include is not needed and causes the build to fail when building with the or1k-musl-linux- toolchain. Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: add futex_atomic_* implementations  (Stefan Kristiansson)
Add support for the futex_atomic_* operations using the load-link/store-conditional l.lwa/l.swa instructions. Most openrisc cores now provide these instructions; if they are not available, emulation is provided. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> [shorne@gmail.com: remove OPENRISC_HAVE_INST_LWA_SWA config suggested by Alan Cox https://lkml.org/lkml/2014/7/23/666] Signed-off-by: Stafford Horne <shorne@gmail.com>
2017-02-25  openrisc: add optimized atomic operations  (Stefan Kristiansson)
Use the l.lwa and l.swa atomic instruction pair. Most openrisc processor cores provide these instructions now. If the instructions are not available, emulation is provided. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> [shorne@gmail.com: remove OPENRISC_HAVE_INST_LWA_SWA config suggested by Alan Cox https://lkml.org/lkml/2014/7/23/666] [shorne@gmail.com: expand to implement all ops suggested by Peter Zijlstra https://lkml.org/lkml/2017/2/20/317] Signed-off-by: Stafford Horne <shorne@gmail.com>
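Editor's note: for flavour, a hedged sketch of the l.lwa/l.swa retry loop the commit refers to, written from memory as C with inline assembly. Operand order, flag behaviour and delay-slot handling should be checked against the architecture manual and the real arch/openrisc headers before relying on it:

  /* ll/sc-style atomic add: l.lwa loads and links, l.swa stores only if the
   * link is still intact and sets SR[F] on success, so we loop with l.bnf
   * (branch if flag not set) until the store sticks.  Illustrative only. */
  static inline void atomic_add_sketch(int i, int *counter)
  {
      int tmp;

      __asm__ __volatile__(
          "1: l.lwa   %0, 0(%1)   \n"
          "   l.add   %0, %0, %2  \n"
          "   l.swa   0(%1), %0   \n"
          "   l.bnf   1b          \n"
          "    l.nop              \n"
          : "=&r" (tmp)
          : "r" (counter), "r" (i)
          : "cc", "memory");
  }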