Age | Commit message | Author
2015-12-12drivers/base/memory.c: prohibit offlining of memory blocks with missing sectionsSeth Jennings
Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems") and 982792c782ef ("x86, mm: probe memory block size for generic x86 64bit") introduced large block sizes for x86. This made it possible to have multiple sections per memory block where previously there was only ever one section per block. Since blocks consist of contiguous ranges of sections, there can be holes in the blocks where sections are not present. If one attempts to offline such a block, a crash occurs since the code is not designed to deal with this.

This patch is a quick fix to guard against the crash by not allowing blocks with non-present sections to be offlined.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=107781

Signed-off-by: Seth Jennings <sjennings@variantweb.net> Reported-by: Andrew Banman <abanman@sgi.com> Cc: Daniel J Blueman <daniel@numascale.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Russ Anderson <rja@sgi.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
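A minimal sketch of the kind of guard described, assuming the existing memory-hotplug helpers sections_per_block and present_section_nr(); where exactly the check hooks into the offline path is simplified here:

    /* Refuse to offline a memory block unless every section in it is present. */
    static bool memory_block_sections_present(unsigned long start_section_nr)
    {
        unsigned long nr;

        for (nr = start_section_nr; nr < start_section_nr + sections_per_block; nr++)
            if (!present_section_nr(nr))
                return false;   /* hole in the block: do not offline */

        return true;
    }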
2015-12-12tmpfs: fix shmem_evict_inode() warnings on i_blocksHugh Dickins
Dmitry Vyukov provides a little program, autogenerated by syzkaller, which races a fault on a mapping of a sparse memfd object against truncation of that object below the fault address: run repeatedly for a few minutes, it reliably generates shmem_evict_inode()'s WARN_ON(inode->i_blocks).

(But there's nothing specific to memfd here, nor to the fstat which it happened to use to generate the fault: though that looked suspicious, since a shmem_recalc_inode() had been added there recently. The same problem can be reproduced with open+unlink in place of memfd_create, and with fstatfs in place of fstat.)

v3.7 commit 0f3c42f522dc ("tmpfs: change final i_blocks BUG to WARNING") explains one cause of such a warning (a race with shmem_writepage to swap), and possible solutions; but we never took it further, and this syzkaller incident turns out to have a different cause.

shmem_getpage_gfp()'s error recovery, when a freshly allocated page is then found to be beyond eof, looks plausible - decrementing the alloced count that was just before incremented - but in fact can go wrong, if a racing thread (the truncator, for example) gets its shmem_recalc_inode() in just after our delete_from_page_cache(). delete_from_page_cache() decrements nrpages, so that shmem_recalc_inode() will balance the books by decrementing alloced itself; then our decrement of alloced takes it one too low, leading to the WARNING when the object is finally evicted.

Once the new page has been exposed in the page cache, shmem_getpage_gfp() must leave it to shmem_recalc_inode() itself to get the accounting right in all cases (and not fall through from "trunc:" to "decused:"). Adjust that error recovery block; and the reinitialization of info and sbinfo can be removed too.

While we're here, fix shmem_writepage() to avoid the original issue: it will be safe against a racing shmem_recalc_inode(), if it merely increments swapped before the shmem_delete_from_page_cache() which decrements nrpages (but it must then do its own shmem_recalc_inode() before that, while still in balance, instead of after). (Aside: why do we shmem_recalc_inode() here in the swap path? Because its raison d'etre is to cope with clean sparse shmem pages being reclaimed behind our back: so here when swapping is a good place to look for that case.)

But I've not now managed to reproduce this bug, even without the patch. I don't see why I didn't do that earlier: perhaps inhibited by the preference to eliminate shmem_recalc_inode() altogether. Driven by this incident, I do now have a patch to do so at last; but still want to sit on it for a bit, there's a couple of questions yet to be resolved.

Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
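For reference, a simplified sketch of the bookkeeping that shmem_recalc_inode() maintains and that the fix relies on being the single place to rebalance the counts (helper names as in mm/shmem.c of that era; locking and the sbinfo counter trimmed):

    static void shmem_recalc_inode_sketch(struct inode *inode)
    {
        struct shmem_inode_info *info = SHMEM_I(inode);
        long freed;

        /* alloced minus swapped should match what is still in the page cache;
         * any surplus is clean memory that was reclaimed behind our back. */
        freed = info->alloced - info->swapped - inode->i_mapping->nrpages;
        if (freed > 0) {
            info->alloced -= freed;
            inode->i_blocks -= freed * BLOCKS_PER_PAGE;
            shmem_unacct_blocks(info->flags, freed);
        }
    }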
2015-12-12mm/hugetlb.c: fix resv map memory leak for placeholder entriesMike Kravetz
Dmitry Vyukov reported the following memory leak:

unreferenced object 0xffff88002eaafd88 (size 32):
  comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
  hex dump (first 32 bytes):
    28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff  (.Nc....(.Nc....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    kmalloc include/linux/slab.h:458
    region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
    __vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
    vma_needs_reservation mm/hugetlb.c:1813
    alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
    hugetlb_no_page mm/hugetlb.c:3543
    hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
    follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
    __get_user_pages+0x542/0xf30 mm/gup.c:497
    populate_vma_page_range+0xde/0x110 mm/gup.c:919
    __mm_populate+0x1c7/0x310 mm/gup.c:969
    do_mlock+0x291/0x360 mm/mlock.c:637
    SYSC_mlock2 mm/mlock.c:658
    SyS_mlock2+0x4b/0x70 mm/mlock.c:648

Dmitry identified a potential memory leak in the routine region_chg, where a region descriptor is not free'ed on an error path.

However, the root cause for the above memory leak resides in region_del. In this specific case, a "placeholder" entry is created in region_chg. The associated page allocation fails, and the placeholder entry is left in the reserve map. This is "by design", as the entry should be deleted when the map is released.

The bug is in the region_del routine, which is used to delete entries within a specific range (and when the map is released). region_del did not handle the case where a placeholder entry exactly matched the start of the range to be deleted. In this case, the entry would not be deleted and would be leaked. The fix is to take these special placeholder entries into account in region_del.

The region_chg error path leak is also fixed.

Fixes: feba16e25a57 ("mm/hugetlb: add region_del() to delete a specific range of entries") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
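A sketch of the extra case region_del() has to handle, as described above; the loop is simplified and only the placeholder branch is shown (a placeholder entry has from == to):

    list_for_each_entry_safe(rg, trg, head, link) {
        /* A placeholder left by region_chg() sits at [f, f); without this
         * check the "entirely before the range" test below skips it and
         * the entry is leaked when the map is released. */
        if (rg->from == rg->to && rg->from == f) {
            list_del(&rg->link);
            kfree(rg);
            continue;
        }
        if (rg->to <= f)        /* entirely before the deleted range */
            continue;
        if (rg->from >= t)      /* entirely after the deleted range */
            break;
        /* ... truncate or split real entries overlapping [f, t) ... */
    }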
2015-12-12mm: hugetlb: call huge_pte_alloc() only if ptep is nullNaoya Horiguchi
Currently at the beginning of hugetlb_fault(), we call huge_pte_offset() and check whether the obtained *ptep is a migration/hwpoison entry or not. And if not, then we get to call huge_pte_alloc(). This is racy because the *ptep could turn into a migration/hwpoison entry after the huge_pte_offset() check. This race results in a BUG_ON in huge_pte_alloc().

We don't have to call huge_pte_alloc() when huge_pte_offset() returns non-NULL, so let's fix this bug by moving the code into the else block. Note that the *ptep could turn into a migration/hwpoison entry after this block, but that's not a problem because we have another !pte_present check later (we never go into hugetlb_no_page() in that case.)

Fixes: 290408d4a250 ("hugetlb: hugepage migration core") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
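The reordered logic reads roughly as in the fragment below (a sketch of the top of hugetlb_fault(); declarations and error paths trimmed). huge_pte_alloc() is only reached when no pte exists yet, so it can no longer trip over a migration/hwpoison entry:

    ptep = huge_pte_offset(mm, address);
    if (ptep) {
        entry = huge_ptep_get(ptep);
        if (unlikely(is_hugetlb_entry_migration(entry))) {
            migration_entry_wait_huge(vma, mm, ptep);
            return 0;
        } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
            return VM_FAULT_HWPOISON_LARGE |
                   VM_FAULT_SET_HINDEX(hstate_index(h));
    } else {
        ptep = huge_pte_alloc(mm, address, huge_page_size(h));
        if (!ptep)
            return VM_FAULT_OOM;
    }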
2015-12-12kernel: remove stop_machine() Kconfig dependencyChris Wilson
Currently the full stop_machine() routine is only enabled on SMP if module unloading is enabled, or if the CPUs are hotpluggable. This leads to configurations where stop_machine() is broken as it will then only run the callback on the local CPU with irqs disabled, and not stop the other CPUs or run the callback on them. For example, this breaks MTRR setup on x86 in certain configs since ea8596bb2d8d379 ("kprobes/x86: Remove unused text_poke_smp() and text_poke_smp_batch() functions") as the MTRR is only established on the boot CPU. This patch removes the Kconfig option for STOP_MACHINE and uses the SMP and HOTPLUG_CPU config options to compile the correct stop_machine() for the architecture, removing the false dependency on MODULE_UNLOAD in the process. Link: https://lkml.org/lkml/2014/10/8/124 References: https://bugs.freedesktop.org/show_bug.cgi?id=84794 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Pranith Kumar <bobby.prani@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Tejun Heo <tj@kernel.org> Cc: Iulia Manda <iulia.manda21@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Chuck Ebbert <cebbert.lkml@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12mm: kmemleak: mark kmemleak_init prototype as __initNicolas Iooss
The kmemleak_init() definition in mm/kmemleak.c is marked __init but its prototype in include/linux/kmemleak.h is marked __ref since commit a6186d89c913 ("kmemleak: Mark the early log buffer as __initdata"). This causes a section mismatch which is reported as a warning when building with clang -Wsection, because kmemleak_init() is declared in section .ref.text but defined in .init.text. Fix this by marking kmemleak_init() prototype __init. Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12mm: fix kerneldoc on mem_cgroup_replace_pageHugh Dickins
Whoops, I missed removing the kerneldoc comment of the lrucare arg removed from mem_cgroup_replace_page; but it's a good comment, keep it. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12osd fs: __r4w_get_page rely on PageUptodate for uptodateHugh Dickins
Commit 42cb14b110a5 ("mm: migrate dirty page without clear_page_dirty_for_io etc") simplified the migration of a PageDirty pagecache page: one stat needs moving from zone to zone and that's about all. It's convenient and safest for it to shift the PageDirty bit from old page to new, just before updating the zone stats: before copying data and marking the new PageUptodate. This is all done while both pages are isolated and locked, just as before; and just as before, there's a moment when the new page is visible in the radix_tree, but not yet PageUptodate. What's new is that it may now be briefly visible as PageDirty before it is PageUptodate. When I scoured the tree to see if this could cause a problem anywhere, the only places I found were in two similar functions __r4w_get_page(): which look up a page with find_get_page() (not using page lock), then claim it's uptodate if it's PageDirty or PageWriteback or PageUptodate. I'm not sure whether that was right before, but now it might be wrong (on rare occasions): only claim the page is uptodate if PageUptodate. Or perhaps the page in question could never be migratable anyway? Signed-off-by: Hugh Dickins <hughd@google.com> Tested-by: Boaz Harrosh <ooo@electrozaur.com> Cc: Benny Halevy <bhalevy@panasas.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12MAINTAINERS: make Vladimir co-maintainer of the memory controllerJohannes Weiner
Vladimir architected and authored much of the current state of the memcg's slab memory accounting and tracking. Make sure he gets CC'd on bug reports ;-) Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progressMichal Hocko
Tetsuo Handa has reported that the system might basically livelock in OOM condition without triggering the OOM killer.

The issue is caused by an internal dependency of the direct reclaim on vmstat counter updates (via zone_reclaimable) which are performed from the workqueue context. If all the current workers get assigned to an allocation request, though, they will be looping inside the allocator trying to reclaim memory, but zone_reclaimable can see stalled numbers so it will consider a zone reclaimable even though it has been scanned way too much. WQ concurrency logic will not consider this situation a congested workqueue, because it relies on the assumption that a worker would have to sleep in such a situation. This also means that it doesn't try to spawn new workers or invoke the rescuer thread if one is assigned to the queue.

In order to fix this issue we need to do two things. First we have to let the WQ concurrency code know that we are in trouble, so we have to do a short sleep. In order to prevent the issues handled by 0e093d99763e ("writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone") we limit the sleep to worker threads, which are the ones of interest anyway.

The second thing to do is to create a dedicated workqueue for vmstat and mark it WQ_MEM_RECLAIM to note that it participates in the reclaim and to have a spare worker thread for it.

Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Tejun Heo <tj@kernel.org> Cc: Cristopher Lameter <clameter@sgi.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Arkadiusz Miskiewicz <arekm@maven.pl> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
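Two illustrative fragments for the fix described; placement is simplified (the first piece sits where the reclaim path would otherwise just cond_resched()):

    /* 1) Let the WQ concurrency logic notice a blocked worker instead of a
     *    worker spinning invisibly in the allocator. */
    if (current->flags & PF_WQ_WORKER)
        schedule_timeout_uninterruptible(1);
    else
        cond_resched();

    /* 2) Dedicated, reclaim-capable workqueue so vmstat updates always have
     *    a worker (rescuer) available. */
    static struct workqueue_struct *vmstat_wq;

    vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);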
2015-12-12mm: fix swapped Movable and Reclaimable in /proc/pagetypeinfoVlastimil Babka
Commit 016c13daa5c9 ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE in the enum definition. However, migratetype_names wasn't updated to reflect that. As a result, the file /proc/pagetypeinfo shows the counts for Movable as Reclaimable and vice versa.

Additionally, commit 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") introduced MIGRATE_HIGHATOMIC, but did not add a letter for it to show_migration_types(), so it doesn't appear in the listing of free areas during page alloc failures or oom kills.

This patch fixes both problems. The atomic reserves will show with a letter 'H' in the free areas listings.

Fixes: 016c13daa5c9 ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types") Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12memcg: fix memory.high targetVladimir Davydov
When the memory.high threshold is exceeded, try_charge() schedules a task_work to reclaim the excess. The reclaim target is set to the number of pages requested by try_charge(). This is wrong, because try_charge() usually charges more pages than requested (batch > nr_pages) in order to refill per cpu stocks. As a result, a process in a cgroup can easily exceed memory.high significantly when doing a lot of charges w/o returning to userspace (e.g. reading a file in big chunks). Fix this issue by assuring that when exceeding memory.high a process reclaims as many pages as were actually charged (i.e. batch). Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
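A sketch of the adjusted bookkeeping at the end of try_charge(), with the one-word change marked; field names follow the memory.high machinery of that time and the surrounding hierarchy walk is omitted:

    /* The charge actually taken from the counter is 'batch' (>= nr_pages),
     * so that is what the deferred task_work reclaim must undo. */
    if (page_counter_read(&memcg->memory) > memcg->high) {
        current->memcg_nr_pages_over_high += batch;   /* was: nr_pages */
        set_notify_resume(current);
    }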
2015-12-12mm: hugetlb: fix hugepage memory leak caused by wrong reserve countNaoya Horiguchi
When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on alloc_buddy_huge_page() to directly create a hugepage from the buddy allocator. In that case, however, if alloc_buddy_huge_page() succeeds we don't decrement h->resv_huge_pages, which means that a successful hugetlb_fault() returns without releasing the reserve count. As a result, a subsequent hugetlb_fault() might fail even though there are still free hugepages.

This patch simply adds the decrementing code on that code path.

I reproduced this problem when testing a v4.3 kernel in the following situation:
- the test machine/VM is a NUMA system,
- hugepage overcommitting is enabled,
- most of the hugepages are allocated and there's only one free hugepage which is on node 0 (for example),
- another program, which calls set_mempolicy(MPOL_BIND) to bind itself to node 1, tries to allocate a hugepage,
- the allocation should fail, but the reserve count is still held.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> [3.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
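A sketch of the fallback path in alloc_huge_page() with the added decrement; the function names follow the commit message and the predicate used to tell whether a reserve was consumed is purely illustrative:

    page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, chg);
    if (!page) {
        page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
        if (!page)
            goto out;
        /* The fix: this fault consumed a reservation even though the page
         * came from the buddy allocator, so release the reserve count. */
        if (consumed_reserve)             /* illustrative condition */
            h->resv_huge_pages--;
    }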
2015-12-12i2c: designware: Keep pm_runtime_enable/_disable calls in syncJarkko Nikula
On a hardware-shared I2C bus (certain Intel Baytrail SoC platforms) the runtime PM disable depth keeps increasing over repeated modprobe/rmmod cycles, because pm_runtime_disable() is called without checking whether runtime PM should already be disabled because of bus sharing. This hasn't caused any harm other than dev->power.disable_depth increasing, but keep it in sync by calling pm_runtime_disable() only when runtime PM is not disabled.

Reported-by: Wolfram Sang <wsa@the-dreams.de> Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2015-12-12i2c: designware: fix IO timeout issue for AMD controllerXiangliang Yu
Because of a hardware limitation, the AMD I2C controller can't trigger a pending interrupt if the interrupt status has been changed after clearing the interrupt status bits. The controller then loses the interrupt and the IO times out.

According to the hardware design, this patch implements a workaround to disable the i2c controller interrupt at ISR entry and re-enable it before exiting the ISR. To reduce the performance impact on other vendors, use unlikely() to check the flag in the ISR.

Signed-off-by: Xiangliang Yu <Xiangliang.Yu@amd.com> Acked-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
2015-12-12parisc: Disable huge pages on Mako machinesHelge Deller
Mako-based machines (PA8800 and PA8900 CPUs) don't allow aliasing on non-equivalent addresses. Signed-off-by: Helge Deller <deller@gmx.de>
2015-12-12parisc: Wire up mlock2 syscallHelge Deller
Signed-off-by: Helge Deller <deller@gmx.de>
2015-12-12parisc: Remove unused pcibios_init_bus()Bjorn Helgaas
There are no callers of pcibios_init_bus(), so remove it. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Helge Deller <deller@gmx.de>
2015-12-12parisc iommu: fix panic due to trying to allocate too large regionMikulas Patocka
When using the Promise TX2+ SATA controller on PA-RISC, the system often crashes with a kernel panic; for example just writing data with the dd utility will make it crash.

Kernel panic - not syncing: drivers/parisc/sba_iommu.c: I/O MMU @ 000000000000a000 is out of mapping resources
CPU: 0 PID: 18442 Comm: mkspadfs Not tainted 4.4.0-rc2 #2
Backtrace:
 [<000000004021497c>] show_stack+0x14/0x20
 [<0000000040410bf0>] dump_stack+0x88/0x100
 [<000000004023978c>] panic+0x124/0x360
 [<0000000040452c18>] sba_alloc_range+0x698/0x6a0
 [<0000000040453150>] sba_map_sg+0x260/0x5b8
 [<000000000c18dbb4>] ata_qc_issue+0x264/0x4a8 [libata]
 [<000000000c19535c>] ata_scsi_translate+0xe4/0x220 [libata]
 [<000000000c19a93c>] ata_scsi_queuecmd+0xbc/0x320 [libata]
 [<0000000040499bbc>] scsi_dispatch_cmd+0xfc/0x130
 [<000000004049da34>] scsi_request_fn+0x6e4/0x970
 [<00000000403e95a8>] __blk_run_queue+0x40/0x60
 [<00000000403e9d8c>] blk_run_queue+0x3c/0x68
 [<000000004049a534>] scsi_run_queue+0x2a4/0x360
 [<000000004049be68>] scsi_end_request+0x1a8/0x238
 [<000000004049de84>] scsi_io_completion+0xfc/0x688
 [<0000000040493c74>] scsi_finish_command+0x17c/0x1d0

The cause of the crash is not exhaustion of the IOMMU space; there are plenty of free pages. The function sba_alloc_range is called with size 0x11000, thus the pages_needed variable is 0x11. The function sba_search_bitmap is called with bits_wanted 0x11 and a boundary size of 0x10 (because dma_get_seg_boundary(dev) returns 0xffff). The function sba_search_bitmap attempts to allocate 17 pages that must not cross a 16-page boundary - it can't satisfy this requirement (iommu_is_span_boundary always returns true) and fails even if there are many free entries in the IOMMU space.

How did it happen that we try to allocate 17 pages that don't cross a 16-page boundary? The cause is in the function iommu_coalesce_chunks. This function tries to coalesce adjacent entries in the scatterlist. The function does several checks if it may coalesce one entry with the next; one of those checks is this:

	if (startsg->length + dma_len > max_seg_size)
		break;

When it finishes coalescing adjacent entries, it allocates the mapping:

	sg_dma_len(contig_sg) = dma_len;
	dma_len = ALIGN(dma_len + dma_offset, IOVP_SIZE);
	sg_dma_address(contig_sg) =
		PIDE_FLAG
		| (iommu_alloc_range(ioc, dev, dma_len) << IOVP_SHIFT)
		| dma_offset;

It is possible that (startsg->length + dma_len > max_seg_size) is false (we are just near the 0x10000 max_seg_size boundary), so the function decides to coalesce this entry with the next entry. When the coalescing succeeds, the function performs

	dma_len = ALIGN(dma_len + dma_offset, IOVP_SIZE);

And now, because of the non-zero dma_offset, dma_len is greater than 0x10000. iommu_alloc_range (a pointer to sba_alloc_range) is called and it attempts to allocate 17 pages for a device that must not cross a 16-page boundary.

To fix the bug, we must make sure that dma_len, after the addition of dma_offset and the alignment, doesn't cross the segment boundary. I.e. change

	if (startsg->length + dma_len > max_seg_size)
		break;

to

	if (ALIGN(dma_len + dma_offset + startsg->length, IOVP_SIZE) > max_seg_size)
		break;

This patch makes this change (it precalculates max_seg_boundary at the beginning of the function iommu_coalesce_chunks). I also added a check that the mapping length doesn't exceed dma_get_seg_boundary(dev) (it is not needed for the Promise TX2+ SATA, but it may be needed for other devices that have dma_get_seg_boundary lower than dma_get_max_seg_size).
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Helge Deller <deller@gmx.de>
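A sketch of the precalculation mentioned above, as it would sit at the top of iommu_coalesce_chunks(); based on the description in the commit message, details simplified:

    unsigned int max_seg_size = min(dma_get_max_seg_size(dev),
                                    (unsigned)DMA_CHUNK_SIZE);
    unsigned int max_seg_boundary = dma_get_seg_boundary(dev) + 1;

    if (max_seg_boundary)   /* check for overflow of the +1 */
        max_seg_size = min(max_seg_size, max_seg_boundary);
    ...
        /* never let the aligned mapping length cross the segment boundary */
        if (ALIGN(dma_len + dma_offset + startsg->length, IOVP_SIZE) > max_seg_size)
            break;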
2015-12-12ARC: intc: Document arc_request_percpu_irq() betterVineet Gupta
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2015-12-12ARCv2: perf: Ensure perf intr gets enabled on all coresVineet Gupta
This was the second perf intr issue.

Perf sampling on multicore requires the interrupt to be enabled on all cores. The ARC perf probe code used the helper arc_request_percpu_irq(), which calls:
- request_percpu_irq() on core0
- enable_percpu_irq() on all cores (including core0)

genirq requires that the request be made ahead of the enable call. However if the perf probe happened on a core other than core0 (observed on a 3.18 kernel), enable would get called ahead of request, failing obviously and rendering the perf interrupt disabled on all such cores.

[   11.120000] 1 ARC perf     : 8 counters (48 bits), 113 conditions, [overflow IRQ support]
[   11.130000] 1 -----> enable_percpu_irq() IRQ 20 failed
[   11.140000] 3 -----> enable_percpu_irq() IRQ 20 failed
[   11.140000] 2 -----> enable_percpu_irq() IRQ 20 failed
[   11.140000] 0 =====> request_percpu_irq() IRQ 20
[   11.140000] 0 -----> enable_percpu_irq() IRQ 20

Fix this fragility by calling request_percpu_irq() on whatever core calls probe (there is no requirement on which core calls this anyway) and then calling enable on each core.

Interestingly, this started as an investigation of STAR 9000838902: "sporadically IRQs enabled on perf prob", which was about occasional boot spew as request_percpu_irq got called non-locally (from an IPI) and re-enabled interrupts in the path proc_mkdir -> spin_unlock_irq(), which the irq work code didn't like.

| ARC perf     : 8 counters (48 bits), 113 conditions, [overflow IRQ support]
|
| BUG: failure at ../kernel/irq_work.c:135/irq_work_run_list()!
| CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.10-01127-g285efb8e66d1 #2
|
| Stack Trace:
|  arc_unwind_core.constprop.1+0x94/0x104
|  dump_stack+0x62/0x98
|  irq_work_run_list+0xb0/0xb4
|  irq_work_run+0x22/0x3c
|  do_IPI+0x74/0x9c
|  handle_irq_event_percpu+0x34/0x164
|  handle_percpu_irq+0x58/0x78
|  generic_handle_irq+0x1e/0x2c
|  arch_do_IRQ+0x3c/0x60
|  ret_from_exception+0x0/0x8

Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-snps-arc@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: Alexey Brodkin <abrodkin@synopsys.com> Cc: <stable@vger.kernel.org> #4.2+ Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
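A sketch of the reworked flow: request once from whichever core runs the probe, then enable on every core via a cross-call (the per-cpu cookie and the callback name are illustrative):

    static void arc_cpu_pmu_irq_init(void *data)
    {
        int irq = *(int *)data;

        enable_percpu_irq(irq, IRQ_TYPE_NONE);
    }

    /* in the probe path, on whichever CPU it happens to run */
    ret = request_percpu_irq(irq, arc_pmu_intr, "ARC perf counters",
                             this_cpu_ptr(&arc_pmu_cpu));
    if (!ret)
        on_each_cpu(arc_cpu_pmu_irq_init, &irq, 1);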
2015-12-12ARC: intc: No need to clear IRQ_NOAUTOENVineet Gupta
arc_request_percpu_irq() is called by all cores to request/enable the percpu irq. It has some "prep" calls needed by genirq:
- setup percpu devid
- disable IRQ_NOAUTOEN

However, given that enable_percpu_irq() is called anyway, the latter can be avoided.

We are now left with the irq_set_percpu_devid() quirk, and that too for ARCompact builds only, since the previous patch updated the ARCv2 intc to do this in the "right" place, i.e. the irq map function. By next release, this will ultimately be fixed for ARCompact as well.

Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexey Brodkin <abrodkin@synopsys.com> Cc: linux-snps-arc@lists.infradead.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2015-12-12ARCv2: intc: Fix random perf irq disabling in SMP setupVineet Gupta
As part of fixing another perf issue, it was observed that after a perf run the interrupt got disabled on one or more cores. It turns out that despite requesting the perf irq as percpu, the flow handler registered was not handle_percpu_irq().

Given that on ARCv2 cores IRQs < 24 are always private to the cpu, we register the right handler at the very onset.

Before Fix

| [ARCLinux]# cat /proc/interrupts | grep perf
| 20:       0       0       0       0    ARCv2 core Intc  20 ARC perf counters
|
| [ARCLinux]# perf record -c 20000 /sbin/hackbench
| Running with 10*40 (== 400) tasks.
|
| [ARCLinux]# cat /proc/interrupts | grep perf
| 20:       0     522       8   51916    ARCv2 core Intc  20 ARC perf counters
|
| [ARCLinux]# perf record -c 20000 /sbin/hackbench
| Running with 10*40 (== 400) tasks.
|
| [ARCLinux]# cat /proc/interrupts | grep perf
| 20:       0     522       8  104368    ARCv2 core Intc  20 ARC perf counters

After Fix

| [ARCLinux]# cat /proc/interrupts | grep perf
| 20:       0       0       0       0    ARCv2 core Intc  20 ARC perf counters
|
| [ARCLinux]# perf record -c 20000 /sbin/hackbench
| Running with 10*40 (== 400) tasks.
|
| [ARCLinux]# cat /proc/interrupts | grep perf
| 20:   64198   62012   62697   67803    ARCv2 core Intc  20 ARC perf counters
|
| [ARCLinux]# perf record -c 20000 /sbin/hackbench
| Running with 10*40 (== 400) tasks.
|
| [ARCLinux]# cat /proc/interrupts | grep perf
| 20:  126014  122792  123301  133654    ARCv2 core Intc  20 ARC perf counters

Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexey Brodkin <abrodkin@synopsys.com> Cc: stable@vger.kernel.org #4.2+ Cc: linux-kernel@vger.kernel.org Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
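A sketch of registering the right flow handler at map time, following the description that IRQs below 24 are always core-private on ARCv2 (chip and function names as used by the ARCv2 intc driver, simplified):

    static int arcv2_irq_map(struct irq_domain *d, unsigned int irq,
                             irq_hw_number_t hw)
    {
        if (hw < 24) {
            /* core-private: per-cpu devid + per-cpu flow handler */
            irq_set_percpu_devid(irq);
            irq_set_chip_and_handler(irq, &arcv2_irq_chip,
                                     handle_percpu_irq);
        } else {
            irq_set_chip_and_handler(irq, &arcv2_irq_chip,
                                     handle_level_irq);
        }

        return 0;
    }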
2015-12-12Merge branch 'mpls-fixes'David S. Miller
Robert Shearman says: ==================== mpls: fixes for nexthops without via addresses These four fixes all apply to the case of having an mpls route with an output device, but without a nexthop. Patches 2 and 3 could really have been combined in one patch, but I wanted to separate the fix for some recent breakage from the fix for a day-1 issue. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-12mpls: make via address optional for multipath routesRobert Shearman
The via address is optional for a single path route, yet is mandatory when the multipath attribute is used:

  # ip -f mpls route add 100 dev lo
  # ip -f mpls route add 101 nexthop dev lo
  RTNETLINK answers: Invalid argument

Make them consistent by making the via address optional when the RTA_MULTIPATH attribute is being parsed so that both forms of specifying the route work.

Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-12mpls: fix out-of-bounds access when via address not specifiedRobert Shearman
When a via address isn't specified, the via table is left initialised to 0 (NEIGH_ARP_TABLE), and the via address length also left initialised to 0. This results in a via address array of length 0 being allocated (contiguous with route and nexthop array), meaning that when a packet is sent using neigh_xmit the neighbour lookup and creation will cause an out-of-bounds access when accessing the 4 bytes of the IPv4 address it assumes it has been given a pointer to. This could be fixed by allocating the 4 bytes of via address necessary and leaving it as all zeroes. However, it seems wrong to me to use an ipv4 nexthop (including possibly ARPing for 0.0.0.0) when the user didn't specify to do so. Instead, set the via address table to NEIGH_NR_TABLES to signify it hasn't been specified and use this at forwarding time to signify a neigh_xmit using an L2 address consisting of the device address. This mechanism is the same as that used for both ARP and ND for loopback interfaces and those flagged as no-arp, which are all we can really support in this case. Fixes: cf4b24f0024f ("mpls: reduce memory usage of routes") Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-12mpls: don't dump RTA_VIA attribute if not specifiedRobert Shearman
The problem seen is that when adding a route with a nexthop with no via address specified, iproute2 generates bogus output:

  # ip -f mpls route add 100 dev lo
  # ip -f mpls route list
  100 via inet 0.0.8.0 dev lo

The reason for this is that the kernel generates an RTA_VIA attribute with the family set to AF_INET, but with the via address data having zero length. The cause of the family being AF_INET is that on route insert cfg->rc_via_table is left set to 0, which just happens to be NEIGH_ARP_TABLE, which is then translated into AF_INET. iproute2 doesn't validate the length prior to printing and so prints garbage. Although it could be fixed to do the validation, I would argue that AF_INET addresses should always be exactly 4 bytes, so the kernel is really giving userspace bogus data.

Therefore, avoid generating the RTA_VIA attribute when dumping the route if the via address wasn't specified on add/modify. This is indicated by NEIGH_ARP_TABLE and a zero via address length - if the user specified a via address the address length would have been validated such that it was 4 bytes.

Although this is a change in behaviour that is visible to userspace, I believe that what was generated before was invalid and as such userspace wouldn't be expecting it.

Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-12mpls: validate L2 via address lengthRobert Shearman
If an L2 via address for an mpls nexthop is specified, the length of the L2 address must match that expected by the output device, otherwise it could access memory beyond the end of the via address buffer in the route. This check was present prior to commit f8efb73c97e2 ("mpls: multipath route support"), but got lost in the refactoring, so add it back, applying it to all nexthops in multipath routes. Fixes: f8efb73c97e2 ("mpls: multipath route support") Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
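A sketch of the restored check, assuming the nexthop fields used elsewhere in this series (nh_via_table, nh_via_alen) and the NEIGH_LINK_TABLE marker for L2 via addresses:

    /* An L2 via address must be exactly as long as the output device's
     * link-layer address, otherwise forwarding would read past the buffer. */
    if (nh->nh_via_table == NEIGH_LINK_TABLE &&
        nh->nh_via_alen != dev->addr_len)
        return -EINVAL;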
2015-12-12sfc: only use RSS filters if we're using RSSBert Kenward
Without this, filter insertion on a VF would fail if only one channel was in use. This would include the unicast station filter and therefore no traffic would be received. Signed-off-by: Bert Kenward <bkenward@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-11uapi: export ila.hstephen hemminger
The file ila.h used for lightweight tunnels is being used by iproute2 but is not exported yet. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-11Merge branch 'bnxt_en-fixes'David S. Miller
Michael Chan says: ==================== bnxt_en: Bug fix and add tx timeout recovery. Fix a bitmap declaration bug and add missing tx timeout recovery. v2: Fixed white space error. Thanks Dave. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-11bnxt_en: Implement missing tx timeout reset logic.Michael Chan
The reset logic calls bnxt_close_nic() and bnxt_open_nic() under rtnl_lock from bnxt_sp_task. BNXT_STATE_IN_SP_TASK must be cleared before calling bnxt_close_nic() to avoid deadlock. v2: Fixed white space error. Thanks Dave. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-11bnxt_en: Don't cancel sp_task from bnxt_close_nic().Michael Chan
When implementing driver reset from tx_timeout in the next patch, bnxt_close_nic() will be called from the sp_task workqueue. Calling cancel_work() on sp_task will hang the workqueue. Instead, set a new bit BNXT_STATE_IN_SP_TASK when bnxt_sp_task() is running. bnxt_close_nic() will wait for BNXT_STATE_IN_SP_TASK to clear before proceeding. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
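A sketch of the coordination described, using the bit named in the message; the struct layout, memory barriers, and locking are simplified:

    static void bnxt_sp_task(struct work_struct *work)
    {
        struct bnxt *bp = container_of(work, struct bnxt, sp_task);

        set_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
        smp_mb__after_atomic();
        /* ... handle deferred slow-path events ... */
        smp_mb__before_atomic();
        clear_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
    }

    /* in bnxt_close_nic(): wait for a running sp_task instead of cancelling it */
    while (test_bit(BNXT_STATE_IN_SP_TASK, &bp->state))
        msleep(20);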
2015-12-11bnxt_en: Change bp->state to bitmap.Michael Chan
This allows multiple independent bits to be set for various states. Subsequent patches to implement tx timeout reset will require this. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-11bnxt_en: Fix bitmap declaration to work on 32-bit arches.Michael Chan
The declaration of the bitmap vf_req_snif_bmap using fixed array of unsigned long will only work on 64-bit archs. Use DECLARE_BITMAP instead which will work on all archs. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
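A sketch of the portable form; the field name comes from the commit message, while the container struct and the bitmap size here are illustrative:

    #include <linux/bitmap.h>

    #define MAX_VF_FWD_REQS    256    /* illustrative size */

    struct bnxt_pf_info_sketch {
        /* DECLARE_BITMAP sizes the unsigned long array correctly on
         * both 32-bit and 64-bit architectures */
        DECLARE_BITMAP(vf_req_snif_bmap, MAX_VF_FWD_REQS);
    };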
2015-12-11openvswitch: Respect conntrack zone even if invalidJoe Stringer
If userspace executes ct(zone=1), and the connection tracker determines that the packet is invalid, then the ct_zone flow key field is populated with the default zone rather than the zone that was specified. Even though connection tracking failed, this field should be updated with the value that the action specified. Fix the issue. Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action") Signed-off-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-11openvswitch: Fix helper reference leakJoe Stringer
If the actions (re)allocation fails, or the actions list is larger than the maximum size, and the conntrack action is the last action when these problems are hit, then references to helper modules may be leaked. Fix the issue. Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action") Signed-off-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-11phy: micrel: Fix finding PHY properties in MAC node.Andrew Lunn
commit 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the bus' parent.") changed the parenting of PHY devices, making them a child of the MDIO bus instead of the MAC device. This broke the Micrel PHY driver, which has a deprecated feature of allowing PHY properties to be placed into the MAC node. In order to find the MAC node, we need to walk up the tree of devices until we find one with an OF node attached.

Reported-by: Dinh Nguyen <dinguyen@opensource.altera.com> Suggested-by: David Daney <david.daney@cavium.com> Acked-by: David Daney <david.daney@cavium.com> Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the bus' parent.") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Dinh Nguyen <dinguyen@opensource.altera.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
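A sketch of the walk-up described, assuming the PHY's struct device is the starting point and that some ancestor (the MAC) carries an OF node:

    struct device *dev_walker = &phydev->dev;
    struct device_node *of_node;

    /* With PHYs now parented to the MDIO bus, keep climbing the device
     * tree until a device with an OF node (the MAC) is found. */
    do {
        of_node = dev_walker->of_node;
        dev_walker = dev_walker->parent;
    } while (!of_node && dev_walker);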
2015-12-12powercap / RAPL: fix BIOS lock checkPrarit Bhargava
Intel RAPL initialized on several systems where the BIOS lock bit (msr 0x610, bit 63) was set. This occurred because the return value of rapl_read_data_raw() was being checked, rather than the value of the variable passed in, locked.

This patch properly implements the rapl_read_data_raw() call to check the variable locked, and now the Intel RAPL driver outputs the warning:

  intel_rapl: RAPL package 0 domain package locked by BIOS

and does not initialize for the package.

Signed-off-by: Prarit Bhargava <prarit@redhat.com> Acked-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
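A sketch of the corrected check (rapl_read_data_raw() returns an error code and reports the register value through its last argument; the surrounding variable names are approximate):

    u64 locked;
    int ret;

    ret = rapl_read_data_raw(rd, FW_LOCK, false, &locked);
    if (ret)
        return ret;

    if (locked) {   /* test the value read, not the call's return code */
        pr_info("RAPL package %d domain %s locked by BIOS\n",
                rp->id, rd->name);
        rd->state |= DOMAIN_STATE_BIOS_LOCKED;
    }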
2015-12-12cpufreq: intel_pstate: Minor cleanup for FRAC_BITSPrarit Bhargava
785ee27 ("cpufreq: intel_pstate: Fix limits->max_perf rounding error") hardcodes the value of FRAC_BITS. This patch fixes that minor issue. Fixes: 785ee2788141 (cpufreq: intel_pstate: Fix limits->max_perf rounding error) Signed-off-by: Prarit Bhargava <prarit@redhat.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-12-12cpufreq: tegra: add regulator dependency for T124Arnd Bergmann
This driver is the only one that calls regulator_sync_voltage(), but it can currently be built with CONFIG_REGULATOR disabled, producing this build error:

  drivers/cpufreq/tegra124-cpufreq.c: In function 'tegra124_cpu_switch_to_pllx':
  drivers/cpufreq/tegra124-cpufreq.c:68:2: error: implicit declaration of function 'regulator_sync_voltage' [-Werror=implicit-function-declaration]
    regulator_sync_voltage(priv->vdd_cpu_reg);

My first attempt was to implement a helper for this function, regulator_sync_voltage(), but Mark Brown explained:

  We don't do this for *all* regulator API functions - there's some where using them strongly suggests that there is actually a dependency on the regulator API. This does seem like it might be falling into the specialist category [...] Looking at the code I'm pretty unclear on what the authors think the use of _sync_voltage() is doing in the first place so it may be even better to just remove the call. It seems to have been included in the first commit so there's no changelog explaining things and there's no comment either. I'd *expect* it to be a noop as far as I can see.

This adds the dependency to make the driver always build successfully or not be enabled at all. Alternatively, we could investigate whether the driver should stop calling regulator_sync_voltage() instead.

Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-12-11ipv6: sctp: clone options to avoid use after freeEric Dumazet
SCTP is lacking proper np->opt cloning at accept() time. TCP and DCCP use ipv6_dup_options() helper, do the same in SCTP. We might later factorize this code in a common helper to avoid future mistakes. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
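A sketch of the clone at accept() time, mirroring what TCP and DCCP do; placement in the SCTP accept path is simplified, and np->opt is assumed to already be under RCU as in the companion ipv6 change (np is the parent socket's ipv6_pinfo, newnp the child's):

    struct ipv6_txoptions *opt;

    rcu_read_lock();
    opt = rcu_dereference(np->opt);
    if (opt)
        opt = ipv6_dup_options(newsk, opt);
    RCU_INIT_POINTER(newnp->opt, opt);
    rcu_read_unlock();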
2015-12-11xfrm: add rcu protection to sk->sk_policy[]Eric Dumazet
XFRM can deal with SYNACK messages, sent while listener socket is not locked. We add proper rcu protection to __xfrm_sk_clone_policy() and xfrm_sk_policy_lookup() This might serve as the first step to remove xfrm.xfrm_policy_lock use in fast path. Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-11xfrm: add rcu grace period in xfrm_policy_destroy()Eric Dumazet
We will soon switch sk->sk_policy[] to RCU protection, as SYNACK packets are sent while listener socket is not locked. This patch simply adds RCU grace period before struct xfrm_policy freeing, and the corresponding rcu_head in struct xfrm_policy. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
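A sketch of the deferred free described (the rcu_head member name is illustrative; the security state is released in the callback as before):

    static void xfrm_policy_destroy_rcu(struct rcu_head *head)
    {
        struct xfrm_policy *policy = container_of(head, struct xfrm_policy, rcu);

        security_xfrm_policy_free(policy->security);
        kfree(policy);
    }

    void xfrm_policy_destroy(struct xfrm_policy *policy)
    {
        BUG_ON(!policy->walk.dead);
        BUG_ON(del_timer(&policy->timer));

        call_rcu(&policy->rcu, xfrm_policy_destroy_rcu);
    }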
2015-12-11Merge tag 'imx-fixes-4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into fixesKevin Hilman
Merge "ARM: imx: fixes for 4.4, 2nd round" from Shawn Guo:

The i.MX fixes for 4.4, 2nd round:
- Fix a vf610 SAI clock configuration bug which was discovered by the newly added master mode support in the SAI audio driver.
- Fix buggy L2 cache latency values in vf610 device trees, which may cause a system hang when the cpu runs at a higher frequency.

* tag 'imx-fixes-4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
  ARM: dts: vf610: use reset values for L2 cache latencies
  ARM: dts: vf610: fix clock definition for SAI2
  ARM: imx: clk-vf610: fix SAI clock tree
2015-12-11ls2080a/dts: Add little endian property for GPIO IP blockLiu Gang
The GPIO block for the ls2080a platform has little-endian registers; the GPIO driver needs this property to read/write the registers through the right interface.

Signed-off-by: Liu Gang <Gang.Liu@freescale.com> Signed-off-by: Li Yang <leoli@freescale.com> Signed-off-by: Kevin Hilman <khilman@linaro.org>
2015-12-11dt-bindings: define little-endian property for QorIQ GPIOLi Yang
The GPIO block on different QorIQ chips could have registers in different endianness. Define the property to specify which endianness is used by the hardware.

Signed-off-by: Liu Gang <Gang.Liu@freescale.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Li Yang <leoli@freescale.com> Signed-off-by: Kevin Hilman <khilman@linaro.org>
2015-12-11ARM64: dts: ls2080a: fix eSDHC endiannessyangbo lu
Add the "little-endian" property to fix the issue that eSDHC is not working and dumping out "mmc0: Controller never released inhibit bit(s)." error messages constantly. Fixes: 5461597f6ce0 ("dts/ls2080a: Update DTSI to add support of various peripherals") Signed-off-by: Yangbo Lu <yangbo.lu@freescale.com> Signed-off-by: Li Yang <leoli@freescale.com> Signed-off-by: Kevin Hilman <khilman@linaro.org>
2015-12-11USB: add quirk for devices with broken LPMAlan Stern
Some USB device / host controller combinations seem to have problems with Link Power Management. For example, Steinar found that his xHCI controller wouldn't handle bandwidth calculations correctly for two video cards simultaneously when LPM was enabled, even though the bus had plenty of bandwidth available. This patch introduces a new quirk flag for devices that should remain disabled for LPM, and creates quirk entries for Steinar's devices. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-12-11xhci: fix usb2 resume timing and races.Mathias Nyman
According to the USB 2 specs, ports need to signal resume for at least 20ms, in practice even longer, before moving to the U0 state. Both the host and devices can initiate resume.

On device initiated resume, a port status interrupt with the port in resume state is issued. The interrupt handler tags a resume_done[port] timestamp with current time + USB_RESUME_TIMEOUT, and kicks the roothub timer. The roothub timer requests the port status, finds the port in resume state, checks if the resume_done[port] timestamp has passed, and sets the port to U0 state.

On host initiated resume, the current code sets the port to resume state, sleeps 20ms, and finally sets the port to U0 state. This should also be changed to work in a similar way as the device initiated resume, with timestamp tagging, but that is not yet tested and will be a separate fix later.

There are a few issues with this approach:

1. A host initiated resume will also generate a resume event. The event handler will find the port in resume state, believe it's a device initiated resume, and act accordingly.

2. A port status request might cut the resume signalling short if a get_port_status request is handled during the host resume signalling. The port will be found in resume state. The timestamp is not set, leading to time_after_eq(jiffies, timestamp) returning true, as timestamp = 0. get_port_status will proceed with moving the port to U0.

3. If an error, or anything else, happens to the port during device initiated resume signalling, it will leave all the device resume parameters hanging uncleared, preventing further suspend, returning -EBUSY, and causing the pm thread to busyloop trying to enter suspend.

Fix this by using the existing resuming_ports bitfield to indicate that resume signalling timing is taken care of. Check if resume_done[port] is set before using it for timestamp comparison, and also clear out any resume signalling related variables if the port is not in U0 or Resume state.

This issue was discovered when a PM thread busylooped, trying to runtime suspend the xhci USB 2 roothub on a Dell XPS

Cc: stable <stable@vger.kernel.org> Reported-by: Daniel J Blueman <daniel@quora.org> Tested-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>