summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-01-24kernel/panic.c: add missing \nJiri Slaby
When a system panics, the "Rebooting in X seconds.." message is never printed because it lacks a new line. Fix it. Link: http://lkml.kernel.org/r/20170119114751.2724-1-jslaby@suse.cz Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24fbdev: color map copying bounds checkingKees Cook
Copying color maps to userspace doesn't check the value of to->start, which will cause kernel heap buffer OOB read due to signedness wraps. CVE-2016-8405 Link: http://lkml.kernel.org/r/20170105224249.GA50925@beast Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Peter Pi (@heisecode) of Trend Micro Cc: Min Chong <mchong@google.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24frv: add atomic64_add_unless()Sudip Mukherjee
The build of frv allmodconfig was failing with the error: lib/atomic64_test.c:209:9: error: implicit declaration of function 'atomic64_add_unless' All the atomic64 operations were defined in frv, but atomic64_add_unless() was not done. Implement atomic64_add_unless() as done in other arches. Link: http://lkml.kernel.org/r/1484781236-6698-1-git-send-email-sudipm.mukherjee@gmail.com Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm/mempolicy.c: do not put mempolicy before using its nodemaskVlastimil Babka
Since commit be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma") alloc_pages_vma() can potentially free a mempolicy by mpol_cond_put() before accessing the embedded nodemask by __alloc_pages_nodemask(). The commit log says it's so "we can use a single exit path within the function" but that's clearly wrong. We can still do that when doing mpol_cond_put() after the allocation attempt. Make sure the mempolicy is not freed prematurely, otherwise __alloc_pages_nodemask() can end up using a bogus nodemask, which could lead e.g. to premature OOM. Fixes: be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma") Link: http://lkml.kernel.org/r/20170118141124.8345-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24radix-tree: fix private list warningsMatthew Wilcox
The newly introduced warning in radix_tree_free_nodes() was testing the wrong variable; it should have been 'old' instead of 'node'. Fixes: ea07b862ac8e ("mm: workingset: fix use-after-free in shadow node shrinker") Link: http://lkml.kernel.org/r/20170118163746.GA32495@cmpxchg.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24Documentation/filesystems/proc.txt: add VmPinFabian Frederick
Commit bc3e53f682d9 ("mm: distinguish between mlocked and pinned pages") added VmPin in /proc/<pid>/status. Report that in Documentation/filesystems/proc.txt Also move Umask after Name to keep correct order. Link: http://lkml.kernel.org/r/20170114201219.30387-1-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm, memcg: do not retry precharge chargesDavid Rientjes
When memory.move_charge_at_immigrate is enabled and precharges are depleted during move, mem_cgroup_move_charge_pte_range() will attempt to increase the size of the precharge. Prevent precharges from ever looping by setting __GFP_NORETRY. This was probably the intention of the GFP_KERNEL & ~__GFP_NORETRY, which is pointless as written. Fixes: 0029e19ebf84 ("mm: memcontrol: remove explicit OOM parameter in charge path") Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701130208510.69402@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24proc: add a schedule point in proc_pid_readdir()Eric Dumazet
We have seen proc_pid_readdir() invocations holding cpu for more than 50 ms. Add a cond_resched() to be gentle with other tasks. [akpm@linux-foundation.org: coding style fix] Link: http://lkml.kernel.org/r/1484238380.15816.42.camel@edumazet-glaptop3.roam.corp.google.com Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm: alloc_contig: re-allow CMA to compact FS pagesLucas Stach
Commit 73e64c51afc5 ("mm, compaction: allow compaction for GFP_NOFS requests") changed compation to skip FS pages if not explicitly allowed to touch them, but missed to update the CMA compact_control. This leads to a very high isolation failure rate, crippling performance of CMA even on a lightly loaded system. Re-allow CMA to compact FS pages by setting the correct GFP flags, restoring CMA behavior and performance to the kernel 4.9 level. Fixes: 73e64c51afc5 (mm, compaction: allow compaction for GFP_NOFS requests) Link: http://lkml.kernel.org/r/20170113115155.24335-1-l.stach@pengutronix.de Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm/slub.c: trace free objects at KERN_INFODaniel Thompson
Currently when trace is enabled (e.g. slub_debug=T,kmalloc-128 ) the trace messages are mostly output at KERN_INFO. However the trace code also calls print_section() to hexdump the head of a free object. This is hard coded to use KERN_ERR, meaning the console is deluged with trace messages even if we've asked for quiet. Fix this the obvious way but adding a level parameter to print_section(), allowing calls from the trace code to use the same trace level as other trace messages. Link: http://lkml.kernel.org/r/20170113154850.518-1-daniel.thompson@linaro.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24userfaultfd: fix SIGBUS resulting from false rwsem wakeupsAndrea Arcangeli
With >=32 CPUs the userfaultfd selftest triggered a graceful but unexpected SIGBUS because VM_FAULT_RETRY was returned by handle_userfault() despite the UFFDIO_COPY wasn't completed. This seems caused by rwsem waking the thread blocked in handle_userfault() and we can't run up_read() before the wait_event sequence is complete. Keeping the wait_even sequence identical to the first one, would require running userfaultfd_must_wait() again to know if the loop should be repeated, and it would also require retaking the rwsem and revalidating the whole vma status. It seems simpler to wait the targeted wakeup so that if false wakeups materialize we still wait for our specific wakeup event, unless of course there are signals or the uffd was released. Debug code collecting the stack trace of the wakeup showed this: $ ./userfaultfd 100 99999 nr_pages: 25600, nr_pages_per_cpu: 800 bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2 bounces: 99997, mode: rnd ver poll, Bus error (core dumped) save_stack_trace+0x2b/0x50 try_to_wake_up+0x2a6/0x580 wake_up_q+0x32/0x70 rwsem_wake+0xe0/0x120 call_rwsem_wake+0x1b/0x30 up_write+0x3b/0x40 vm_mmap_pgoff+0x9c/0xc0 SyS_mmap_pgoff+0x1a9/0x240 SyS_mmap+0x22/0x30 entry_SYSCALL_64_fastpath+0x1f/0xbd 0xffffffffffffffff FAULT_FLAG_ALLOW_RETRY missing 70 CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G W 4.8.0+ #30 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0xb8/0x112 handle_userfault+0x572/0x650 handle_mm_fault+0x12cb/0x1520 __do_page_fault+0x175/0x500 trace_do_page_fault+0x61/0x270 do_async_page_fault+0x19/0x90 async_page_fault+0x25/0x30 This always happens when the main userfault selftest thread is running clone() while glibc runs either mprotect or mmap (both taking mmap_sem down_write()) to allocate the thread stack of the background threads, while locking/userfault threads already run at full throttle and are susceptible to false wakeups that may cause handle_userfault() to return before than expected (which results in graceful SIGBUS at the next attempt). This was reproduced only with >=32 CPUs because the loop to start the thread where clone() is too quick with fewer CPUs, while with 32 CPUs there's already significant activity on ~32 locking and userfault threads when the last background threads are started with clone(). This >=32 CPUs SMP race condition is likely reproducible only with the selftest because of the much heavier userfault load it generates if compared to real apps. We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a patch floating around that provides it also hidden this problem but in reality only is successfully at hiding the problem. False wakeups could still happen again the second time handle_userfault() is invoked, even if it's a so rare race condition that getting false wakeups twice in a row is impossible to reproduce. This full fix is needed for correctness, the only alternative would be to allow VM_FAULT_RETRY to be returned infinitely. With this fix the WP support can stick to a strict "one more" VM_FAULT_RETRY logic (no need of returning it infinite times to avoid the SIGBUS). Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24drivers/memstick/core/memstick.c: avoid -Wnonnull warningArnd Bergmann
gcc-7 produces a harmless false-postive warning about a possible NULL pointer access: drivers/memstick/core/memstick.c: In function 'h_memstick_read_dev_id': drivers/memstick/core/memstick.c:309:3: error: argument 2 null where non-null expected [-Werror=nonnull] memcpy(mrq->data, buf, mrq->data_len); This can't happen because the caller sets the command to 'MS_TPC_READ_REG', which causes the data direction to be 'READ' and the NULL pointer not accessed. As a simple workaround for the warning, we can pass a pointer to the data that we actually want to read into. This is not needed here, but also harmless, and lets the compiler know that the access is ok. Link: http://lkml.kernel.org/r/20170111144143.548867-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alex Dubov <oakad@yahoo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24kernel/watchdog: prevent false hardlockup on overloaded systemDon Zickus
On an overloaded system, it is possible that a change in the watchdog threshold can be delayed long enough to trigger a false positive. This can easily be achieved by having a cpu spinning indefinitely on a task, while another cpu updates watchdog threshold. What happens is while trying to park the watchdog threads, the hrtimers on the other cpus trigger and reprogram themselves with the new slower watchdog threshold. Meanwhile, the nmi watchdog is still programmed with the old faster threshold. Because the one cpu is blocked, it prevents the thread parking on the other cpus from completing, which is needed to shutdown the nmi watchdog and reprogram it correctly. As a result, a false positive from the nmi watchdog is reported. Fix this by setting a park_in_progress flag to block all lockups until the parking is complete. Fix provided by Ulrich Obergfell. [akpm@linux-foundation.org: s/park_in_progress/watchdog_park_in_progress/] Link: http://lkml.kernel.org/r/1481041033-192236-1-git-send-email-dzickus@redhat.com Signed-off-by: Don Zickus <dzickus@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24dax: fix build warnings with FS_DAX and !FS_IOMAPRoss Zwisler
As reported by Arnd: https://lkml.org/lkml/2017/1/10/756 Compiling with the following configuration: # CONFIG_EXT2_FS is not set # CONFIG_EXT4_FS is not set # CONFIG_XFS_FS is not set # CONFIG_FS_IOMAP depends on the above filesystems, as is not set CONFIG_FS_DAX=y generates build warnings about unused functions in fs/dax.c: fs/dax.c:878:12: warning: `dax_insert_mapping' defined but not used [-Wunused-function] static int dax_insert_mapping(struct address_space *mapping, ^~~~~~~~~~~~~~~~~~ fs/dax.c:572:12: warning: `copy_user_dax' defined but not used [-Wunused-function] static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size, ^~~~~~~~~~~~~ fs/dax.c:542:12: warning: `dax_load_hole' defined but not used [-Wunused-function] static int dax_load_hole(struct address_space *mapping, void **entry, ^~~~~~~~~~~~~ fs/dax.c:312:14: warning: `grab_mapping_entry' defined but not used [-Wunused-function] static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index, ^~~~~~~~~~~~~~~~~~ Now that the struct buffer_head based DAX fault paths and I/O path have been removed we really depend on iomap support being present for DAX. Make this explicit by selecting FS_IOMAP if we compile in DAX support. This allows us to remove conditional selections of FS_IOMAP when FS_DAX was present for ext2 and ext4, and to remove an #ifdef in fs/dax.c. Link: http://lkml.kernel.org/r/1484087383-29478-1-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thpKeno Fischer
In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had also been set. However, a similar check in huge_memory.c was forgotten. As a result, remote memory writes to ro regions of memory backed by transparent huge pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails out immidiately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is true. While in this state the process is stil SIGKILLable, but little else works (e.g. no ptrace attach, no other signals). This is easily reproduced with the following code (assuming thp are set to always): #include <assert.h> #include <fcntl.h> #include <stdint.h> #include <stdio.h> #include <string.h> #include <sys/mman.h> #include <sys/stat.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #define TEST_SIZE 5 * 1024 * 1024 int main(void) { int status; pid_t child; int fd = open("/proc/self/mem", O_RDWR); void *addr = mmap(NULL, TEST_SIZE, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); assert(addr != MAP_FAILED); pid_t parent_pid = getpid(); if ((child = fork()) == 0) { void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); assert(addr2 != MAP_FAILED); memset(addr2, 'a', TEST_SIZE); pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr); return 0; } assert(child == waitpid(child, &status, 0)); assert(WIFEXITED(status) && WEXITSTATUS(status) == 0); return 0; } Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously to the update in gup.c in the original commit. The same pattern exists in follow_devmap_pmd. However, we should not be able to reach that check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we ever do. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170106015025.GA38411@juliacomputing.com Signed-off-by: Keno Fischer <keno@juliacomputing.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-24memory_hotplug: make zone_can_shift() return a boolean valueYasuaki Ishimatsu
online_{kernel|movable} is used to change the memory zone to ZONE_{NORMAL|MOVABLE} and online the memory. To check that memory zone can be changed, zone_can_shift() is used. Currently the function returns minus integer value, plus integer value and 0. When the function returns minus or plus integer value, it means that the memory zone can be changed to ZONE_{NORNAL|MOVABLE}. But when the function returns 0, there are two meanings. One of the meanings is that the memory zone does not need to be changed. For example, when memory is in ZONE_NORMAL and onlined by online_kernel the memory zone does not need to be changed. Another meaning is that the memory zone cannot be changed. When memory is in ZONE_NORMAL and onlined by online_movable, the memory zone may not be changed to ZONE_MOVALBE due to memory online limitation(see Documentation/memory-hotplug.txt). In this case, memory must not be onlined. The patch changes the return type of zone_can_shift() so that memory online operation fails when memory zone cannot be changed as follows: Before applying patch: # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 # echo online_movable > memory4097/state # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 8388608 managed 8388608 online_movable operation succeeded. But memory is onlined as ZONE_NORMAL, not ZONE_MOVABLE. After applying patch: # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 # echo online_movable > memory4097/state bash: echo: write error: Invalid argument # grep -A 35 "Node 2" /proc/zoneinfo Node 2, zone Normal <snip> node_scanned 0 spanned 8388608 present 7864320 managed 7864320 online_movable operation failed because of failure of changing the memory zone from ZONE_NORMAL to ZONE_MOVABLE Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online") Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-25vring: Force use of DMA API for ARM-based systems with legacy devicesWill Deacon
Booting Linux on an ARM fastmodel containing an SMMU emulation results in an unexpected I/O page fault from the legacy virtio-blk PCI device: [ 1.211721] arm-smmu-v3 2b400000.smmu: event 0x10 received: [ 1.211800] arm-smmu-v3 2b400000.smmu: 0x00000000fffff010 [ 1.211880] arm-smmu-v3 2b400000.smmu: 0x0000020800000000 [ 1.211959] arm-smmu-v3 2b400000.smmu: 0x00000008fa081002 [ 1.212075] arm-smmu-v3 2b400000.smmu: 0x0000000000000000 [ 1.212155] arm-smmu-v3 2b400000.smmu: event 0x10 received: [ 1.212234] arm-smmu-v3 2b400000.smmu: 0x00000000fffff010 [ 1.212314] arm-smmu-v3 2b400000.smmu: 0x0000020800000000 [ 1.212394] arm-smmu-v3 2b400000.smmu: 0x00000008fa081000 [ 1.212471] arm-smmu-v3 2b400000.smmu: 0x0000000000000000 <system hangs failing to read partition table> This is because the legacy virtio-blk device is behind an SMMU, so we have consequently swizzled its DMA ops and configured the SMMU to translate accesses. This then requires the vring code to use the DMA API to establish translations, otherwise all transactions will result in fatal faults and termination. Given that ARM-based systems only see an SMMU if one is really present (the topology is all described by firmware tables such as device-tree or IORT), then we can safely use the DMA API for all legacy virtio devices. Modern devices can advertise the prescense of an IOMMU using the VIRTIO_F_IOMMU_PLATFORM feature flag. Cc: Andy Lutomirski <luto@kernel.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: <stable@vger.kernel.org> Fixes: 876945dbf649 ("arm64: Hook up IOMMU dma_ops") Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com>
2017-01-25virtio_mmio: Set DMA masks appropriatelyRobin Murphy
Once DMA API usage is enabled, it becomes apparent that virtio-mmio is inadvertently relying on the default 32-bit DMA mask, which leads to problems like rapidly exhausting SWIOTLB bounce buffers. Ensure that we set the appropriate 64-bit DMA mask whenever possible, with the coherent mask suitably limited for the legacy vring as per a0be1db4304f ("virtio_pci: Limit DMA mask to 44 bits for legacy virtio devices"). Cc: Andy Lutomirski <luto@kernel.org> Cc: Michael S. Tsirkin <mst@redhat.com> Reported-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Fixes: b42111382f0e ("virtio_mmio: Use the DMA API if enabled") Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2017-01-25vhost/vsock: handle vhost_vq_init_access() errorStefan Hajnoczi
Propagate the error when vhost_vq_init_access() fails and set vq->private_data to NULL. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2017-01-24ARCv2: smp-boot: wake_flag polling by non-Masters needs to be uncachedVineet Gupta
This is needed on HS38 cores, for setting up IO-Coherency aperture properly The polling could perturb the caches and coherecy fabric which could be wrong in the small window when Master is setting up IOC aperture etc in arc_cache_init() We do it only for ARCv2 based builds to not affect EZChip ARCompact based platform. Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
2017-01-24Merge branch 'lwt-module-unload'David S. Miller
Robert Shearman says: ==================== net: Fix oops on state free after lwt module unload An oops is seen in lwtstate_free after an lwt ops module has been unloaded. This patchset fixes this by preventing modules implementing lwtunnel ops from being unloaded whilst there's state alive using those ops. The first patch adds fills in a new owner field in all lwt ops and the second patch makes use of this to reference count the modules as state is built and destroyed using them. Changes in v3: - don't put module reference if try_module_get fails on building state Changes in v2: - specify module owner for all modules as suggested by DaveM - reference count all modules building lwt state, not just those ops implementing destroy_state, as also suggested by DaveM. - rebased on top of David Ahern's lwtunnel changes ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24lwtunnel: Fix oops on state free after encap module unloadRobert Shearman
When attempting to free lwtunnel state after the module for the encap has been unloaded an oops occurs: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: lwtstate_free+0x18/0x40 [..] task: ffff88003e372380 task.stack: ffffc900001fc000 RIP: 0010:lwtstate_free+0x18/0x40 RSP: 0018:ffff88003fd83e88 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88002bbb3380 RCX: ffff88000c91a300 [..] Call Trace: <IRQ> free_fib_info_rcu+0x195/0x1a0 ? rt_fibinfo_free+0x50/0x50 rcu_process_callbacks+0x2d3/0x850 ? rcu_process_callbacks+0x296/0x850 __do_softirq+0xe4/0x4cb irq_exit+0xb0/0xc0 smp_apic_timer_interrupt+0x3d/0x50 apic_timer_interrupt+0x93/0xa0 [..] Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8 The problem is after the module for the encap can be unloaded the corresponding ops is removed and is thus NULL here. Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so grab the module reference for the ops on creating lwtunnel state and of course release the reference when freeing the state. Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation") Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: Specify the owning module for lwtunnel opsRobert Shearman
Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so specify the owning module for all lwtunnel ops. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24IB/rxe: Prevent from completer to operate on non valid QPYonatan Cohen
On UD QP completer tasklet is scheduled for each packet sent. If it is followed by a destroy_qp(), the kernel panic will happen as the completer tries to operate on a destroyed QP. Fixes: 8700e3e7c485 ("Soft RoCE driver") Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com> Reviewed-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24IB/rxe: Fix rxe dev insertion to rxe_dev_listMaor Gottlieb
The first argument of list_add_tail is the new item and the second is the head of the list. Fix the code to pass arguments in the right order, otherwise not all the rxe devices will be removed during teardown. Fixes: 8700e3e7c4857 ('Soft RoCE driver') Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24Merge branch 'tipc-topology-fixes'David S. Miller
Parthasarathy Bhuvaragan says: ==================== tipc: topology server fixes for nametable soft lockup In this series, we revert the commit 333f796235a527 ("tipc: fix a race condition leading to subscriber refcnt bug") and provide an alternate solution to fix the race conditions in commits 2-4. We have to do this as the above commit introduced a nametbl soft lockup at module exit as described by patch#4. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: fix cleanup at module unloadParthasarathy Bhuvaragan
In tipc_server_stop(), we iterate over the connections with limiting factor as server's idr_in_use. We ignore the fact that this variable is decremented in tipc_close_conn(), leading to premature exit. In this commit, we iterate until the we have no connections left. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Tested-by: John Thompson <thompa.atl@gmail.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: ignore requests when the connection state is not CONNECTEDParthasarathy Bhuvaragan
In tipc_conn_sendmsg(), we first queue the request to the outqueue followed by the connection state check. If the connection is not connected, we should not queue this message. In this commit, we reject the messages if the connection state is not CF_CONNECTED. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Tested-by: John Thompson <thompa.atl@gmail.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: fix nametbl_lock soft lockup at module exitParthasarathy Bhuvaragan
Commit 333f796235a527 ("tipc: fix a race condition leading to subscriber refcnt bug") reveals a soft lockup while acquiring nametbl_lock. Before commit 333f796235a527, we call tipc_conn_shutdown() from tipc_close_conn() in the context of tipc_topsrv_stop(). In that context, we are allowed to grab the nametbl_lock. Commit 333f796235a527, moved tipc_conn_release (renamed from tipc_conn_shutdown) to the connection refcount cleanup. This allows either tipc_nametbl_withdraw() or tipc_topsrv_stop() to the cleanup. Since tipc_exit_net() first calls tipc_topsrv_stop() and then tipc_nametble_withdraw() increases the chances for the later to perform the connection cleanup. The soft lockup occurs in the call chain of tipc_nametbl_withdraw(), when it performs the tipc_conn_kref_release() as it tries to grab nametbl_lock again while holding it already. tipc_nametbl_withdraw() grabs nametbl_lock tipc_nametbl_remove_publ() tipc_subscrp_report_overlap() tipc_subscrp_send_event() tipc_conn_sendmsg() << if (con->flags != CF_CONNECTED) we do conn_put(), triggering the cleanup as refcount=0. >> tipc_conn_kref_release tipc_sock_release tipc_conn_release tipc_subscrb_delete tipc_subscrp_delete tipc_nametbl_unsubscribe << Soft Lockup >> The previous changes in this series fixes the race conditions fixed by commit 333f796235a527. Hence we can now revert the commit. Fixes: 333f796235a52727 ("tipc: fix a race condition leading to subscriber refcnt bug") Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: fix connection refcount errorParthasarathy Bhuvaragan
Until now, the generic server framework maintains the connection id's per subscriber in server's conn_idr. At tipc_close_conn, we remove the connection id from the server list, but the connection is valid until we call the refcount cleanup. Hence we have a window where the server allocates the same connection to an new subscriber leading to inconsistent reference count. We have another refcount warning we grab the refcount in tipc_conn_lookup() for connections with flag with CF_CONNECTED not set. This usually occurs at shutdown when the we stop the topology server and withdraw TIPC_CFG_SRV publication thereby triggering a withdraw message to subscribers. In this commit, we: 1. remove the connection from the server list at recount cleanup. 2. grab the refcount for a connection only if CF_CONNECTED is set. Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: add subscription refcount to avoid invalid deleteParthasarathy Bhuvaragan
Until now, the subscribers keep track of the subscriptions using reference count at subscriber level. At subscription cancel or subscriber delete, we delete the subscription only if the timer was pending for the subscription. This approach is incorrect as: 1. del_timer() is not SMP safe, if on CPU0 the check for pending timer returns true but CPU1 might schedule the timer callback thereby deleting the subscription. Thus when CPU0 is scheduled, it deletes an invalid subscription. 2. We export tipc_subscrp_report_overlap(), which accesses the subscription pointer multiple times. Meanwhile the subscription timer can expire thereby freeing the subscription and we might continue to access the subscription pointer leading to memory violations. In this commit, we introduce subscription refcount to avoid deleting an invalid subscription. Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: fix nametbl_lock soft lockup at node/link eventsParthasarathy Bhuvaragan
We trigger a soft lockup as we grab nametbl_lock twice if the node has a pending node up/down or link up/down event while: - we process an incoming named message in tipc_named_rcv() and perform an tipc_update_nametbl(). - we have pending backlog items in the name distributor queue during a nametable update using tipc_nametbl_publish() or tipc_nametbl_withdraw(). The following are the call chain associated: tipc_named_rcv() Grabs nametbl_lock tipc_update_nametbl() (publish/withdraw) tipc_node_subscribe()/unsubscribe() tipc_node_write_unlock() << lockup occurs if an outstanding node/link event exits, as we grabs nametbl_lock again >> tipc_nametbl_withdraw() Grab nametbl_lock tipc_named_process_backlog() tipc_update_nametbl() << rest as above >> The function tipc_node_write_unlock(), in addition to releasing the lock processes the outstanding node/link up/down events. To do this, we need to grab the nametbl_lock again leading to the lockup. In this commit we fix the soft lockup by introducing a fast variant of node_unlock(), where we just release the lock. We adapt the node_subscribe()/node_unsubscribe() to use the fast variants. Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24netfilter: nf_tables: bump set->ndeact on set flushPablo Neira Ayuso
Add missing set->ndeact update on each deactivated element from the set flush path. Otherwise, sets with fixed size break after flush since accounting breaks. # nft add set x y { type ipv4_addr\; size 2\; } # nft add element x y { 1.1.1.1 } # nft add element x y { 1.1.1.2 } # nft flush set x y # nft add element x y { 1.1.1.1 } <cmdline>:1:1-28: Error: Could not process rule: Too many open files in system Fixes: 8411b6442e59 ("netfilter: nf_tables: support for set flushing") Reported-by: Elise Lennion <elise.lennion@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24netfilter: nf_tables: deconstify walk callback functionPablo Neira Ayuso
The flush operation needs to modify set and element objects, so let's deconstify this. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCLPablo Neira Ayuso
If the element exists and no NLM_F_EXCL is specified, do not bump set->nelems, otherwise we leak one set element slot. This problem amplifies if the set is full since the abort path always decrements the counter for the -ENFILE case too, giving one spare extra slot. Fix this by moving set->nelems update to nft_add_set_elem() after successful element insertion. Moreover, remove the element if the set is full so there is no need to rely on the abort path to undo things anymore. Fixes: c016c7e45ddf ("netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24netfilter: nft_log: restrict the log prefix length to 127Liping Zhang
First, log prefix will be truncated to NF_LOG_PREFIXLEN-1, i.e. 127, at nf_log_packet(), so the extra part is useless. Second, after adding a log rule with a very very long prefix, we will fail to dump the nft rules after this _special_ one, but acctually, they do exist. For example: # name_65000=$(printf "%0.sQ" {1..65000}) # nft add rule filter output log prefix "$name_65000" # nft add rule filter output counter # nft add rule filter output counter # nft list chain filter output table ip filter { chain output { type filter hook output priority 0; policy accept; } } So now, restrict the log prefix length to NF_LOG_PREFIXLEN-1. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24IB/umem: Release pid in error and ODP flowKenneth Lee
1. Release pid before enter odp flow 2. Release pid when fail to allocate memory Fixes: 87773dd56d54 ("IB: ib_umem_release() should decrement mm->pinned_vm from ib_umem_get") Fixes: 8ada2c1c0c1d ("IB/core: Add support for on demand paging regions") Signed-off-by: Kenneth Lee <liguozhu@hisilicon.com> Reviewed-by: Haggai Eran <haggaie@mellanox.com> Reviewed-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24Merge tag 'platform-drivers-x86-v4.10-4' of ↵Linus Torvalds
git://git.infradead.org/linux-platform-drivers-x86 Pull x86 platform-driver fixes from Andy Shevchenko: "This is my first pull request since I become a co-maintainer of Platform Drivers x86 subsystem. It's a bit bigger than usual due to material collected for almost two weeks in a row. MAINTAINERS: - Add myself to X86 PLATFORM DRIVERS as a co-maintainer ideapad-laptop: - handle ACPI event 1 intel_mid_powerbtn: - Set IRQ_ONESHOT surface3-wmi: - fix uninitialized symbol - Shut up unused-function warning mlx-platform: - free first dev on error" * tag 'platform-drivers-x86-v4.10-4' of git://git.infradead.org/linux-platform-drivers-x86: MAINTAINERS: Add myself to X86 PLATFORM DRIVERS as a co-maintainer platform/x86: ideapad-laptop: handle ACPI event 1 platform/x86: intel_mid_powerbtn: Set IRQ_ONESHOT platform/x86: surface3-wmi: fix uninitialized symbol platform/x86: surface3-wmi: Shut up unused-function warning platform/x86: mlx-platform: free first dev on error
2017-01-24RDMA/qedr: Dispatch port active event from qedr_addRam Amrani
Relying on qede to trigger qedr on startup is problematic. When probing both if qedr loads slowly then qede can assume qedr is missing and not trigger it. This patch adds a triggering from qedr and protects against a race via an atomic bit. Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDMA/qedr: Fix and simplify memory leak in PD allocRam Amrani
Free the PD if no internal resources were available. Move userspace code under the relevant 'if'. Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDMA/qedr: Fix RDMA CM loopbackRam Amrani
The loopback logic in RDMA CM packets compares Ethernet addresses and was accidently inverse. Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDMA/qedr: Fix formattingRam Amrani
Remove standalone ';'. List function's parameters in a single line. Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDMA/qedr: Mark three functions as staticRam Amrani
mark qedr_get_state_from_ibqp(), __qedr_alloc_mr() and __qedr_post_send() as static since they are only used in the same file. Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDMA/qedr: Don't reset QP when queues aren't flushedRam Amrani
Fail QP state transition from error to reset if SQ/RQ are not empty and still in the process of flushing out the queued work entries. Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDMA/qedr: Don't spam dmesg if QP is in error stateRam Amrani
It is normal to flush CQEs if the QP is in error state. Hence there's no use in printing a message per CQE to dmesg. Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDMA/qedr: Remove CQ spinlock from CM completion handlersRam Amrani
There is only a single event queue that triggers the completion events for the RDMA CM and it is being processed serially. This means that inherently there can no parallelism of CQ completion handler callbacks, hence the lock is redundant. Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDMA/qedr: Return max inline data in QP query resultRam Amrani
Return the maximum supported amount of inline data, not the qp's current configured inline data size, when filling out the results of a query qp call. Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDMA/qedr: Return success when not changing QP stateRam Amrani
If the user is requesting us to change the QP state to the same state that it is already in, return success instead of failure. Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDMA/qedr: Add uapi header qedr-abi.hAmrani, Ram
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-01-24RDMA/qedr: Fix MTU returned from QP queryAmrani, Ram
MTU value returned from QP query should include overhead. Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Doug Ledford <dledford@redhat.com>