summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-06-07mm/docs: describe memory.low refinementsRoman Gushchin
Refine cgroup v2 docs after latest memory.low changes. Link: http://lkml.kernel.org/r/20180405185921.4942-4-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm: treat memory.low value inclusiveRoman Gushchin
If memcg's usage is equal to the memory.low value, avoid reclaiming from this cgroup while there is a surplus of reclaimable memory. This sounds more logical and also matches memory.high and memory.max behavior: both are inclusive. Empty cgroups are not considered protected, so MEMCG_LOW events are not emitted for empty cgroups, if there is no more reclaimable memory in the system. Link: http://lkml.kernel.org/r/20180406122132.GA7185@castle Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm: memory.low hierarchical behaviorRoman Gushchin
This patch aims to address an issue in current memory.low semantics, which makes it hard to use it in a hierarchy, where some leaf memory cgroups are more valuable than others. For example, there are memcgs A, A/B, A/C, A/D and A/E: A A/memory.low = 2G, A/memory.current = 6G //\\ BC DE B/memory.low = 3G B/memory.current = 2G C/memory.low = 1G C/memory.current = 2G D/memory.low = 0 D/memory.current = 2G E/memory.low = 10G E/memory.current = 0 If we apply memory pressure, B, C and D are reclaimed at the same pace while A's usage exceeds 2G. This is obviously wrong, as B's usage is fully below B's memory.low, and C has 1G of protection as well. Also, A is pushed to the size, which is less than A's 2G memory.low, which is also wrong. A simple bash script (provided below) can be used to reproduce the problem. Current results are: A: 1430097920 A/B: 711929856 A/C: 717426688 A/D: 741376 A/E: 0 To address the issue a concept of effective memory.low is introduced. Effective memory.low is always equal or less than original memory.low. In a case, when there is no memory.low overcommittment (and also for top-level cgroups), these two values are equal. Otherwise it's a part of parent's effective memory.low, calculated as a cgroup's memory.low usage divided by sum of sibling's memory.low usages (under memory.low usage I mean the size of actually protected memory: memory.current if memory.current < memory.low, 0 otherwise). It's necessary to track the actual usage, because otherwise an empty cgroup with memory.low set (A/E in my example) will affect actual memory distribution, which makes no sense. To avoid traversing the cgroup tree twice, page_counters code is reused. Calculating effective memory.low can be done in the reclaim path, as we conveniently traversing the cgroup tree from top to bottom and check memory.low on each level. So, it's a perfect place to calculate effective memory low and save it to use it for children cgroups. This also eliminates a need to traverse the cgroup tree from bottom to top each time to check if parent's guarantee is not exceeded. Setting/resetting effective memory.low is intentionally racy, but it's fine and shouldn't lead to any significant differences in actual memory distribution. With this patch applied results are matching the expectations: A: 2147930112 A/B: 1428721664 A/C: 718393344 A/D: 815104 A/E: 0 Test script: #!/bin/bash CGPATH="/sys/fs/cgroup" truncate /file1 --size 2G truncate /file2 --size 2G truncate /file3 --size 2G truncate /file4 --size 50G mkdir "${CGPATH}/A" echo "+memory" > "${CGPATH}/A/cgroup.subtree_control" mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E" echo 2G > "${CGPATH}/A/memory.low" echo 3G > "${CGPATH}/A/B/memory.low" echo 1G > "${CGPATH}/A/C/memory.low" echo 0 > "${CGPATH}/A/D/memory.low" echo 10G > "${CGPATH}/A/E/memory.low" echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1 echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2 echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3 echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4 echo "A: " `cat "${CGPATH}/A/memory.current"` echo "A/B: " `cat "${CGPATH}/A/B/memory.current"` echo "A/C: " `cat "${CGPATH}/A/C/memory.current"` echo "A/D: " `cat "${CGPATH}/A/D/memory.current"` echo "A/E: " `cat "${CGPATH}/A/E/memory.current"` rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E" rmdir "${CGPATH}/A" rm /file1 /file2 /file3 /file4 Link: http://lkml.kernel.org/r/20180405185921.4942-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm: rename page_counter's count/limit into usage/maxRoman Gushchin
This patch renames struct page_counter fields: count -> usage limit -> max and the corresponding functions: page_counter_limit() -> page_counter_set_max() mem_cgroup_get_limit() -> mem_cgroup_get_max() mem_cgroup_resize_limit() -> mem_cgroup_resize_max() memcg_update_kmem_limit() -> memcg_update_kmem_max() memcg_update_tcp_limit() -> memcg_update_tcp_max() The idea behind this renaming is to have the direct matching between memory cgroup knobs (low, high, max) and page_counters API. This is pure renaming, this patch doesn't bring any functional change. Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm/memblock: introduce PHYS_ADDR_MAXStefan Agner
So far code was using ULLONG_MAX and type casting to obtain a phys_addr_t with all bits set. The typecast is necessary to silence compiler warnings on 32-bit platforms. Use the simpler but still type safe approach "~(phys_addr_t)0" to create a preprocessor define for all bits set. Link: http://lkml.kernel.org/r/20180406213809.566-1-stefan@agner.ch Signed-off-by: Stefan Agner <stefan@agner.ch> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm: remove odd HAVE_PTE_SPECIALLaurent Dufour
Remove the additional define HAVE_PTE_SPECIAL and rely directly on CONFIG_ARCH_HAS_PTE_SPECIAL. There is no functional change introduced by this patch Link: http://lkml.kernel.org/r/1523533733-25437-1-git-send-email-ldufour@linux.vnet.ibm.com Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christophe LEROY <christophe.leroy@c-s.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm: introduce ARCH_HAS_PTE_SPECIALLaurent Dufour
Currently the PTE special supports is turned on in per architecture header files. Most of the time, it is defined in arch/*/include/asm/pgtable.h depending or not on some other per architecture static definition. This patch introduce a new configuration variable to manage this directly in the Kconfig files. It would later replace __HAVE_ARCH_PTE_SPECIAL. Here notes for some architecture where the definition of __HAVE_ARCH_PTE_SPECIAL is not obvious: arm __HAVE_ARCH_PTE_SPECIAL which is currently defined in arch/arm/include/asm/pgtable-3level.h which is included by arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set. So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE. powerpc __HAVE_ARCH_PTE_SPECIAL is defined in 2 files: - arch/powerpc/include/asm/book3s/64/pgtable.h - arch/powerpc/include/asm/pte-common.h The first one is included if (PPC_BOOK3S & PPC64) while the second is included in all the other cases. So select ARCH_HAS_PTE_SPECIAL all the time. sparc: __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) && defined(__arch64__) which are defined through the compiler in sparc/Makefile if !SPARC32 which I assume to be if SPARC64. So select ARCH_HAS_PTE_SPECIAL if SPARC64 There is no functional change introduced by this patch. Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Suggested-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: David S. Miller <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Albert Ou <albert@sifive.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Christophe LEROY <christophe.leroy@c-s.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm/page_alloc: remove realsize in free_area_init_core()Wei Yang
Highmem's realsize always equals to freesize, so it is not necessary to spare a variable to record this. Link: http://lkml.kernel.org/r/20180413083859.65888-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm: restructure memfd codeMike Kravetz
With the addition of memfd hugetlbfs support, we now have the situation where memfd depends on TMPFS -or- HUGETLBFS. Previously, memfd was only supported on tmpfs, so it made sense that the code resided in shmem.c. In the current code, memfd is only functional if TMPFS is defined. If HUGETLFS is defined and TMPFS is not defined, then memfd functionality will not be available for hugetlbfs. This does not cause BUGs, just a lack of potentially desired functionality. Code is restructured in the following way: - include/linux/memfd.h is a new file containing memfd specific definitions previously contained in shmem_fs.h. - mm/memfd.c is a new file containing memfd specific code previously contained in shmem.c. - memfd specific code is removed from shmem_fs.h and shmem.c. - A new config option MEMFD_CREATE is added that is defined if TMPFS or HUGETLBFS is defined. No functional changes are made to the code: restructuring only. Link: http://lkml.kernel.org/r/20180415182119.4517-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marc-Andr Lureau <marcandre.lureau@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm/shmem: update file sealing comments and file checkingMike Kravetz
In preparation for memfd code restructure, update comments, definitions and function names dealing with file sealing to indicate that tmpfs and hugetlbfs are the supported filesystems. Also, change file pointer checks in memfd_file_seals_ptr to use defined interfaces instead of directly referencing file_operation structs. Link: http://lkml.kernel.org/r/20180415182119.4517-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marc-Andr Lureau <marcandre.lureau@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm/shmem: add __rcu annotations and properly deref radix entryMike Kravetz
Patch series "restructure memfd code", v4. This patch (of 3): In preparation for memfd code restucture, clean up sparse warnings. Most changes required adding __rcu annotations. The routine find_swap_entry was modified to properly deference radix tree entries. Link: http://lkml.kernel.org/r/20180415182119.4517-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Marc-Andr Lureau <marcandre.lureau@gmail.com> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07zram: introduce zram memory trackingMinchan Kim
zRam as swap is useful for small memory device. However, swap means those pages on zram are mostly cold pages due to VM's LRU algorithm. Especially, once init data for application are touched for launching, they tend to be not accessed any more and finally swapped out. zRAM can store such cold pages as compressed form but it's pointless to keep in memory. Better idea is app developers free them directly rather than remaining them on heap. This patch tell us last access time of each block of zram via "cat /sys/kernel/debug/zram/zram0/block_state". The output is as follows, 300 75.033841 .wh 301 63.806904 s.. 302 63.806919 ..h First column is zram's block index and 3rh one represents symbol (s: same page w: written page to backing store h: huge page) of the block state. Second column represents usec time unit of the block was last accessed. So above example means the 300th block is accessed at 75.033851 second and it was huge so it was written to the backing store. Admin can leverage this information to catch cold|incompressible pages of process with *pagemap* once part of heaps are swapped out. I used the feature a few years ago to find memory hoggers in userspace to notify them what memory they have wasted without touch for a long time. With it, they could reduce unnecessary memory space. However, at that time, I hacked up zram for the feature but now I need the feature again so I decided it would be better to upstream rather than keeping it alone. I hope I submit the userspace tool to use the feature soon. [akpm@linux-foundation.org: fix i386 printk warning] [minchan@kernel.org: use ktime_get_boottime() instead of sched_clock()] Link: http://lkml.kernel.org/r/20180420063525.GA253739@rodete-desktop-imager.corp.google.com [akpm@linux-foundation.org: documentation tweak] [akpm@linux-foundation.org: fix i386 printk warning] [minchan@kernel.org: fix compile warning] Link: http://lkml.kernel.org/r/20180508104849.GA8209@rodete-desktop-imager.corp.google.com [rdunlap@infradead.org: fix printk formats] Link: http://lkml.kernel.org/r/3652ccb1-96ef-0b0b-05d1-f661d7733dcc@infradead.org Link: http://lkml.kernel.org/r/20180416090946.63057-5-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07zram: record accessed secondMinchan Kim
zRam as swap is useful for small memory device. However, swap means those pages on zram are mostly cold pages due to VM's LRU algorithm. Especially, once init data for application are touched for launching, they tend to be not accessed any more and finally swapped out. zRAM can store such cold pages as compressed form but it's pointless to keep in memory. Better idea is app developers free them directly rather than remaining them on heap. This patch records last access time of each block of zram so that With upcoming zram memory tracking, it could help userspace developers to reduce memory footprint. Link: http://lkml.kernel.org/r/20180416090946.63057-4-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07zram: mark incompressible page as ZRAM_HUGEMinchan Kim
Mark incompressible pages so that we could investigate who is the owner of the incompressible pages once the page is swapped out via using upcoming zram memory tracker feature. With it, we could prevent such pages to be swapped out by using mlock. Otherwise we might remove them. This patch exposes new stat for huge pages via mm_stat. Link: http://lkml.kernel.org/r/20180416090946.63057-3-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07zram: correct flag name of ZRAM_ACCESSMinchan Kim
Patch series "zram memory tracking", v5. zRam as swap is useful for small memory device. However, swap means those pages on zram are mostly cold pages due to VM's LRU algorithm. Especially, once init data for application are touched for launching, they tend to be not accessed any more and finally swapped out. zRAM can store such cold pages as compressed form but it's pointless to keep in memory. As well, it's pointless to store incompressible pages to zram so better idea is app developers manages them directly like free or mlock rather than remaining them on heap. This patch provides a debugfs /sys/kernel/debug/zram/zram0/block_state to represent each block's state so admin can investigate what memory is cold|incompressible|same page with using pagemap once the pages are swapped out. The output is as follows: 300 75.033841 .wh 301 63.806904 s.. 302 63.806919 ..h First column is zram's block index and 3rh one represents symbol (s: same page w: written page to backing store h: huge page) of the block state. Second column represents usec time unit of the block was last accessed. So above example means the 300th block is accessed at 75.033851 second and it was huge so it was written to the backing store. This patch (of 4): ZRAM_ACCESS is used for locking a slot of zram so correct the name. It is also not a common flag to indicate status of the block so move the declare position on top of the flag. Lastly, let's move the function to the top of source code to be able to use it easily without forward declaration. Link: http://lkml.kernel.org/r/20180416090946.63057-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm, memcontrol: implement memory.swap.eventsTejun Heo
Add swap max and fail events so that userland can monitor and respond to running out of swap. I'm not too sure about the fail event. Right now, it's a bit confusing which stats / events are recursive and which aren't and also which ones reflect events which originate from a given cgroup and which targets the cgroup. No idea what the right long term solution is and it could just be that growing them organically is actually the only right thing to do. Link: http://lkml.kernel.org/r/20180416231151.GI1911913@devbig577.frc2.facebook.com Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm, memcontrol: move swap charge handling into get_swap_page()Tejun Heo
Patch series "mm, memcontrol: Implement memory.swap.events", v2. This patchset implements memory.swap.events which contains max and fail events so that userland can monitor and respond to swap running out. This patch (of 2): get_swap_page() is always followed by mem_cgroup_try_charge_swap(). This patch moves mem_cgroup_try_charge_swap() into get_swap_page() and makes get_swap_page() call the function even after swap allocation failure. This simplifies the callers and consolidates memcg related logic and will ease adding swap related memcg events. Link: http://lkml.kernel.org/r/20180416230934.GH1911913@devbig577.frc2.facebook.com Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_structYang Shi
mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start|end and evn_start|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 proc_pid_cmdline_read+0xd9/0x4e0 __vfs_read+0x37/0x150 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start|end, env_start|end and others, as well as replace write map_sem to read to protect the race condition between prctl and sys_brk which might break check_data_rlimit(), and makes prctl more friendly to other VM operations. This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the future. [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock] Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07slab: clean up the code comment in slab kmem_cache structBaoquan He
In commit 3b0efdfa1e7 ("mm, sl[aou]b: Extract common fields from struct kmem_cache") the variable 'obj_size' was moved above, however the related code comment is not updated accordingly. Do it here. Link: http://lkml.kernel.org/r/20180603032402.27526-1-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm/slub: remove obsolete commentCanjiang Lu
The obsolete comment removed in this patch was introduced by 51df1142816e4 ("slub: Dynamically size kmalloc cache allocations"). I paste related modification from that commit: +#ifdef CONFIG_NUMA + /* + * Allocate kmem_cache_node properly from the kmem_cache slab. + * kmem_cache_node is separately allocated so no need to + * update any list pointers. + */ + temp_kmem_cache_node = kmem_cache_node; + kmem_cache_node = kmem_cache_alloc(kmem_cache, GFP_NOWAIT); + memcpy(kmem_cache_node, temp_kmem_cache_node, kmem_size); + + kmem_cache_bootstrap_fixup(kmem_cache_node); + + caches++; +#else + /* + * kmem_cache has kmem_cache_node embedded and we moved it! + * Update the list heads + */ + INIT_LIST_HEAD(&kmem_cache->local_node.partial); + list_splice(&temp_kmem_cache->local_node.partial, &kmem_cache->local_node.partial); +#ifdef CONFIG_SLUB_DEBUG + INIT_LIST_HEAD(&kmem_cache->local_node.full); + list_splice(&temp_kmem_cache->local_node.full, &kmem_cache->local_node.full); +#endif As we can see there're used to distinguish the difference handling between NUMA/non-NUMA configuration in the original commit. I think it doesn't make any sense in current implementation which is placed above kmem_cache_node = bootstrap(&boot_kmem_cache_node); So maybe it's better to remove them now? Link: http://lkml.kernel.org/r/5af26f58.1c69fb81.1be0e.c520SMTPIN_ADDED_BROKEN@mx.google.com Signed-off-by: Canjiang Lu <canjiang.lu@samsung.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm/slub.c: add __printf verification to slab_err()Mathieu Malaterre
__printf is useful to verify format and arguments. Remove the following warning (with W=1): mm/slub.c:721:2: warning: function might be possible candidate for `gnu_printf' format attribute [-Wsuggest-attribute=format] Link: http://lkml.kernel.org/r/20180505200706.19986-1-malat@debian.org Signed-off-by: Mathieu Malaterre <malat@debian.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07slab: __GFP_ZERO is incompatible with a constructorMatthew Wilcox
__GFP_ZERO requests that the object be initialised to all-zeroes, while the purpose of a constructor is to initialise an object to a particular pattern. We cannot do both. Add a warning to catch any users who mistakenly pass a __GFP_ZERO flag when allocating a slab with a constructor. Link: http://lkml.kernel.org/r/20180412191322.GA21205@bombadil.infradead.org Fixes: d07dbea46405 ("Slab allocators: support __GFP_ZERO in all allocators") Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07net/9p/trans_xen.c: don't inclide rwlock.h directlySebastian Andrzej Siewior
rwlock.h should not be included directly. Instead linux/splinlock.h should be included. One thing it does is to break the RT build. Link: http://lkml.kernel.org/r/20180504100319.11880-1-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07fs/9p: detect invalid options as much as possibleChengguang Xu
Currently when detecting invalid options in option parsing, some options(e.g. msize) just set errno and allow to continuously validate other options so that it can detect invalid options as much as possible and give proper error messages together. This patch applies same rule to option 'cache' and 'access' when detecting -EINVAL. Link: http://lkml.kernel.org/r/1525340676-34072-2-git-send-email-cgxu519@gmx.com Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07net/9p: detect invalid options as much as possibleChengguang Xu
Currently when detecting invalid options in option parsing, some options(e.g. msize) just set errno and allow to continuously validate other options so that it can detect invalid options as much as possible and give proper error messages together. This patch applies same rule to option 'trans' and 'version' when detecting -EINVAL. Link: http://lkml.kernel.org/r/1525340676-34072-1-git-send-email-cgxu519@gmx.com Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07fs: ocfs2: use new return type vm_fault_tSouptick Joarder
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t") vmf_error() is the newly introduce inline function in 4.18. Fix one checkpatch.pl warning by replacing BUG_ON() with WARN_ON() [akpm@linux-foundation.org: undo BUG_ON->WARN_ON change] Link: http://lkml.kernel.org/r/20180523153258.GA28451@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07ocfs2: drop a VLA in ocfs2_orphan_del()Salvatore Mesoraca
Avoid a VLA by using a real constant expression instead of a variable. The compiler should be able to optimize the original code and avoid using an actual VLA. Anyway this change is useful because it will avoid a false positive with -Wvla, it might also help the compiler generating better code. Link: http://lkml.kernel.org/r/1520970710-19732-1-git-send-email-s.mesoraca16@gmail.com Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07ocfs2: correct the comments position of struct ocfs2_dir_block_trailerGuozhonghua
Correct the comments position of the structure ocfs2_dir_block_trailer. Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA401071C5FDE@H3CMLB12-EX.srv.huawei-3com.com Signed-off-by: guozhonghua <guozhonghua@h3c.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07ocfs2: eliminate a misreported warningZhen Lei
The warning is invalid because the parameter chunksize passed from ocfs2_info_freefrag_scan_chain-->ocfs2_info_update_ffg is guaranteed to be positive. So __ilog2_u32 cannot return -1. fs/ocfs2/ioctl.c: In function 'ocfs2_info_update_ffg': fs/ocfs2/ioctl.c:411:17: warning: array subscript is below array bounds [-Warray-bounds] hist->fc_chunks[index]++; ^ fs/ocfs2/ioctl.c:411:17: warning: array subscript is below array bounds [-Warray-bounds] Link: http://lkml.kernel.org/r/1524655799-12112-1-git-send-email-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07ocfs2: ocfs2_inode_lock_tracker does not distinguish lock levelLarry Chen
ocfs2_inode_lock_tracker as a variant of ocfs2_inode_lock, is used to prevent deadlock due to recursive lock acquisition. But this function does not distinguish whether the requested level is EX or PR. If a RP lock has been attained, this function will immediately return success afterwards even an EX lock is requested. But actually the return value does not mean that the process got a EX lock, because ocfs2_inode_lock has not been called. When taking lock levels into account, we face some different situations: 1. no lock is held In this case, just lock the inode and return 0 2. We are holding a lock For this situation, things diverges into several cases wanted holding what to do ex ex see 2.1 below ex pr see 2.2 below pr ex see 2.1 below pr pr see 2.1 below 2.1 lock level that is been held is compatible with the wanted level, so no lock action will be tacken. 2.2 Otherwise, an upgrade is needed, but it is forbidden. Reason why upgrade within a process is forbidden is that lock upgrade may cause dead lock. The following illustrate how it happens. process 1 process 2 ocfs2_inode_lock_tracker(ex=0) <====== ocfs2_inode_lock_tracker(ex=1) ocfs2_inode_lock_tracker(ex=1) For the status quo of ocfs2, without this patch, neither a bug nor end-user impact will be caused because the wrong logic is avoided. But I'm afraid this generic interface, may be called by other developers in future and used in this situation. a process ocfs2_inode_lock_tracker(ex=0) ocfs2_inode_lock_tracker(ex=1) Link: http://lkml.kernel.org/r/20180510053230.17217-1-lchen@suse.com Signed-off-by: Larry Chen <lchen@suse.com> Reviewed-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07ocfs2: clean up redundant function declarationsJia Guo
ocfs2_extend_allocation() has been deleted, clean up its declaration. Also change the static function name from __ocfs2_extend_allocation() to ocfs2_extend_allocation() to be consistent with the corresponding trace events as well as comments for ocfs2_lock_allocators(). Link: http://lkml.kernel.org/r/09cf7125-6f12-e53e-20f5-e606b2c16b48@huawei.com Fixes: 964f14a0d350 ("ocfs2: clean up some dead code") Signed-off-by: Jia Guo <guojia12@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07scripts: use SPDX tag in get_maintainer and checkpatchJoe Perches
Add the appropriate SPDX tag to these scripts. Miscellanea: o Add my copyright to checkpatch Link: http://lkml.kernel.org/r/d08e49e8f6562c58a63792aa64306d1851f81f4b.camel@perches.com Signed-off-by: Joe Perches <joe@perches.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07fs/dax.c: use new return type vm_fault_tSouptick Joarder
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit 1c8f422059ae ("mm: change return type to vm_fault_t") There was an existing bug inside dax_load_hole() if vm_insert_mixed had failed to allocate a page table, we'd return VM_FAULT_NOPAGE instead of VM_FAULT_OOM. With new vmf_insert_mixed() this issue is addressed. vm_insert_mixed_mkwrite has inefficiency when it returns an error value, driver has to convert it to vm_fault_t type. With new vmf_insert_mixed_mkwrite() this limitation will be addressed. Link: http://lkml.kernel.org/r/20180510181121.GA15239@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07bpfilter: fix race in pipe accessAlexei Starovoitov
syzbot reported the following crash [ 338.293946] bpfilter: read fail -512 [ 338.304515] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 338.311863] general protection fault: 0000 [#1] SMP KASAN [ 338.344360] RIP: 0010:__vfs_write+0x4a6/0x960 [ 338.426363] Call Trace: [ 338.456967] __kernel_write+0x10c/0x380 [ 338.460928] __bpfilter_process_sockopt+0x1d8/0x35b [ 338.487103] bpfilter_mbox_request+0x4d/0xb0 [ 338.491492] bpfilter_ip_get_sockopt+0x6b/0x90 This can happen when multiple cpus trying to talk to user mode process via bpfilter_mbox_request(). One cpu grabs the mutex while another goes to sleep on the same mutex. Then former cpu sees that umh pipe is down and shuts down the pipes. Later cpu finally acquires the mutex and crashes on freed pipe. Fix the race by using info.pid as an indicator that umh and pipes are healthy and check it after acquiring the mutex. Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module") Reported-by: syzbot+7ade6c94abb2774c0fee@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2018-06-08 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix in the BPF verifier to reject modified ctx pointers on helper functions, from Daniel. 2) Fix in BPF kselftests for get_cgroup_id_user() helper to only record the cgroup id for a provided pid in order to reduce test failures from processes interferring with the test, from Yonghong. 3) Fix a crash in AF_XDP's mem accounting when the process owning the sock has CAP_IPC_LOCK capabilities set, from Daniel. 4) Fix an issue for AF_XDP on 32 bit machines where XDP_UMEM_PGOFF_*_RING defines need ULL suffixes and use loff_t type as they are otherwise truncated, from Geert. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-07Merge branch 'next-smack' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull smack update from James Morris: "From Casey: One simple patch that fixes a memory leak in kernfs and labeled NFS" * 'next-smack' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: Smack: Fix memory leak in smack_inode_getsecctx
2018-06-07Merge branch 'next-tpm' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull TPM updates from James Morris: "From Jarkko: This purely a bug fix release. The only major change is to move the event log code to its own subdirectory because there starts to be so much of it" * 'next-tpm' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: tpm: fix race condition in tpm_common_write() tpm: reduce polling time to usecs for even finer granularity tpm: replace kmalloc() + memcpy() with kmemdup() tpm: replace kmalloc() + memcpy() with kmemdup() tpm: fix use after free in tpm2_load_context() tpm: reduce poll sleep time in tpm_transmit() tpm_tis: verify locality released before returning from release_locality tpm: tpm_crb: relinquish locality on error path. tpm/st33zp24: Fix spelling mistake in macro ST33ZP24_TISREGISTER_UKNOWN tpm: Move eventlog declarations to its own header tpm: Move shared eventlog functions to common.c tpm: Move eventlog files to a subdirectory tpm: Add explicit endianness cast tpm: st33zp24: remove redundant null check on chip tpm: move the delay_msec increment after sleep in tpm_transmit()
2018-06-07Merge branch 'next-integrity' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull integrity updates from James Morris: "From Mimi: - add run time support for specifying additional security xattrs included in the security.evm HMAC/signature - some code clean up and bug fixes" * 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: EVM: unlock on error path in evm_read_xattrs() EVM: prevent array underflow in evm_write_xattrs() EVM: Fix null dereference on xattr when xattr fails to allocate EVM: fix memory leak of temporary buffer 'temp' IMA: use list_splice_tail_init_rcu() instead of its open coded variant ima: use match_string() helper ima: fix updating the ima_appraise flag ima: based on policy verify firmware signatures (pre-allocated buffer) ima: define a new policy condition based on the filesystem name EVM: Allow runtime modification of the set of verified xattrs EVM: turn evm_config_xattrnames into a list integrity: Add an integrity directory in securityfs ima: Remove unused variable ima_initialized ima: Unify logging ima: Reflect correct permissions for policy
2018-06-07bpf, xdp: fix crash in xdp_umem_unaccount_pagesDaniel Borkmann
syzkaller was able to trigger the following panic for AF_XDP: BUG: KASAN: null-ptr-deref in atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline] BUG: KASAN: null-ptr-deref in atomic_long_sub include/asm-generic/atomic-long.h:199 [inline] BUG: KASAN: null-ptr-deref in xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135 Write of size 8 at addr 0000000000000060 by task syz-executor246/4527 CPU: 1 PID: 4527 Comm: syz-executor246 Not tainted 4.17.0+ #89 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 kasan_report_error mm/kasan/report.c:352 [inline] kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412 check_memory_region_inline mm/kasan/kasan.c:260 [inline] check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278 atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline] atomic_long_sub include/asm-generic/atomic-long.h:199 [inline] xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135 xdp_umem_reg net/xdp/xdp_umem.c:334 [inline] xdp_umem_create+0xd6c/0x10f0 net/xdp/xdp_umem.c:349 xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531 __sys_setsockopt+0x1bd/0x390 net/socket.c:1935 __do_sys_setsockopt net/socket.c:1946 [inline] __se_sys_setsockopt net/socket.c:1943 [inline] __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe In xdp_umem_reg() the call to xdp_umem_account_pages() passed with CAP_IPC_LOCK where we didn't need to end up charging rlimit on memlock for the current user and therefore umem->user continues to be NULL. Later on through fault injection syzkaller triggered a failure in either umem->pgs or umem->pages allocation such that we bail out and undo accounting in xdp_umem_unaccount_pages() where we eventually hit the panic since it tries to deref the umem->user. The code is pretty close to mm_account_pinned_pages() and mm_unaccount_pinned_pages() pair and potentially could reuse it even in a later cleanup, and it appears that the initial commit c0c77d8fb787 ("xsk: add user memory registration support sockopt") got this right while later follow-up introduced the bug via a49049ea2576 ("xsk: simplified umem setup"). Fixes: a49049ea2576 ("xsk: simplified umem setup") Reported-by: syzbot+979217770b09ebf5c407@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-06-08xsk: Fix umem fill/completion queue mmap on 32-bitGeert Uytterhoeven
With gcc-4.1.2 on 32-bit: net/xdp/xsk.c:663: warning: integer constant is too large for ‘long’ type net/xdp/xsk.c:665: warning: integer constant is too large for ‘long’ type Add the missing "ULL" suffixes to the large XDP_UMEM_PGOFF_*_RING values to fix this. net/xdp/xsk.c:663: warning: comparison is always false due to limited range of data type net/xdp/xsk.c:665: warning: comparison is always false due to limited range of data type "unsigned long" is 32-bit on 32-bit systems, hence the offset is truncated, and can never be equal to any of the XDP_UMEM_PGOFF_*_RING values. Use loff_t (and the required cast) to fix this. Fixes: 423f38329d267969 ("xsk: add umem fill queue support and mmap") Fixes: fe2308328cd2f26e ("xsk: add umem completion queue support and mmap") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-08tools/bpf: fix selftest get_cgroup_id_userYonghong Song
Commit f269099a7e7a ("tools/bpf: add a selftest for bpf_get_current_cgroup_id() helper") added a test for bpf_get_current_cgroup_id() helper. The bpf program is attached to tracepoint syscalls/sys_enter_nanosleep and will record the cgroup id if the tracepoint is hit. The test program creates a cgroup and attachs itself to this cgroup and expects that the test program process cgroup id is the same as the cgroup_id retrieved by the bpf program. In a light system where no other processes called nanosleep syscall, the test case can pass. In a busy system where many different processes can hit syscalls/sys_enter_nanosleep tracepoint, the cgroup id recorded by bpf program may not match the test program process cgroup_id. This patch fixed an issue by communicating the test program pid to bpf program. The bpf program only records cgroup id if the current task pid is the same as passed-in pid. This ensures that the recorded cgroup_id is for the cgroup within which the test program resides. Fixes: f269099a7e7a ("tools/bpf: add a selftest for bpf_get_current_cgroup_id() helper") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-06-07Merge tag 'devicetree-for-4.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull DeviceTree updates from Rob Herring: - Sync dtc with upstream version v1.4.6-21-g84e414b0b5bc. This adds new warnings which are either fixed or disabled by default (enabled with W=1). - Validate an untrusted offset in DT overlay function update_usages_of_a_phandle_reference - Fix a use after free error of_platform_device_destroy - Fix an off by 1 string errors in unittest - Avoid creating a struct device for OPP nodes - Update DT specific submitting-patches.txt with patch content and subject requirements. - Move some bindings to their proper subsystem locations - Add vendor prefixes for Kaohsiung, SiFive, Avnet, Wi2Wi, Logic PD, and ArcherMind - Add documentation for "no-gpio-delays" property in FSI bus GPIO master - Add compatible for r8a77990 SoC ravb ethernet block - More wack-a-mole removal of 'status' property in examples * tag 'devicetree-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (25 commits) dt-bindings: submitting-patches: add guidance on patch content and subject of: platform: stop accessing invalid dev in of_platform_device_destroy dt-bindings: net: ravb: Add support for r8a77990 SoC dt-bindings: Add vendor prefix for ArcherMind dt-bindings: fsi-master-gpio: Document "no-gpio-delays" property dt-bindings: Add vendor prefix for Logic PD of: overlay: validate offset from property fixups of: unittest: for strings, account for trailing \0 in property length field drm: rcar-du: disable dtc graph-endpoint warnings on DT overlays kbuild: disable new dtc graph and unit-address warnings scripts/dtc: Update to upstream version v1.4.6-21-g84e414b0b5bc MAINTAINERS: add keyword for devicetree overlay notifiers dt-bindings: define vendor prefix for Wi2Wi, Inc. dt-bindings: Add vendor prefix for Avnet, Inc. dt-bindings: Relocate Tegra20 memory controller bindings dt-bindings: Add "sifive" vendor prefix dt-bindings: exynos: move ADC binding to iio/adc/ directory dt-bindings: powerpc/4xx: move 4xx NDFC and EMAC bindings to subsystem directories dt-bindings: move various RNG bindings to rng/ directory dt-bindings: move various timer bindings to timer/ directory ...
2018-06-07Merge tag 'auxdisplay-for-linus-v4.18-rc1' of git://github.com/ojeda/linuxLinus Torvalds
Pull auxdisplay updates from Miguel Ojeda: "Mostly small fixes and cleanups, plus a non-trivial fix for charlcd - charlcd: fixes and cleanups (Robert Abel and Sean Young) - Kconfig fixes (Randy Dunlap, Corentin Labbe and Ulf Magnusson) - cfag12864bfb: const cleanup (Gustavo A. R. Silva) - Docs/licenses/warnings cleanups" * tag 'auxdisplay-for-linus-v4.18-rc1' of git://github.com/ojeda/linux: auxdisplay: Replace licenses with SPDX identifiers auxdisplay: make PANEL a menuconfig auxdisplay: fix broken menu auxdisplay: charlcd: Fix and clean up handling of x/y commands auxdisplay: charlcd: fix hex literal ranges for graphics command auxdisplay: charlcd: fix two-line command ^[[LN not marked as processed auxdisplay: charlcd: replace octal literal with form-feed escape sequence auxdisplay: charlcd: use null character instead of zero literal to terminate strings auxdisplay: charlcd: no need to call charlcd_gotoxy() if nothing changes auxdisplay: cfag12864bfb: constify fb_fix_screeninfo and fb_var_screeninfo structures auxdisplay: img-ascii-lcd: fix typo on select SYSCON/MFD_SYSCON auxdisplay: img-ascii-lcd: kconfig: Remove MIPS_SEAD3 reference auxdisplay: arm-charlcd: Fix struct charlcd doc line MAINTAINERS: auxdisplay: remove obsolete webpages Doc: misc-devices: move lcd-panel-cgram.txt to auxdisplay/
2018-06-07bpfilter: fix OUTPUT_FORMATAlexei Starovoitov
CONFIG_OUTPUT_FORMAT is x86 only macro. Used objdump to extract elf file format. Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module") Reported-by: David S. Miller <davem@davemloft.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-07Merge tag 'pinctrl-v4.18-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control updates from Linus Walleij: "This is the bulk of pin control changes for v4.18. No core changes this time! Just a calm all-over-the-place drivers, updates and fixes cycle as it seems. New drivers/subdrivers: - Actions Semiconductor S900 driver with more Actions variants for S700, S500 in the pipe. Also generic GPIO support on top of the same driver and IRQ support is in the pipe. - Renesas r8a77470 PFC support. - Renesas r8a77990 PFC support. - Allwinner Sunxi H6 R_PIO support. - Rockchip PX30 support. - Meson Meson8m2 support. - Remove support for the ill-fated Samsung Exynos 5440 SoC. Improvements: - Context save/restore support in pinctrl-single. - External interrupt support for the Mediatek MT7622. - Qualcomm ACPI HID QCOM8002 supported. Fixes: - Fix up suspend/resume support for Exynos 5433. - Fix Strago DMI fixes on the Intel Cherryview" * tag 'pinctrl-v4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (72 commits) pinctrl: cherryview: limit Strago DMI workarounds to version 1.0 pinctrl: at91-pio4: add missing of_node_put pinctrl: armada-37xx: Fix spurious irq management gpiolib: discourage gpiochip_add_pin[group]_range for DT pinctrls pinctrl: msm: fix gpio-hog related boot issues MAINTAINERS: update entry for Mediatek pin controller pinctrl: mediatek: remove unused fields in struct mtk_eint_hw pinctrl: mediatek: use generic EINT register maps for each SoC pinctrl: mediatek: add EINT support to MT7622 SoC pinctrl: mediatek: refactor EINT related code for all MediaTek pinctrl can fit dt-bindings: pinctrl: add external interrupt support to MT7622 pinctrl pinctrl: freescale: Switch to SPDX identifier pinctrl: samsung: Fix suspend/resume for Exynos5433 GPF1..5 banks pinctrl: sh-pfc: rcar-gen3: Fix grammar in static pin comments pinctrl: sh-pfc: r8a77965: Add I2C pin support pinctrl: sh-pfc: r8a77990: Add EthernetAVB pins, groups and functions pinctrl: sh-pfc: r8a77990: Add I2C{1,2,4,5,6,7} pins, groups and functions pinctrl: sh-pfc: r8a77990: Add SCIF pins, groups and functions pinctrl: sh-pfc: r8a77990: Add bias pinconf support pinctrl: sh-pfc: Initial R8A77990 PFC support ...
2018-06-07umh: fix race conditionAlexei Starovoitov
kasan reported use-after-free: BUG: KASAN: use-after-free in call_usermodehelper_exec_work+0x2d3/0x310 kernel/umh.c:195 Write of size 4 at addr ffff8801d9202370 by task kworker/u4:2/50 Workqueue: events_unbound call_usermodehelper_exec_work Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412 __asan_report_store4_noabort+0x17/0x20 mm/kasan/report.c:437 call_usermodehelper_exec_work+0x2d3/0x310 kernel/umh.c:195 process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145 worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279 kthread+0x345/0x410 kernel/kthread.c:240 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412 The reason is that 'sub_info' cannot be accessed out of parent task context, since it will be freed by the child. Instead remember the pid in the child task. Fixes: 449325b52b7a ("umh: introduce fork_usermode_blob() helper") Reported-by: syzbot+2c73319c406f1987d156@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-07net: mscc: ocelot: Fix uninitialized error in ocelot_netdevice_event()Geert Uytterhoeven
With gcc-4.1.2: drivers/net/ethernet/mscc/ocelot.c: In function ‘ocelot_netdevice_event’: drivers/net/ethernet/mscc/ocelot.c:1129: warning: ‘ret’ may be used uninitialized in this function If the list iterated over by netdev_for_each_lower_dev() is empty, ret is never initialized, and converted into a notifier return value. Fix this by preinitializing ret to zero. Fixes: a556c76adc052c97 ("net: mscc: Add initial Ocelot switch support") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-07Merge tag 'spi-nor/for-4.18' of git://git.infradead.org/linux-mtd into mtd/nextBoris Brezillon
Core changes: - Add support for a bunch of SPI NOR chips - Clear EAR reg when switching to 3-byte addressing mode on Winbond chips SPI NOR controller driver changes: - cadence: Add DMA support for direct mode reads - hisi: Prefix a few functions with hisi_ - intel: * Mark the driver as "dangerous" in Kconfig * Fix atomic sequence handling * Pass a 40us delay (instead of 0us) to readl_poll_timeout() - fsl: * fix a typo in a function name * add support for IP variants embedded in the ls2080a and ls1080a SoCs - stm32: request exclusive control of the reset line
2018-06-07Merge tag 'nand/for-4.18' of git://git.infradead.org/linux-mtd into mtd/nextBoris Brezillon
Core changes: - Add Miquel as a NAND maintainer - Add access mode to the nand_page_io_req struct - Fix kernel-doc in rawnand.h - Support bit-wise majority to recover from corrupted ONFI parameter pages - Stop checking FAIL bit after a SET_FEATURES, as documented in the ONFI spec Raw NAND Driver changes: - Fix and cleanup the error path of many NAND controller drivers - GPMI: * Cleanup/simplification of a few aspects in the driver * Take ECC setup specified in the DT into account - sunxi: remove support for GPIO-based R/B polling - MTK: * Use of_device_get_match_data() instead of of_match_device() * Add an entry in MAINTAINERS for this driver * Fix nand-ecc-step-size and nand-ecc-strength description in the DT bindings doc - fsl_ifc: fix ->cmdfunc() to read more than one ONFI parameter page OneNAND driver changes: - samsung: use dev_get_drvdata() instead of platform_get_drvdata()
2018-06-07bonding: re-evaluate force_primary when the primary slave name changesXiangning Yu
There is a timing issue under active-standy mode, when bond_enslave() is called, bond->params.primary might not be initialized yet. Any time the primary slave string changes, bond->force_primary should be set to true to make sure the primary becomes the active slave. Signed-off-by: Xiangning Yu <yuxiangning@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>