summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-11-06mm/smaps: simplify shmem handling of pte holesPeter Xu
Firstly, check_shmem_swap variable is actually not necessary, because it's always set with pte_hole hook; checking each would work. Meanwhile, the check within smaps_pte_entry is not easy to follow. E.g., pte_none() check is not needed as "!pte_present && !is_swap_pte" is the same. Since at it, use the pte_hole() helper rather than dup the page cache lookup. Still keep the CONFIG_SHMEM part so the code can be optimized to nop for !SHMEM. There will be a very slight functional change in smaps_pte_entry(), that for !SHMEM we'll return early for pte_none (before checking page==NULL), but that's even nicer. Link: https://lkml.kernel.org/r/20210917164756.8586-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm/smaps: use vma->vm_pgoff directly when counting partial swapPeter Xu
As it's trying to cover the whole vma anyways, use direct vm_pgoff value and vma_pages() rather than linear_page_index. Link: https://lkml.kernel.org/r/20210917164756.8586-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm/smaps: fix shmem pte hole swap calculationPeter Xu
Patch series "mm/smaps: Fixes and optimizations on shmem swap handling". This patch (of 3): The shmem swap calculation on the privately writable mappings are using wrong parameters as spotted by Vlastimil. Fix them. This was introduced in commit 48131e03ca4e ("mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings"), when shmem_swap_usage was reworked to shmem_partial_swap_usage. Test program: void main(void) { char *buffer, *p; int i, fd; fd = memfd_create("test", 0); assert(fd > 0); /* isize==2M*3, fill in pages, swap them out */ ftruncate(fd, SIZE_2M * 3); buffer = mmap(NULL, SIZE_2M * 3, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); assert(buffer); for (i = 0, p = buffer; i < SIZE_2M * 3 / 4096; i++) { *p = 1; p += 4096; } madvise(buffer, SIZE_2M * 3, MADV_PAGEOUT); munmap(buffer, SIZE_2M * 3); /* * Remap with private+writtable mappings on partial of the inode (<= 2M*3), * while the size must also be >= 2M*2 to make sure there's a none pmd so * smaps_pte_hole will be triggered. */ buffer = mmap(NULL, SIZE_2M * 2, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0); printf("pid=%d, buffer=%p\n", getpid(), buffer); /* Check /proc/$PID/smap_rollup, should see 4MB swap */ sleep(1000000); } Before the patch, smaps_rollup shows <4MB swap and the number will be random depending on the alignment of the buffer of mmap() allocated. After this patch, it'll show 4MB. Link: https://lkml.kernel.org/r/20210917164756.8586-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210917164756.8586-2-peterx@redhat.com Fixes: 48131e03ca4e ("mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06kasan: test: add memcpy test that avoids out-of-bounds writePeter Collingbourne
With HW tag-based KASAN, error checks are performed implicitly by the load and store instructions in the memcpy implementation. A failed check results in tag checks being disabled and execution will keep going. As a result, under HW tag-based KASAN, prior to commit 1b0668be62cf ("kasan: test: disable kmalloc_memmove_invalid_size for HW_TAGS"), this memcpy would end up corrupting memory until it hits an inaccessible page and causes a kernel panic. This is a pre-existing issue that was revealed by commit 285133040e6c ("arm64: Import latest memcpy()/memmove() implementation") which changed the memcpy implementation from using signed comparisons (incorrectly, resulting in the memcpy being terminated early for negative sizes) to using unsigned comparisons. It is unclear how this could be handled by memcpy itself in a reasonable way. One possibility would be to add an exception handler that would force memcpy to return if a tag check fault is detected -- this would make the behavior roughly similar to generic and SW tag-based KASAN. However, this wouldn't solve the problem for asynchronous mode and also makes memcpy behavior inconsistent with manually copying data. This test was added as a part of a series that taught KASAN to detect negative sizes in memory operations, see commit 8cceeff48f23 ("kasan: detect negative size in memory operation function"). Therefore we should keep testing for negative sizes with generic and SW tag-based KASAN. But there is some value in testing small memcpy overflows, so let's add another test with memcpy that does not destabilize the kernel by performing out-of-bounds writes, and run it in all modes. Link: https://linux-review.googlesource.com/id/I048d1e6a9aff766c4a53f989fb0c83de68923882 Link: https://lkml.kernel.org/r/20210910211356.3603758-1-pcc@google.com Signed-off-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: Marco Elver <elver@google.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06kasan: fix tag for large allocations when using CONFIG_SLABMatthew Wilcox (Oracle)
If an object is allocated on a tail page of a multi-page slab, kasan will get the wrong tag because page->s_mem is NULL for tail pages. I'm not quite sure what the user-visible effect of this might be. Link: https://lkml.kernel.org/r/20211001024105.3217339-1-willy@infradead.org Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06workqueue, kasan: avoid alloc_pages() when recording stackMarco Elver
Shuah Khan reported: | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled, | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when | it tries to allocate memory attempting to acquire spinlock in page | allocation code while holding workqueue pool raw_spinlock. | | There are several instances of this problem when block layer tries | to __queue_work(). Call trace from one of these instances is below: | | kblockd_mod_delayed_work_on() | mod_delayed_work_on() | __queue_delayed_work() | __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held) | insert_work() | kasan_record_aux_stack() | kasan_save_stack() | stack_depot_save() | alloc_pages() | __alloc_pages() | get_page_from_freelist() | rm_queue() | rm_queue_pcplist() | local_lock_irqsave(&pagesets.lock, flags); | [ BUG: Invalid wait context triggered ] The default kasan_record_aux_stack() calls stack_depot_save() with GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...). In general, however, it is not even possible to use either GFP_ATOMIC nor GFP_NOWAIT in certain non-preemptive contexts, including raw_spin_locks (see gfp.h and commmit ab00db216c9c7). Fix it by instructing stackdepot to not expand stack storage via alloc_pages() in case it runs out by using kasan_record_aux_stack_noalloc(). While there is an increased risk of failing to insert the stack trace, this is typically unlikely, especially if the same insertion had already succeeded previously (stack depot hit). For frequent calls from the same location, it therefore becomes extremely unlikely that kasan_record_aux_stack_noalloc() fails. Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org Link: https://lkml.kernel.org/r/20210913112609.2651084-7-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reported-by: Shuah Khan <skhan@linuxfoundation.org> Tested-by: Shuah Khan <skhan@linuxfoundation.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Taras Madan <tarasmadan@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Walter Wu <walter-zh.wu@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06kasan: generic: introduce kasan_record_aux_stack_noalloc()Marco Elver
Introduce a variant of kasan_record_aux_stack() that does not do any memory allocation through stackdepot. This will permit using it in contexts that cannot allocate any memory. Link: https://lkml.kernel.org/r/20210913112609.2651084-6-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Tested-by: Shuah Khan <skhan@linuxfoundation.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Taras Madan <tarasmadan@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Walter Wu <walter-zh.wu@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06kasan: common: provide can_alloc in kasan_save_stack()Marco Elver
Add another argument, can_alloc, to kasan_save_stack() which is passed as-is to __stack_depot_save(). No functional change intended. Link: https://lkml.kernel.org/r/20210913112609.2651084-5-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Tested-by: Shuah Khan <skhan@linuxfoundation.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Taras Madan <tarasmadan@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Walter Wu <walter-zh.wu@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06lib/stackdepot: introduce __stack_depot_save()Marco Elver
Add __stack_depot_save(), which provides more fine-grained control over stackdepot's memory allocation behaviour, in case stackdepot runs out of "stack slabs". Normally stackdepot uses alloc_pages() in case it runs out of space; passing can_alloc==false to __stack_depot_save() prohibits this, at the cost of more likely failure to record a stack trace. Link: https://lkml.kernel.org/r/20210913112609.2651084-4-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Tested-by: Shuah Khan <skhan@linuxfoundation.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Taras Madan <tarasmadan@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Walter Wu <walter-zh.wu@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06lib/stackdepot: remove unused function argumentMarco Elver
alloc_flags in depot_alloc_stack() is no longer used; remove it. Link: https://lkml.kernel.org/r/20210913112609.2651084-3-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Tested-by: Shuah Khan <skhan@linuxfoundation.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Taras Madan <tarasmadan@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Walter Wu <walter-zh.wu@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06lib/stackdepot: include gfp.hMarco Elver
Patch series "stackdepot, kasan, workqueue: Avoid expanding stackdepot slabs when holding raw_spin_lock", v2. Shuah Khan reported [1]: | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled, | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when | it tries to allocate memory attempting to acquire spinlock in page | allocation code while holding workqueue pool raw_spinlock. | | There are several instances of this problem when block layer tries | to __queue_work(). Call trace from one of these instances is below: | | kblockd_mod_delayed_work_on() | mod_delayed_work_on() | __queue_delayed_work() | __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held) | insert_work() | kasan_record_aux_stack() | kasan_save_stack() | stack_depot_save() | alloc_pages() | __alloc_pages() | get_page_from_freelist() | rm_queue() | rm_queue_pcplist() | local_lock_irqsave(&pagesets.lock, flags); | [ BUG: Invalid wait context triggered ] PROVE_RAW_LOCK_NESTING is pointing out that (on RT kernels) the locking rules are being violated. More generally, memory is being allocated from a non-preemptive context (raw_spin_lock'd c-s) where it is not allowed. To properly fix this, we must prevent stackdepot from replenishing its "stack slab" pool if memory allocations cannot be done in the current context: it's a bug to use either GFP_ATOMIC nor GFP_NOWAIT in certain non-preemptive contexts, including raw_spin_locks (see gfp.h and commit ab00db216c9c7). The only downside is that saving a stack trace may fail if: stackdepot runs out of space AND the same stack trace has not been recorded before. I expect this to be unlikely, and a simple experiment (boot the kernel) didn't result in any failure to record stack trace from insert_work(). The series includes a few minor fixes to stackdepot that I noticed in preparing the series. It then introduces __stack_depot_save(), which exposes the option to force stackdepot to not allocate any memory. Finally, KASAN is changed to use the new stackdepot interface and provide kasan_record_aux_stack_noalloc(), which is then used by workqueue code. [1] https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org This patch (of 6): <linux/stackdepot.h> refers to gfp_t, but doesn't include gfp.h. Fix it by including <linux/gfp.h>. Link: https://lkml.kernel.org/r/20210913112609.2651084-1-elver@google.com Link: https://lkml.kernel.org/r/20210913112609.2651084-2-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Tested-by: Shuah Khan <skhan@linuxfoundation.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Walter Wu <walter-zh.wu@mediatek.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: Taras Madan <tarasmadan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm: don't include <linux/dax.h> in <linux/mempolicy.h>Christoph Hellwig
Not required at all, and having this causes a huge kernel rebuild as soon as something in dax.h changes. Link: https://lkml.kernel.org/r/20210921082253.1859794-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm: disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on ↵Sebastian Andrzej Siewior
PREEMPT_RT TRANSPARENT_HUGEPAGE: There are potential non-deterministic delays to an RT thread if a critical memory region is not THP-aligned and a non-RT buffer is located in the same hugepage-aligned region. It's also possible for an unrelated thread to migrate pages belonging to an RT task incurring unexpected page faults due to memory defragmentation even if khugepaged is disabled. Regular HUGEPAGEs are not affected by this can be used. NUMA_BALANCING: There is a non-deterministic delay to mark PTEs PROT_NONE to gather NUMA fault samples, increased page faults of regions even if mlocked and non-deterministic delays when migrating pages. [Mel Gorman worded 99% of the commit description]. Link: https://lore.kernel.org/all/20200304091159.GN3818@techsingularity.net/ Link: https://lore.kernel.org/all/20211026165100.ahz5bkx44lrrw5pt@linutronix.de/ Link: https://lkml.kernel.org/r/20211028143327.hfbxjze7palrpfgp@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm, slub: use prefetchw instead of prefetchHyeonggon Yoo
Commit 0ad9500e16fe ("slub: prefetch next freelist pointer in slab_alloc()") introduced prefetch_freepointer() because when other cpu(s) freed objects into a page that current cpu owns, the freelist link is hot on cpu(s) which freed objects and possibly very cold on current cpu. But if freelist link chain is hot on cpu(s) which freed objects, it's better to invalidate that chain because they're not going to access again within a short time. So use prefetchw instead of prefetch. On supported architectures like x86 and arm, it invalidates other copied instances of a cache line when prefetching it. Before: Time: 91.677 Performance counter stats for 'hackbench -g 100 -l 10000': 1462938.07 msec cpu-clock # 15.908 CPUs utilized 18072550 context-switches # 12.354 K/sec 1018814 cpu-migrations # 696.416 /sec 104558 page-faults # 71.471 /sec 1580035699271 cycles # 1.080 GHz (54.51%) 2003670016013 instructions # 1.27 insn per cycle (54.31%) 5702204863 branch-misses (54.28%) 643368500985 cache-references # 439.778 M/sec (54.26%) 18475582235 cache-misses # 2.872 % of all cache refs (54.28%) 642206796636 L1-dcache-loads # 438.984 M/sec (46.87%) 18215813147 L1-dcache-load-misses # 2.84% of all L1-dcache accesses (46.83%) 653842996501 dTLB-loads # 446.938 M/sec (46.63%) 3227179675 dTLB-load-misses # 0.49% of all dTLB cache accesses (46.85%) 537531951350 iTLB-loads # 367.433 M/sec (54.33%) 114750630 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.37%) 630135543177 L1-icache-loads # 430.733 M/sec (46.80%) 22923237620 L1-icache-load-misses # 3.64% of all L1-icache accesses (46.76%) 91.964452802 seconds time elapsed 43.416742000 seconds user 1422.441123000 seconds sys After: Time: 90.220 Performance counter stats for 'hackbench -g 100 -l 10000': 1437418.48 msec cpu-clock # 15.880 CPUs utilized 17694068 context-switches # 12.310 K/sec 958257 cpu-migrations # 666.651 /sec 100604 page-faults # 69.989 /sec 1583259429428 cycles # 1.101 GHz (54.57%) 2004002484935 instructions # 1.27 insn per cycle (54.37%) 5594202389 branch-misses (54.36%) 643113574524 cache-references # 447.409 M/sec (54.39%) 18233791870 cache-misses # 2.835 % of all cache refs (54.37%) 640205852062 L1-dcache-loads # 445.386 M/sec (46.75%) 17968160377 L1-dcache-load-misses # 2.81% of all L1-dcache accesses (46.79%) 651747432274 dTLB-loads # 453.415 M/sec (46.59%) 3127124271 dTLB-load-misses # 0.48% of all dTLB cache accesses (46.75%) 535395273064 iTLB-loads # 372.470 M/sec (54.38%) 113500056 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.35%) 628871845924 L1-icache-loads # 437.501 M/sec (46.80%) 22585641203 L1-icache-load-misses # 3.59% of all L1-icache accesses (46.79%) 90.514819303 seconds time elapsed 43.877656000 seconds user 1397.176001000 seconds sys Link: https://lkml.org/lkml/2021/10/8/598=20 Link: https://lkml.kernel.org/r/20211011144331.70084-1-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm/slub: increase default cpu partial list sizesVlastimil Babka
The defaults are determined based on object size and can go up to 30 for objects smaller than 256 bytes. Before the previous patch changed the accounting, this could have made cpu partial list contain up to 30 pages. After that patch, only up to 2 pages with default allocation order. Very short lists limit the usefulness of the whole concept of cpu partial lists, so this patch aims at a more reasonable default under the new accounting. The defaults are quadrupled, except for object size >= PAGE_SIZE where it's doubled. This makes the lists grow up to 10 pages in practice. A quick test of booting a kernel under virtme with 4GB RAM and 8 vcpus shows the following slab memory usage after boot: Before previous patch (using page->pobjects): Slab: 36732 kB SReclaimable: 14836 kB SUnreclaim: 21896 kB After previous patch (using page->pages): Slab: 34720 kB SReclaimable: 13716 kB SUnreclaim: 21004 kB After this patch (using page->pages, higher defaults): Slab: 35252 kB SReclaimable: 13944 kB SUnreclaim: 21308 kB In the same setup, I also ran 5 times: hackbench -l 16000 -g 16 Differences in time were in the noise, we can compare slub stats as given by slabinfo -r skbuff_head_cache (the other cache heavily used by hackbench, kmalloc-cg-512 looks similar). Negligible stats left out for brevity. Before previous patch (using page->pobjects): Objects: 1408, Memory Total: 401408 Used : 304128 Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 469952498 5946606 91 1 Slowpath 42053573 506059465 8 98 Page Alloc 41093 41044 0 0 Add partial 18 21229327 0 4 Remove partial 20039522 36051 3 0 Cpu partial list 4686640 24767229 0 4 RemoteObj/SlabFrozen 16 124027841 0 24 Total 512006071 512006071 Flushes 18 Slab Deactivation Occurrences % ------------------------------------------------- Slab empty 4993 0% Deactivation bypass 24767229 99% Refilled from foreign frees 21972674 88% After previous patch (using page->pages): Objects: 480, Memory Total: 131072 Used : 103680 Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 473016294 5405653 92 1 Slowpath 38989777 506600418 7 98 Page Alloc 32717 32701 0 0 Add partial 3 22749164 0 4 Remove partial 11371127 32474 2 0 Cpu partial list 11686226 23090059 2 4 RemoteObj/SlabFrozen 2 67541803 0 13 Total 512006071 512006071 Flushes 3 Slab Deactivation Occurrences % ------------------------------------------------- Slab empty 227 0% Deactivation bypass 23090059 99% Refilled from foreign frees 27585695 119% After this patch (using page->pages, higher defaults): Objects: 896, Memory Total: 229376 Used : 193536 Slab Perf Counter Alloc Free %Al %Fr -------------------------------------------------- Fastpath 473799295 4980278 92 0 Slowpath 38206776 507025793 7 99 Page Alloc 32295 32267 0 0 Add partial 11 23291143 0 4 Remove partial 5815764 31278 1 0 Cpu partial list 18119280 23967320 3 4 RemoteObj/SlabFrozen 10 76974794 0 15 Total 512006071 512006071 Flushes 11 Slab Deactivation Occurrences % ------------------------------------------------- Slab empty 989 0% Deactivation bypass 23967320 99% Refilled from foreign frees 32358473 135% As expected, memory usage dropped significantly with change of accounting, increasing the defaults increased it, but not as much. The number of page allocation/frees dropped significantly with the new accounting, but didn't increase with the higher defaults. Interestingly, the number of fasthpath allocations increased, as well as allocations from the cpu partial list, even though it's shorter. Link: https://lkml.kernel.org/r/20211012134651.11258-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm, slub: change percpu partial accounting from objects to pagesVlastimil Babka
With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of partial slabs that can be promoted to cpu slab when the previous one is depleted, without accessing the shared partial list. A slab can be added to this list by 1) refill of an empty list from get_partial_node() - once we really have to access the shared partial list, we acquire multiple slabs to amortize the cost of locking, and 2) first free to a previously full slab - instead of putting the slab on a shared partial list, we can more cheaply freeze it and put it on the per-cpu list. To control how large a percpu partial list can grow for a kmem cache, set_cpu_partial() calculates a target number of free objects on each cpu's percpu partial list, and this can be also set by the sysfs file cpu_partial. However, the tracking of actual number of objects is imprecise, in order to limit overhead from cpu X freeing an objects to a slab on percpu partial list of cpu Y. Basically, the percpu partial slabs form a single linked list, and when we add a new slab to the list with current head "oldpage", we set in the struct page of the slab we're adding: page->pages = oldpage->pages + 1; // this is precise page->pobjects = oldpage->pobjects + (page->objects - page->inuse); page->next = oldpage; Thus the real number of free objects in the slab (objects - inuse) is only determined at the moment of adding the slab to the percpu partial list, and further freeing doesn't update the pobjects counter nor propagate it to the current list head. As Jann reports [1], this can easily lead to large inaccuracies, where the target number of objects (up to 30 by default) can translate to the same number of (empty) slab pages on the list. In case 2) above, we put a slab with 1 free object on the list, thus only increase page->pobjects by 1, even if there are subsequent frees on the same slab. Jann has noticed this in practice and so did we [2] when investigating significant increase of kmemcg usage after switching from SLAB to SLUB. While this is no longer a problem in kmemcg context thanks to the accounting rewrite in 5.9, the memory waste is still not ideal and it's questionable whether it makes sense to perform free object count based control when object counts can easily become so much inaccurate. So this patch converts the accounting to be based on number of pages only (which is precise) and removes the page->pobjects field completely. This is also ultimately simpler. To retain the existing set_cpu_partial() heuristic, first calculate the target number of objects as previously, but then convert it to target number of pages by assuming the pages will be half-filled on average. This assumption might obviously also be inaccurate in practice, but cannot degrade to actual number of pages being equal to the target number of objects. We could also skip the intermediate step with target number of objects and rewrite the heuristic in terms of pages. However we still have the sysfs file cpu_partial which uses number of objects and could break existing users if it suddenly becomes number of pages, so this patch doesn't do that. In practice, after this patch the heuristics limit the size of percpu partial list up to 2 pages. In case of a reported regression (which would mean some workload has benefited from the previous imprecise object based counting), we can tune the heuristics to get a better compromise within the new scheme, while still avoid the unexpectedly long percpu partial lists. [1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/ [2] https://lore.kernel.org/all/2f0f46e8-2535-410a-1859-e9cfa4e57c18@suse.cz/ ========== Evaluation ========== Mel was kind enough to run v1 through mmtests machinery for netperf (localhost) and hackbench and, for most significant results see below. So there are some apparent regressions, especially with hackbench, which I think ultimately boils down to having shorter percpu partial lists on average and some benchmarks benefiting from longer ones. Monitoring slab usage also indicated less memory usage by slab. Based on that, the following patch will bump the defaults to allow longer percpu partial lists than after this patch. However the goal is certainly not such that we would limit the percpu partial lists to 30 pages just because previously a specific alloc/free pattern could lead to the limit of 30 objects translate to a limit to 30 pages - that would make little sense. This is a correctness patch, and if a workload benefits from larger lists, the sysfs tuning knobs are still there to allow that. Netperf 2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM TCP-RR: hmean before 127045.79 after 121092.94 (-4.69%, worse) stddev before 2634.37 after 1254.08 UDP-RR: hmean before 166985.45 after 160668.94 ( -3.78%, worse) stddev before 4059.69 after 1943.63 2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM TCP-RR: hmean before 84173.25 after 76914.72 ( -8.62%, worse) UDP-RR: hmean before 93571.12 after 96428.69 ( 3.05%, better) stddev before 23118.54 after 16828.14 2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM TCP-RR: hmean before 49984.92 after 48922.27 ( -2.13%, worse) stddev before 6248.15 after 4740.51 UDP-RR: hmean before 61854.31 after 68761.81 ( 11.17%, better) stddev before 4093.54 after 5898.91 other machines - within 2% Hackbench (results before and after the patch, negative % means worse) 2-socket AMD EPYC 7713 (64 cores, 128 threads per core), 256GB RAM hackbench-process-sockets Amean 1 0.5380 0.5583 ( -3.78%) Amean 4 0.7510 0.8150 ( -8.52%) Amean 7 0.7930 0.9533 ( -20.22%) Amean 12 0.7853 1.1313 ( -44.06%) Amean 21 1.1520 1.4993 ( -30.15%) Amean 30 1.6223 1.9237 ( -18.57%) Amean 48 2.6767 2.9903 ( -11.72%) Amean 79 4.0257 5.1150 ( -27.06%) Amean 110 5.5193 7.4720 ( -35.38%) Amean 141 7.2207 9.9840 ( -38.27%) Amean 172 8.4770 12.1963 ( -43.88%) Amean 203 9.6473 14.3137 ( -48.37%) Amean 234 11.3960 18.7917 ( -64.90%) Amean 265 13.9627 22.4607 ( -60.86%) Amean 296 14.9163 26.0483 ( -74.63%) hackbench-thread-sockets Amean 1 0.5597 0.5877 ( -5.00%) Amean 4 0.7913 0.8960 ( -13.23%) Amean 7 0.8190 1.0017 ( -22.30%) Amean 12 0.9560 1.1727 ( -22.66%) Amean 21 1.7587 1.5660 ( 10.96%) Amean 30 2.4477 1.9807 ( 19.08%) Amean 48 3.4573 3.0630 ( 11.41%) Amean 79 4.7903 5.1733 ( -8.00%) Amean 110 6.1370 7.4220 ( -20.94%) Amean 141 7.5777 9.2617 ( -22.22%) Amean 172 9.2280 11.0907 ( -20.18%) Amean 203 10.2793 13.3470 ( -29.84%) Amean 234 11.2410 17.1070 ( -52.18%) Amean 265 12.5970 23.3323 ( -85.22%) Amean 296 17.1540 24.2857 ( -41.57%) 2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM hackbench-process-sockets Amean 1 0.5760 0.4793 ( 16.78%) Amean 4 0.9430 0.9707 ( -2.93%) Amean 7 1.5517 1.8843 ( -21.44%) Amean 12 2.4903 2.7267 ( -9.49%) Amean 21 3.9560 4.2877 ( -8.38%) Amean 30 5.4613 5.8343 ( -6.83%) Amean 48 8.5337 9.2937 ( -8.91%) Amean 79 14.0670 15.2630 ( -8.50%) Amean 110 19.2253 21.2467 ( -10.51%) Amean 141 23.7557 25.8550 ( -8.84%) Amean 172 28.4407 29.7603 ( -4.64%) Amean 203 33.3407 33.9927 ( -1.96%) Amean 234 38.3633 39.1150 ( -1.96%) Amean 265 43.4420 43.8470 ( -0.93%) Amean 296 48.3680 48.9300 ( -1.16%) hackbench-thread-sockets Amean 1 0.6080 0.6493 ( -6.80%) Amean 4 1.0000 1.0513 ( -5.13%) Amean 7 1.6607 2.0260 ( -22.00%) Amean 12 2.7637 2.9273 ( -5.92%) Amean 21 5.0613 4.5153 ( 10.79%) Amean 30 6.3340 6.1140 ( 3.47%) Amean 48 9.0567 9.5577 ( -5.53%) Amean 79 14.5657 15.7983 ( -8.46%) Amean 110 19.6213 21.6333 ( -10.25%) Amean 141 24.1563 26.2697 ( -8.75%) Amean 172 28.9687 30.2187 ( -4.32%) Amean 203 33.9763 34.6970 ( -2.12%) Amean 234 38.8647 39.3207 ( -1.17%) Amean 265 44.0813 44.1507 ( -0.16%) Amean 296 49.2040 49.4330 ( -0.47%) 2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM hackbench-process-sockets Amean 1 0.5027 0.5017 ( 0.20%) Amean 4 1.1053 1.2033 ( -8.87%) Amean 7 1.8760 2.1820 ( -16.31%) Amean 12 2.9053 3.1810 ( -9.49%) Amean 21 4.6777 4.9920 ( -6.72%) Amean 30 6.5180 6.7827 ( -4.06%) Amean 48 10.0710 10.5227 ( -4.48%) Amean 79 16.4250 17.5053 ( -6.58%) Amean 110 22.6203 24.4617 ( -8.14%) Amean 141 28.0967 31.0363 ( -10.46%) Amean 172 34.4030 36.9233 ( -7.33%) Amean 203 40.5933 43.0850 ( -6.14%) Amean 234 46.6477 48.7220 ( -4.45%) Amean 265 53.0530 53.9597 ( -1.71%) Amean 296 59.2760 59.9213 ( -1.09%) hackbench-thread-sockets Amean 1 0.5363 0.5330 ( 0.62%) Amean 4 1.1647 1.2157 ( -4.38%) Amean 7 1.9237 2.2833 ( -18.70%) Amean 12 2.9943 3.3110 ( -10.58%) Amean 21 4.9987 5.1880 ( -3.79%) Amean 30 6.7583 7.0043 ( -3.64%) Amean 48 10.4547 10.8353 ( -3.64%) Amean 79 16.6707 17.6790 ( -6.05%) Amean 110 22.8207 24.4403 ( -7.10%) Amean 141 28.7090 31.0533 ( -8.17%) Amean 172 34.9387 36.8260 ( -5.40%) Amean 203 41.1567 43.0450 ( -4.59%) Amean 234 47.3790 48.5307 ( -2.43%) Amean 265 53.9543 54.6987 ( -1.38%) Amean 296 60.0820 60.2163 ( -0.22%) 1-socket Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz (4 cores, 8 threads), 32 GB RAM hackbench-process-sockets Amean 1 1.4760 1.5773 ( -6.87%) Amean 3 3.9370 4.0910 ( -3.91%) Amean 5 6.6797 6.9357 ( -3.83%) Amean 7 9.3367 9.7150 ( -4.05%) Amean 12 15.7627 16.1400 ( -2.39%) Amean 18 23.5360 23.6890 ( -0.65%) Amean 24 31.0663 31.3137 ( -0.80%) Amean 30 38.7283 39.0037 ( -0.71%) Amean 32 41.3417 41.6097 ( -0.65%) hackbench-thread-sockets Amean 1 1.5250 1.6043 ( -5.20%) Amean 3 4.0897 4.2603 ( -4.17%) Amean 5 6.7760 7.0933 ( -4.68%) Amean 7 9.4817 9.9157 ( -4.58%) Amean 12 15.9610 16.3937 ( -2.71%) Amean 18 23.9543 24.3417 ( -1.62%) Amean 24 31.4400 31.7217 ( -0.90%) Amean 30 39.2457 39.5467 ( -0.77%) Amean 32 41.8267 42.1230 ( -0.71%) 2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM hackbench-process-sockets Amean 1 1.0347 1.0880 ( -5.15%) Amean 4 1.7267 1.8527 ( -7.30%) Amean 7 2.6707 2.8110 ( -5.25%) Amean 12 4.1617 4.3383 ( -4.25%) Amean 21 7.0070 7.2600 ( -3.61%) Amean 30 9.9187 10.2397 ( -3.24%) Amean 48 15.6710 16.3923 ( -4.60%) Amean 79 24.7743 26.1247 ( -5.45%) Amean 110 34.3000 35.9307 ( -4.75%) Amean 141 44.2043 44.8010 ( -1.35%) Amean 172 54.2430 54.7260 ( -0.89%) Amean 192 60.6557 60.9777 ( -0.53%) hackbench-thread-sockets Amean 1 1.0610 1.1353 ( -7.01%) Amean 4 1.7543 1.9140 ( -9.10%) Amean 7 2.7840 2.9573 ( -6.23%) Amean 12 4.3813 4.4937 ( -2.56%) Amean 21 7.3460 7.5350 ( -2.57%) Amean 30 10.2313 10.5190 ( -2.81%) Amean 48 15.9700 16.5940 ( -3.91%) Amean 79 25.3973 26.6637 ( -4.99%) Amean 110 35.1087 36.4797 ( -3.91%) Amean 141 45.8220 46.3053 ( -1.05%) Amean 172 55.4917 55.7320 ( -0.43%) Amean 192 62.7490 62.5410 ( 0.33%) Link: https://lkml.kernel.org/r/20211012134651.11258-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Jann Horn <jannh@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06slub: add back check for free nonslab objectsKefeng Wang
After commit f227f0faf63b ("slub: fix unreclaimable slab stat for bulk free"), the check for free nonslab page is replaced by VM_BUG_ON_PAGE, which only check with CONFIG_DEBUG_VM enabled, but this config may impact performance, so it only for debug. Commit 0937502af7c9 ("slub: Add check for kfree() of non slab objects.") add the ability, which should be needed in any configs to catch the invalid free, they even could be potential issue, eg, memory corruption, use after free and double free, so replace VM_BUG_ON_PAGE to WARN_ON_ONCE, add object address printing to help use to debug the issue. Link: https://lkml.kernel.org/r/20210930070214.61499-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rienjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm/slab.c: remove useless lines in enable_cpucache()Shi Lei
These lines are useless, so remove them. Link: https://lkml.kernel.org/r/20210930034845.2539-1-shi_lei@massclouds.com Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations") Signed-off-by: Shi Lei <shi_lei@massclouds.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm: move kvmalloc-related functions to slab.hMatthew Wilcox (Oracle)
Not all files in the kernel should include mm.h. Migrating callers from kmalloc to kvmalloc is easier if the kvmalloc functions are in slab.h. [akpm@linux-foundation.org: move the new kvrealloc() also] [akpm@linux-foundation.org: drivers/hwmon/occ/p9_sbe.c needs slab.h] Link: https://lkml.kernel.org/r/20210622215757.3525604-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06d_path: fix Kernel doc validator complainingJia He
Kernel doc validator complains: Function parameter or member 'p' not described in 'prepend_name' Excess function parameter 'buffer' description in 'prepend_name' Link: https://lkml.kernel.org/r/20211011005614.26189-1-justin.he@arm.com Fixes: ad08ae586586 ("d_path: introduce struct prepend_buffer") Signed-off-by: Jia He <justin.he@arm.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06fs/posix_acl.c: avoid -Wempty-body warningArnd Bergmann
The fallthrough comment for an ignored cmpxchg() return value produces a harmless warning with 'make W=1': fs/posix_acl.c: In function 'get_acl': fs/posix_acl.c:127:36: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 127 | /* fall through */ ; | ^ Simplify it as a step towards a clean W=1 build. As all architectures define cmpxchg() as a statement expression these days, it is no longer necessary to evaluate its return code, and the if() can just be droped. Link: https://lkml.kernel.org/r/20210927102410.1863853-1-arnd@kernel.org Link: https://lore.kernel.org/all/20210322132103.qiun2rjilnlgztxe@wittgenstein/ Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: James Morris <jamorris@linux.microsoft.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06ocfs2: do not zero pages beyond i_sizeJan Kara
ocfs2_zero_range_for_truncate() can try to zero pages beyond current inode size despite the fact that underlying blocks should be already zeroed out and writeback will skip writing such pages anyway. Avoid the pointless work. Link: https://lkml.kernel.org/r/20211025151332.11301-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06ocfs2: fix data corruption on truncateJan Kara
Patch series "ocfs2: Truncate data corruption fix". As further testing has shown, commit 5314454ea3f ("ocfs2: fix data corruption after conversion from inline format") didn't fix all the data corruption issues the customer started observing after 6dbf7bb55598 ("fs: Don't invalidate page buffers in block_write_full_page()") This time I have tracked them down to two bugs in ocfs2 truncation code. One bug (truncating page cache before clearing tail cluster and setting i_size) could cause data corruption even before 6dbf7bb55598, but before that commit it needed a race with page fault, after 6dbf7bb55598 it started to be pretty deterministic. Another bug (zeroing pages beyond old i_size) used to be harmless inefficiency before commit 6dbf7bb55598. But after commit 6dbf7bb55598 in combination with the first bug it resulted in deterministic data corruption. Although fixing only the first problem is needed to stop data corruption, I've fixed both issues to make the code more robust. This patch (of 2): ocfs2_truncate_file() did unmap invalidate page cache pages before zeroing partial tail cluster and setting i_size. Thus some pages could be left (and likely have left if the cluster zeroing happened) in the page cache beyond i_size after truncate finished letting user possibly see stale data once the file was extended again. Also the tail cluster zeroing was not guaranteed to finish before truncate finished causing possible stale data exposure. The problem started to be particularly easy to hit after commit 6dbf7bb55598 "fs: Don't invalidate page buffers in block_write_full_page()" stopped invalidation of pages beyond i_size from page writeback path. Fix these problems by unmapping and invalidating pages in the page cache after the i_size is reduced and tail cluster is zeroed out. Link: https://lkml.kernel.org/r/20211025150008.29002-1-jack@suse.cz Link: https://lkml.kernel.org/r/20211025151332.11301-1-jack@suse.cz Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06ocfs2/dlm: remove redundant assignment of variable retColin Ian King
The variable ret is being assigned a value that is never read, it is updated later on with a different value. The assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Link: https://lkml.kernel.org/r/20211007233452.30815-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06ocfs2: cleanup journal init and shutdownValentin Vidic
Allocate and free struct ocfs2_journal in ocfs2_journal_init and ocfs2_journal_shutdown. Init and release of system inodes references the journal so reorder calls to make sure they work correctly. Link: https://lkml.kernel.org/r/20211009145006.3478-1-vvidic@valentin-vidic.from.hr Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06ocfs2: fix handle refcount leak in two exception handling pathsChenyuan Mi
The reference counting issue happens in two exception handling paths of ocfs2_replay_truncate_records(). When executing these two exception handling paths, the function forgets to decrease the refcount of handle increased by ocfs2_start_trans(), causing a refcount leak. Fix this issue by using ocfs2_commit_trans() to decrease the refcount of handle in two handling paths. Link: https://lkml.kernel.org/r/20210908102055.10168-1-cymi20@fudan.edu.cn Signed-off-by: Chenyuan Mi <cymi20@fudan.edu.cn> Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Wengang Wang <wen.gang.wang@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06scripts/decodecode: fix faulting instruction no print when opps.file is DOS ↵weidonghui
format If opps.file is in DOS format, faulting instruction cannot be printed: / # ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- / # ./scripts/decodecode < oops.file [ 0.734345] Code: d0002881 912f9c21 94067e68 d2800001 (b900003f) aarch64-linux-gnu-strip: '/tmp/tmp.5Y9eybnnSi.o': No such file aarch64-linux-gnu-objdump: '/tmp/tmp.5Y9eybnnSi.o': No such file All code ======== 0: d0002881 adrp x1, 0x512000 4: 912f9c21 add x1, x1, #0xbe7 8: 94067e68 bl 0x19f9a8 c: d2800001 mov x1, #0x0 // #0 10: b900003f str wzr, [x1] Code starting with the faulting instruction =========================================== Background: The compilation environment is Ubuntu, and the test environment is Windows. Most logs are generated in the Windows environment. In this way, CR (carriage return) will inevitably appear, which will affect the use of decodecode in the Ubuntu environment. The repaired effect is as follows: / # ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- / # ./scripts/decodecode < oops.file [ 0.734345] Code: d0002881 912f9c21 94067e68 d2800001 (b900003f) All code ======== 0: d0002881 adrp x1, 0x512000 4: 912f9c21 add x1, x1, #0xbe7 8: 94067e68 bl 0x19f9a8 c: d2800001 mov x1, #0x0 // #0 10:* b900003f str wzr, [x1] <-- trapping instruction Code starting with the faulting instruction =========================================== 0: b900003f str wzr, [x1] Link: https://lkml.kernel.org/r/20211008064712.926-1-weidonghui@allwinnertech.com Signed-off-by: weidonghui <weidonghui@allwinnertech.com> Acked-by: Borislav Petkov <bp@suse.de> Cc: Marc Zyngier <maz@misterjones.org> Cc: Will Deacon <will@kernel.org> Cc: Rabin Vincent <rabin@rab.in> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06scripts/spelling.txt: fix "mistake" version of "synchronization"Sven Eckelmann
If both "mistake" version and "correction" version are the same, a warning message is created by checkpatch which is impossible to fix. But it was noticed that Colan Ian King created a commit e6c0a0889b80 ("ALSA: aloop: Fix spelling mistake "synchronization" -> "synchronization"") which suggests that this spelling mistake was fixed by replacing the word "synchronization" with itself. But the actual diff shows that the mistake in the code was "sychronization". It is rather likely that the "mistake" in spelling.txt should have been the latter. Link: https://lkml.kernel.org/r/20210926065529.6880-1-sven@narfation.org Fixes: 2e74c9433ba8 ("scripts/spelling.txt: add more spellings to spelling.txt") Signed-off-by: Sven Eckelmann <sven@narfation.org> Reviewed-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06scripts/spelling.txt: add more spellings to spelling.txtColin Ian King
Some of the more common spelling mistakes and typos that I've found while fixing up spelling mistakes in the kernel in the past few months. Link: https://lkml.kernel.org/r/20210907072941.7033-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-05smb3: add dynamic trace points for socket connectionSteve French
In debugging user problems with ip address/DNS issues with smb3 mounts, we sometimes needed additional info on the hostname and ip address. Add two tracepoints, one to show socket connection success and one for failures to connect to the socket. Sample output: mount.cifs-14551 [005] ..... 7636.547906: smb3_connect_done: conn_id=0x1 server=localhost addr=127.0.0.1:445 mount.cifs-14558 [004] ..... 7642.405413: smb3_connect_done: conn_id=0x2 server=smfrench.file.core.windows.net addr=52.239.158.232:445 mount.cifs-14741 [005] ..... 7818.490716: smb3_connect_done: conn_id=0x3 server=::1 addr=[::1]:445/0%0 mount.cifs-14810 [000] ..... 7966.380337: smb3_connect_err: rc=-101 conn_id=0x4 server=::2 addr=[::2]:445/0%0 mount.cifs-14810 [000] ..... 7966.380356: smb3_connect_err: rc=-101 conn_id=0x4 server=::2 addr=[::2]:139/0%0 mount.cifs-14818 [003] ..... 7986.771992: smb3_connect_done: conn_id=0x5 server=127.0.0.9 addr=127.0.0.9:445 mount.cifs-14825 [008] ..... 8008.178109: smb3_connect_err: rc=-115 conn_id=0x6 server=124.23.0.9 addr=124.23.0.9:445 mount.cifs-14825 [008] ..... 8013.298085: smb3_connect_err: rc=-115 conn_id=0x6 server=124.23.0.9 addr=124.23.0.9:139 cifsd-14553 [006] ..... 8036.735615: smb3_reconnect: conn_id=0x1 server=localhost current_mid=32 cifsd-14743 [010] ..... 8036.735644: smb3_reconnect: conn_id=0x3 server=::1 current_mid=29 cifsd-14743 [010] ..... 8039.921740: smb3_connect_err: rc=-111 conn_id=0x3 server=::1 addr=[::1]:445/0%0 cifsd-14553 [008] ..... 8042.993894: smb3_connect_err: rc=-111 conn_id=0x1 server=localhost addr=127.0.0.1:445 cifsd-14743 [010] ..... 8042.993894: smb3_connect_err: rc=-111 conn_id=0x3 server=::1 addr=[::1]:445/0%0 cifsd-14553 [008] ..... 8046.065824: smb3_connect_err: rc=-111 conn_id=0x1 server=localhost addr=127.0.0.1:445 cifsd-14743 [010] ..... 8046.065824: smb3_connect_err: rc=-111 conn_id=0x3 server=::1 addr=[::1]:445/0%0 cifsd-14553 [008] ..... 8049.137796: smb3_connect_done: conn_id=0x1 server=localhost addr=127.0.0.1:445 cifsd-14743 [010] ..... 8049.137796: smb3_connect_done: conn_id=0x3 server=::1 addr=[::1]:445/0%0 Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-11-05Merge branch 'remotes/lorenzo/pci/xgene'Bjorn Helgaas
- Use PCI_VENDOR_ID_AMCC macro instead of defining a custom macro in pci-xgene.c (Pali Rohár) * remotes/lorenzo/pci/xgene: PCI: xgene: Use PCI_VENDOR_ID_AMCC macro
2021-11-05Merge branch 'remotes/lorenzo/pci/vmd'Bjorn Helgaas
- Assign a number to each VMD controller to distinguish them in /proc/interrupts (Chunguang Xu) - Don't disable VMD MSI-X remapping if IOMMU remapping is enabled (Adrian Huang) - Add Kconfig dependency on !UML for allyesconfig build issue (Johannes Berg) * remotes/lorenzo/pci/vmd: PCI: vmd: depend on !UML PCI: vmd: Do not disable MSI-X remapping if interrupt remapping is enabled by IOMMU PCI: vmd: Assign a number to each VMD controller
2021-11-05Merge branch 'pci/host/rcar'Bjorn Helgaas
- Remove unneeded includes (Geert Uytterhoeven) * pci/host/rcar: PCI: rcar-host: Remove unneeded includes PCI: rcar-ep: Remove unneeded includes
2021-11-05Merge branch 'remotes/lorenzo/pci/qcom'Bjorn Helgaas
- Add Qualcomm PCIe Endpoint controller driver and DT binding (Manivannan Sadhasivam) - Add qcom struct for device-specific details in match data (Prasad Malisetty) - Switch pcie_1_pipe_clk_src from TCXO to pipe clock after PHY init in SC7280 (Prasad Malisetty) - Add .compatible device ID for SC8180x platform (Bjorn Andersson) * remotes/lorenzo/pci/qcom: PCI: qcom: Add sc8180x compatible PCI: qcom: Switch pcie_1_pipe_clk_src after PHY init in SC7280 PCI: qcom: Replace ops with struct pcie_cfg in pcie match data MAINTAINERS: Add entry for Qualcomm PCIe Endpoint driver and binding PCI: qcom-ep: Add Qualcomm PCIe Endpoint controller driver dt-bindings: PCI: Add Qualcomm PCIe Endpoint controller
2021-11-05Merge branch 'pci/host/mt7621'Bjorn Helgaas
- Add MediaTek MT7621 SoC PCIe host controller (moved from staging) (Sergio Paracuellos) * pci/host/mt7621: MAINTAINERS: Add Sergio Paracuellos as MT7621 PCIe maintainer PCI: mt7621: Add MediaTek MT7621 PCIe host controller driver dt-bindings: PCI: Add MT7621 SoC PCIe host controller # Conflicts: # drivers/pci/controller/Kconfig # drivers/pci/controller/Makefile
2021-11-05Merge branch 'pci/host/kirin'Bjorn Helgaas
- Reorganize to separate out PHY logic (Mauro Carvalho Chehab) - Add support for external PHY in addition to the internal PHY (Mauro Carvalho Chehab) - Use regmap so both kirin driver and PHY can access APB registers (Mauro Carvalho Chehab) - Add support for per-lane PERST# GPIOs on HiKey970 external PEX 8606 PCI bridge (Mauro Carvalho Chehab) - Add "hisilicon,kirin970-pcie" compatible string (Mauro Carvalho Chehab) - Add MODULE_* macros (Mauro Carvalho Chehab) - Allow building as a module (Mauro Carvalho Chehab) - Add power_off support for Kirin 960 PHY (Mauro Carvalho Chehab) - Disable clkreq during poweroff sequence (Mauro Carvalho Chehab) - De-init the dwc driver during remove (Mauro Carvalho Chehab) - Allow removal of kirin driver (Mauro Carvalho Chehab) * pci/host/kirin: PCI: kirin: Allow removing the driver PCI: kirin: De-init the dwc driver PCI: kirin: Disable clkreq during poweroff sequence PCI: kirin: Move the power-off code to a common routine PCI: kirin: Add power_off support for Kirin 960 PHY PCI: kirin: Allow building it as a module PCI: kirin: Add MODULE_* macros PCI: kirin: Add Kirin 970 compatible PCI: kirin: Support PERST# GPIOs for HiKey970 external PEX 8606 bridge PCI: kirin: Use regmap for APB registers PCI: kirin: Add support for a PHY layer PCI: kirin: Reorganize the PHY logic inside the driver
2021-11-05Merge branch 'remotes/lorenzo/pci/imx6'Bjorn Helgaas
- Remove unused assignment (Krzysztof Wilczyński) * remotes/lorenzo/pci/imx6: PCI: imx6: Remove unused assignment to variable ret
2021-11-05Merge branch 'remotes/lorenzo/pci/endpoint'Bjorn Helgaas
- Use sysfs_emit() in "show" functions instead of sprintf() to avoid buffer overruns (Kunihiko Hayashi) * remotes/lorenzo/pci/endpoint: PCI: endpoint: Use sysfs_emit() in "show" functions
2021-11-05Merge branch 'pci/host/dwc'Bjorn Helgaas
- Export dw_pcie_ep_reset_bar(), dw_pcie_link_up() so more drivers can be modular (Luca Ceresoli) - Allow dra7xx host and endpoint drivers to be modules (Luca Ceresoli) - Enable dra7xx optional external clock if present (Luca Ceresoli) - Clean up Kconfig dependencies for PCIE_DW_HOST- and PCIE_DW_EP-based drivers (Andy Shevchenko) - Remove visconti redundant dev_err() after platform_get_irq_byname() failure (Krzysztof Wilczyński) - Run dwc .host_init() method before registering MSI interrupt handler so we have a chance to deal with pending interrupts left by bootloader (Bjorn Andersson) - Serialize uniphier INTx masking/unmasking (Kunihiko Hayashi) * pci/host/dwc: PCI: uniphier: Serialize INTx masking/unmasking and fix the bit operation PCI: dwc: Perform host_init() before registering msi PCI: visconti: Remove surplus dev_err() when using platform_get_irq_byname() PCI: dwc: Clean up Kconfig dependencies (PCIE_DW_EP) PCI: dwc: Clean up Kconfig dependencies (PCIE_DW_HOST) PCI: dra7xx: Get an optional clock PCI: dra7xx: Remove unused include PCI: dra7xx: Make it a kernel module PCI: dwc: Export more symbols to allow modular drivers
2021-11-05Merge branch 'remotes/lorenzo/pci/dt'Bjorn Helgaas
- Add Rockchip SoC PCIe controller DT bindings (Simon Xue) * remotes/lorenzo/pci/dt: dt-bindings: rockchip: Add DesignWare based PCIe controller
2021-11-05Merge branch 'pci/host/cadence'Bjorn Helgaas
- Disable PHY when j721e_pcie_probe() fails after initializing it (Christophe JAILLET) - Return success when cdns-pcie probe succeeds instead of doing error cleanup (Li Chen) * pci/host/cadence: PCI: cadence: Add cdns_plat_pcie_probe() missing return PCI: j721e: Fix j721e_pcie_probe() error path
2021-11-05Merge branch 'pci/host/apple'Bjorn Helgaas
- Make of_phandle_args_to_fwspec() generally available (Marc Zyngier) - Allow matching of interrupt-maps local to interrupt controller or PCI device (Marc Zyngier) - Add Apple SoC (e.g., M1) PCIe host controller driver, which enables access to USB type-A, Ethernet, Wi-Fi, Bluetooth devices; these require additional drivers of their own (Alyssa Rosenzweig) - Add apple INTx, per-port, and MSI interrupt support (Marc Zyngier) - Configure apple Requester-ID-to-Stream-ID mapper for IOMMU (DART) support (Marc Zyngier) * pci/host/apple: PCI: apple: Configure RID to SID mapper on device addition iommu/dart: Exclude MSI doorbell from PCIe device IOVA range PCI: apple: Implement MSI support PCI: apple: Add INTx and per-port interrupt support PCI: apple: Set up reference clocks when probing PCI: apple: Add initial hardware bring-up PCI: of: Allow matching of an interrupt-map local to a PCI device of/irq: Allow matching of an interrupt-map local to an interrupt controller irqdomain: Make of_phandle_args_to_fwspec() generally available
2021-11-05Merge branch 'remotes/lorenzo/pci/aardvark'Bjorn Helgaas
- Define macros for PCI_EXP_DEVCTL_PAYLOAD_* (Pali Rohár) - Set Max Payload Size to 512 bytes per Marvell spec (Pali Rohár) - Downgrade PIO Response Status messages to debug level (Marek Behún) - Preserve CRS SV (Config Request Retry Software Visibility) bit in emulated Root Control register (Pali Rohár) - Fix issue in configuring reference clock (Pali Rohár) - Don't clear status bits for masked interrupts (Pali Rohár) - Don't mask unused interrupts (Pali Rohár) - Avoid code repetition in advk_pcie_rd_conf() (Marek Behún) - Retry config accesses on CRS response (Pali Rohár) - Simplify emulated Root Capabilities initialization (Pali Rohár) - Fix several link training issues (Pali Rohár) - Fix link-up checking via LTSSM (Pali Rohár) - Fix reporting of Data Link Layer Link Active (Pali Rohár) - Fix emulation of W1C bits (Marek Behún) - Fix MSI domain .alloc() method to return zero on success (Marek Behún) - Read entire 16-bit MSI vector in MSI handler, not just low 8 bits (Marek Behún) - Clear Root Port I/O Space, Memory Space, and Bus Master Enable bits at startup; PCI core will set those as necessary (Pali Rohár) - When operating as a Root Port, set class code to "PCI Bridge" instead of the default "Mass Storage Controller" (Pali Rohár) - Add emulation for PCI_BRIDGE_CTL_BUS_RESET since aardvark doesn't implement this per spec (Pali Rohár) - Add emulation of option ROM BAR since aardvark doesn't implement this per spec (Pali Rohár) * remotes/lorenzo/pci/aardvark: PCI: aardvark: Fix support for PCI_ROM_ADDRESS1 on emulated bridge PCI: aardvark: Fix support for PCI_BRIDGE_CTL_BUS_RESET on emulated bridge PCI: aardvark: Set PCI Bridge Class Code to PCI Bridge PCI: aardvark: Fix support for bus mastering and PCI_COMMAND on emulated bridge PCI: aardvark: Read all 16-bits from PCIE_MSI_PAYLOAD_REG PCI: aardvark: Fix return value of MSI domain .alloc() method PCI: pci-bridge-emul: Fix emulation of W1C bits PCI: aardvark: Fix reporting Data Link Layer Link Active PCI: aardvark: Fix checking for link up via LTSSM state PCI: aardvark: Fix link training PCI: aardvark: Simplify initialization of rootcap on virtual bridge PCI: aardvark: Implement re-issuing config requests on CRS response PCI: aardvark: Deduplicate code in advk_pcie_rd_conf() PCI: aardvark: Do not unmask unused interrupts PCI: aardvark: Do not clear status bits of masked interrupts PCI: aardvark: Fix configuring Reference clock PCI: aardvark: Fix preserving PCI_EXP_RTCTL_CRSSVE flag on emulated bridge PCI: aardvark: Don't spam about PIO Response Status PCI: aardvark: Fix PCIe Max Payload Size setting PCI: Add PCI_EXP_DEVCTL_PAYLOAD_* macros
2021-11-05Merge branch 'pci/misc'Bjorn Helgaas
- Tidy setup-irq.c comments (Pranay Sanghai) - Fix misspellings (Krzysztof Wilczyński) - Fix sprintf(), sscanf() format mismatches (Krzysztof Wilczyński) - Tidy cpqphp code formatting (Krzysztof Wilczyński) - Remove unused pci_pool wrappers, which have been replaced by dma_pool (Cai Huoqing) - Remove a redundant initialization in __pci_reset_function_locked() (Colin Ian King) - Use 'unsigned int' instead of 'unsigned' (Krzysztof Wilczyński) - Update PCI subsystem information in MAINTAINERS (Krzysztof Wilczyński) - Include generic <linux/> headers instead of <asm/> for cpqphp and vmd (Krzysztof Wilczyński) * pci/misc: PCI: vmd: Drop redundant includes of <asm/device.h>, <asm/msi.h> PCI: cpqphp: Use <linux/io.h> instead of <asm/io.h> MAINTAINERS: Update PCI subsystem information PCI: Prefer 'unsigned int' over bare 'unsigned' PCI: Remove redundant 'rc' initialization PCI: Remove unused pci_pool wrappers PCI: cpqphp: Format if-statement code block correctly PCI: Use unsigned to match sscanf("%x") in pci_dev_str_match_path() PCI: hv: Remove unnecessary use of %hx PCI: Correct misspelled and remove duplicated words PCI: Tidy comments
2021-11-05Merge branch 'pci/vpd'Bjorn Helgaas
- Add pci_read_vpd_any(), pci_write_vpd_any() to access VPD at arbitrary offsets (Heiner Kallweit) - Use VPD API to replace custom code in cxgb3 driver (Heiner Kallweit) * pci/vpd: cxgb3: Remove seeprom_write and use VPD API cxgb3: Use VPD API in t3_seeprom_wp() cxgb3: Remove t3_seeprom_read and use VPD API PCI/VPD: Use pci_read_vpd_any() in pci_vpd_size() PCI/VPD: Add pci_read/write_vpd_any()
2021-11-05Merge branch 'pci/virtualization'Bjorn Helgaas
- Avoid bus resets on Atheros QCA6174, since they hang (Ingmar Klein) - Use store and forward mode on Pericom PI7C9X2G switches to avoid ACS erratum with ACS P2P Request Redirect (Nathan Rossi) * pci/virtualization: PCI: Add ACS quirk for Pericom PI7C9X2G switches PCI: Mark Atheros QCA6174 to avoid bus reset
2021-11-05Merge branch 'pci/sysfs'Bjorn Helgaas
- Check for CAP_SYS_ADMIN before validating sysfs user input, not after (Krzysztof Wilczyński) - Always return -EINVAL from sysfs "store" functions for invalid user input instead of -EINVAL sometimes and -ERANGE others (Krzysztof Wilczyński) - Use kstrtobool() directly instead of the strtobool() wrapper (Krzysztof Wilczyński) * pci/sysfs: PCI: Use kstrtobool() directly, sans strtobool() wrapper PCI/sysfs: Return -EINVAL consistently from "store" functions PCI/sysfs: Check CAP_SYS_ADMIN before parsing user input # Conflicts: # drivers/pci/iov.c
2021-11-05Merge branch 'pci/switchtec'Bjorn Helgaas
- Return error to application when command execution fails because an out-of-band reset has cleared the device BARs, Memory Space Enable, etc (Kelvin Cao) - Fix MRPC error status handling issue (Kelvin Cao) - Mask out other bits when reading of management VEP instance ID (Kelvin Cao) - Return EOPNOTSUPP instead of ENOTSUPP from sysfs show functions (Kelvin Cao) - Add check of event support (Logan Gunthorpe) * pci/switchtec: PCI/switchtec: Add check of event support PCI/switchtec: Replace ENOTSUPP with EOPNOTSUPP PCI/switchtec: Update the way of getting management VEP instance ID PCI/switchtec: Fix a MRPC error status handling issue PCI/switchtec: Error out MRPC execution when MMIO reads fail
2021-11-05Merge branch 'pci/resource'Bjorn Helgaas
- Coalesce host bridge contiguous apertures to allow P2P bridge windows that span several contiguous host bridge apertures (Kai-Heng Feng) * pci/resource: PCI: Coalesce host bridge contiguous apertures
2021-11-05Merge branch 'pci/portdrv'Bjorn Helgaas
- Don't setup portdrv IRQs if there are no port drivers that use them, to conserve vectors and avoid spurious events (Jan Kiszka) * pci/portdrv: PCI/portdrv: Do not setup up IRQs if there are no users