summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-10-02MAINTAINERS: Remove dead path from LOCKING PRIMITIVES entryWill Deacon
Since 890658b7ab48 ("locking/mutex: Kill arch specific code"), there are no mutex header files under arch/, so we can remove the redundant entry from MAINTAINERS. Reported-by: Joe Perches <joe@perches.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Jason Low <jason.low2@hpe.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181001142856.GC9716@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02drm: fix use-after-free read in drm_mode_create_lease_ioctl()Jann Horn
fd_install() moves the reference given to it into the file descriptor table of the current process. If the current process is multithreaded, then immediately after fd_install(), another thread can close() the file descriptor and cause the file's resources to be cleaned up. Since the reference to "lessee" is held by the file, we must not access "lessee" after the fd_install() call. As far as I can tell, to reach this codepath, the caller must have an open file descriptor to a DRI device in master mode. I'm not sure what the requirements for that are. Signed-off-by: Jann Horn <jannh@google.com> Fixes: 62884cd386b8 ("drm: Add four ioctls for managing drm mode object leases [v7]") Cc: stable@vger.kernel.org Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20181001153117.216923-1-jannh@google.com
2018-10-02sched/numa: Avoid task migration for small NUMA improvementSrikar Dronamraju
If NUMA improvement from the task migration is going to be very minimal, then avoid task migration. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 198512 205910 3.72673 1 313559 318491 1.57291 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 8 74761.9 74935.9 0.232739 1 214874 226796 5.54837 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 180536 189780 5.12031 1 210281 205695 -2.18089 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 56511.4 60370 6.828 1 104899 108100 3.05151 1/7 cases is regressing, if we look at events migrate_pages seem to vary the most especially in the regressing case. Also some amount of variance is expected between different runs of Specjbb2005. Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,818,546 13,801,554 migrations 1,149,960 1,151,541 faults 385,583 433,246 cache-misses 55,259,546,768 55,168,691,835 sched:sched_move_numa 2,257 2,551 sched:sched_stick_numa 9 24 sched:sched_swap_numa 512 904 migrate:mm_migrate_pages 2,225 1,571 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 72692 113682 numa_hint_faults_local 62270 102163 numa_hit 238762 240181 numa_huge_pte_updates 48 36 numa_interleave 75 64 numa_local 238676 240103 numa_other 86 78 numa_pages_migrated 2225 1564 numa_pte_updates 98557 134080 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,173,490 3,079,150 migrations 36,966 31,455 faults 108,776 99,081 cache-misses 12,200,075,320 11,588,126,740 sched:sched_move_numa 1,264 1 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 899 36 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 21109 430 numa_hint_faults_local 17120 77 numa_hit 72934 71277 numa_huge_pte_updates 42 0 numa_interleave 33 22 numa_local 72866 71218 numa_other 68 59 numa_pages_migrated 915 23 numa_pte_updates 42326 0 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,312,022 8,707,565 migrations 231,705 171,342 faults 310,242 310,820 cache-misses 402,324,573 136,115,400 sched:sched_move_numa 193 215 sched:sched_stick_numa 0 6 sched:sched_swap_numa 3 24 migrate:mm_migrate_pages 93 162 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 11838 8985 numa_hint_faults_local 11216 8154 numa_hit 90689 93819 numa_huge_pte_updates 0 0 numa_interleave 1579 882 numa_local 89634 93496 numa_other 1055 323 numa_pages_migrated 92 169 numa_pte_updates 12109 9217 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,170,481 2,152,072 migrations 10,126 10,704 faults 160,962 164,376 cache-misses 10,834,845 3,818,437 sched:sched_move_numa 10 16 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 7 migrate:mm_migrate_pages 2 199 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 403 2248 numa_hint_faults_local 358 1666 numa_hit 25898 25704 numa_huge_pte_updates 0 0 numa_interleave 207 200 numa_local 25860 25679 numa_other 38 25 numa_pages_migrated 2 197 numa_pte_updates 400 2234 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 110,339,633 93,330,595 migrations 4,139,812 4,122,061 faults 863,622 865,979 cache-misses 231,838,045,660 225,395,083,479 sched:sched_move_numa 2,196 2,372 sched:sched_stick_numa 33 24 sched:sched_swap_numa 544 769 migrate:mm_migrate_pages 2,469 1,677 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 85748 91638 numa_hint_faults_local 66831 78096 numa_hit 242213 242225 numa_huge_pte_updates 0 0 numa_interleave 0 2 numa_local 242211 242219 numa_other 2 6 numa_pages_migrated 2376 1515 numa_pte_updates 86233 92274 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 59,331,057 51,487,271 migrations 552,019 537,170 faults 266,586 256,921 cache-misses 73,796,312,990 70,073,831,187 sched:sched_move_numa 981 576 sched:sched_stick_numa 54 24 sched:sched_swap_numa 286 327 migrate:mm_migrate_pages 713 726 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 14807 12000 numa_hint_faults_local 5738 5024 numa_hit 36230 36470 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 36228 36465 numa_other 2 5 numa_pages_migrated 703 726 numa_pte_updates 14742 11930 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-7-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02mm/migrate: Use spin_trylock() while resetting rate limitSrikar Dronamraju
Since this spinlock will only serialize the migrate rate limiting, convert the spin_lock() to a spin_trylock(). If another thread is updating, this task can move on. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 205332 198512 -3.32145 1 319785 313559 -1.94693 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 8 74912 74761.9 -0.200368 1 206585 214874 4.01239 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 189162 180536 -4.56011 1 213760 210281 -1.62753 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 58736.8 56511.4 -3.78877 1 105419 104899 -0.49327 Avoiding stretching of window intervals may be the reason for the regression. Also code now uses READ_ONCE/WRITE_ONCE. That may also be hurting performance to some extent. Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 14,285,708 13,818,546 migrations 1,180,621 1,149,960 faults 339,114 385,583 cache-misses 55,205,631,894 55,259,546,768 sched:sched_move_numa 843 2,257 sched:sched_stick_numa 6 9 sched:sched_swap_numa 219 512 migrate:mm_migrate_pages 365 2,225 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 26907 72692 numa_hint_faults_local 24279 62270 numa_hit 239771 238762 numa_huge_pte_updates 0 48 numa_interleave 68 75 numa_local 239688 238676 numa_other 83 86 numa_pages_migrated 363 2225 numa_pte_updates 27415 98557 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,202,779 3,173,490 migrations 37,186 36,966 faults 106,076 108,776 cache-misses 12,024,873,744 12,200,075,320 sched:sched_move_numa 931 1,264 sched:sched_stick_numa 0 0 sched:sched_swap_numa 1 0 migrate:mm_migrate_pages 637 899 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 17409 21109 numa_hint_faults_local 14367 17120 numa_hit 73953 72934 numa_huge_pte_updates 20 42 numa_interleave 25 33 numa_local 73892 72866 numa_other 61 68 numa_pages_migrated 668 915 numa_pte_updates 27276 42326 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,474,013 8,312,022 migrations 254,934 231,705 faults 320,506 310,242 cache-misses 110,580,458 402,324,573 sched:sched_move_numa 725 193 sched:sched_stick_numa 0 0 sched:sched_swap_numa 7 3 migrate:mm_migrate_pages 145 93 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 22797 11838 numa_hint_faults_local 21539 11216 numa_hit 89308 90689 numa_huge_pte_updates 0 0 numa_interleave 865 1579 numa_local 88955 89634 numa_other 353 1055 numa_pages_migrated 149 92 numa_pte_updates 22930 12109 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,195,628 2,170,481 migrations 11,179 10,126 faults 149,656 160,962 cache-misses 8,117,515 10,834,845 sched:sched_move_numa 49 10 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 5 2 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 3577 403 numa_hint_faults_local 3476 358 numa_hit 26142 25898 numa_huge_pte_updates 0 0 numa_interleave 358 207 numa_local 26042 25860 numa_other 100 38 numa_pages_migrated 5 2 numa_pte_updates 3587 400 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 100,602,296 110,339,633 migrations 4,135,630 4,139,812 faults 789,256 863,622 cache-misses 226,160,621,058 231,838,045,660 sched:sched_move_numa 1,366 2,196 sched:sched_stick_numa 16 33 sched:sched_swap_numa 374 544 migrate:mm_migrate_pages 1,350 2,469 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 47857 85748 numa_hint_faults_local 39768 66831 numa_hit 240165 242213 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 240165 242211 numa_other 0 2 numa_pages_migrated 1224 2376 numa_pte_updates 48354 86233 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 58,515,496 59,331,057 migrations 564,845 552,019 faults 245,807 266,586 cache-misses 73,603,757,976 73,796,312,990 sched:sched_move_numa 996 981 sched:sched_stick_numa 10 54 sched:sched_swap_numa 193 286 migrate:mm_migrate_pages 646 713 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 13422 14807 numa_hint_faults_local 5619 5738 numa_hit 36118 36230 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 36116 36228 numa_other 2 2 numa_pages_migrated 616 703 numa_pte_updates 13374 14742 Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-6-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02sched/numa: Limit the conditions where scan period is resetMel Gorman
migrate_task_rq_fair() resets the scan rate for NUMA balancing on every cross-node migration. In the event of excessive load balancing due to saturation, this may result in the scan rate being pegged at maximum and further overloading the machine. This patch only resets the scan if NUMA balancing is active, a preferred node has been selected and the task is being migrated from the preferred node as these are the most harmful. For example, a migration to the preferred node does not justify a faster scan rate. Similarly, a migration between two nodes that are not preferred is probably bouncing due to over-saturation of the machine. In that case, scanning faster and trapping more NUMA faults will further overload the machine. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 203370 205332 0.964744 1 328431 319785 -2.63252 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 206070 206585 0.249915 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 188386 189162 0.41192 1 201566 213760 6.04963 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 59157.4 58736.8 -0.710985 1 105495 105419 -0.0720413 Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,825,492 14,285,708 migrations 1,152,509 1,180,621 faults 371,948 339,114 cache-misses 55,654,206,041 55,205,631,894 sched:sched_move_numa 1,856 843 sched:sched_stick_numa 4 6 sched:sched_swap_numa 428 219 migrate:mm_migrate_pages 898 365 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 57146 26907 numa_hint_faults_local 51612 24279 numa_hit 238164 239771 numa_huge_pte_updates 16 0 numa_interleave 63 68 numa_local 238085 239688 numa_other 79 83 numa_pages_migrated 883 363 numa_pte_updates 67540 27415 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,288,525 3,202,779 migrations 38,652 37,186 faults 111,678 106,076 cache-misses 12,111,197,376 12,024,873,744 sched:sched_move_numa 900 931 sched:sched_stick_numa 0 0 sched:sched_swap_numa 5 1 migrate:mm_migrate_pages 714 637 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 18572 17409 numa_hint_faults_local 14850 14367 numa_hit 73197 73953 numa_huge_pte_updates 11 20 numa_interleave 25 25 numa_local 73138 73892 numa_other 59 61 numa_pages_migrated 712 668 numa_pte_updates 24021 27276 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,451,543 8,474,013 migrations 202,804 254,934 faults 310,024 320,506 cache-misses 253,522,507 110,580,458 sched:sched_move_numa 213 725 sched:sched_stick_numa 0 0 sched:sched_swap_numa 2 7 migrate:mm_migrate_pages 88 145 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 11830 22797 numa_hint_faults_local 11301 21539 numa_hit 90038 89308 numa_huge_pte_updates 0 0 numa_interleave 855 865 numa_local 89796 88955 numa_other 242 353 numa_pages_migrated 88 149 numa_pte_updates 12039 22930 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,049,153 2,195,628 migrations 11,405 11,179 faults 162,309 149,656 cache-misses 7,203,343 8,117,515 sched:sched_move_numa 22 49 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 1 5 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 1693 3577 numa_hint_faults_local 1669 3476 numa_hit 25177 26142 numa_huge_pte_updates 0 0 numa_interleave 194 358 numa_local 24993 26042 numa_other 184 100 numa_pages_migrated 1 5 numa_pte_updates 1577 3587 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 94,515,937 100,602,296 migrations 4,203,554 4,135,630 faults 832,697 789,256 cache-misses 226,248,698,331 226,160,621,058 sched:sched_move_numa 1,730 1,366 sched:sched_stick_numa 14 16 sched:sched_swap_numa 432 374 migrate:mm_migrate_pages 1,398 1,350 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 80079 47857 numa_hint_faults_local 68620 39768 numa_hit 241187 240165 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 241186 240165 numa_other 1 0 numa_pages_migrated 1347 1224 numa_pte_updates 80729 48354 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 63,704,961 58,515,496 migrations 573,404 564,845 faults 230,878 245,807 cache-misses 76,568,222,781 73,603,757,976 sched:sched_move_numa 509 996 sched:sched_stick_numa 31 10 sched:sched_swap_numa 182 193 migrate:mm_migrate_pages 541 646 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 8501 13422 numa_hint_faults_local 2960 5619 numa_hit 35526 36118 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 35526 36116 numa_other 0 2 numa_pages_migrated 539 616 numa_pte_updates 8433 13374 Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-5-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02sched/numa: Reset scan rate whenever task moves across nodesSrikar Dronamraju
Currently task scan rate is reset when NUMA balancer migrates the task to a different node. If NUMA balancer initiates a swap, reset is only applicable to the task that initiates the swap. Similarly no scan rate reset is done if the task is migrated across nodes by traditional load balancer. Instead move the scan reset to the migrate_task_rq. This ensures the task moved out of its preferred node, either gets back to its preferred node quickly or finds a new preferred node. Doing so, would be fair to all tasks migrating across nodes. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 200668 203370 1.3465 1 321791 328431 2.06345 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 204848 206070 0.59654 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 188098 188386 0.153112 1 200351 201566 0.606436 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 58145.9 59157.4 1.73959 1 103798 105495 1.63491 Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,912,183 13,825,492 migrations 1,155,931 1,152,509 faults 367,139 371,948 cache-misses 54,240,196,814 55,654,206,041 sched:sched_move_numa 1,571 1,856 sched:sched_stick_numa 9 4 sched:sched_swap_numa 463 428 migrate:mm_migrate_pages 703 898 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 50155 57146 numa_hint_faults_local 45264 51612 numa_hit 239652 238164 numa_huge_pte_updates 36 16 numa_interleave 68 63 numa_local 239576 238085 numa_other 76 79 numa_pages_migrated 680 883 numa_pte_updates 71146 67540 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,156,720 3,288,525 migrations 30,354 38,652 faults 97,261 111,678 cache-misses 12,400,026,826 12,111,197,376 sched:sched_move_numa 4 900 sched:sched_stick_numa 0 0 sched:sched_swap_numa 1 5 migrate:mm_migrate_pages 20 714 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 272 18572 numa_hint_faults_local 186 14850 numa_hit 71362 73197 numa_huge_pte_updates 0 11 numa_interleave 23 25 numa_local 71299 73138 numa_other 63 59 numa_pages_migrated 2 712 numa_pte_updates 0 24021 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,606,824 8,451,543 migrations 155,352 202,804 faults 301,409 310,024 cache-misses 157,759,224 253,522,507 sched:sched_move_numa 168 213 sched:sched_stick_numa 0 0 sched:sched_swap_numa 3 2 migrate:mm_migrate_pages 125 88 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 4650 11830 numa_hint_faults_local 3946 11301 numa_hit 90489 90038 numa_huge_pte_updates 0 0 numa_interleave 892 855 numa_local 90034 89796 numa_other 455 242 numa_pages_migrated 124 88 numa_pte_updates 4818 12039 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,113,167 2,049,153 migrations 10,533 11,405 faults 142,727 162,309 cache-misses 5,594,192 7,203,343 sched:sched_move_numa 10 22 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 6 1 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 744 1693 numa_hint_faults_local 584 1669 numa_hit 25551 25177 numa_huge_pte_updates 0 0 numa_interleave 263 194 numa_local 25302 24993 numa_other 249 184 numa_pages_migrated 6 1 numa_pte_updates 744 1577 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 101,227,352 94,515,937 migrations 4,151,829 4,203,554 faults 745,233 832,697 cache-misses 224,669,561,766 226,248,698,331 sched:sched_move_numa 617 1,730 sched:sched_stick_numa 2 14 sched:sched_swap_numa 187 432 migrate:mm_migrate_pages 316 1,398 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 24195 80079 numa_hint_faults_local 21639 68620 numa_hit 238331 241187 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 238331 241186 numa_other 0 1 numa_pages_migrated 204 1347 numa_pte_updates 24561 80729 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 62,738,978 63,704,961 migrations 562,702 573,404 faults 228,465 230,878 cache-misses 75,778,067,952 76,568,222,781 sched:sched_move_numa 648 509 sched:sched_stick_numa 13 31 sched:sched_swap_numa 137 182 migrate:mm_migrate_pages 733 541 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 10281 8501 numa_hint_faults_local 3242 2960 numa_hit 36338 35526 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 36338 35526 numa_other 0 0 numa_pages_migrated 706 539 numa_pte_updates 10176 8433 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-4-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02sched/numa: Pass destination CPU as a parameter to migrate_task_rqSrikar Dronamraju
This additional parameter (new_cpu) is used later for identifying if task migration is across nodes. No functional change. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 203353 200668 -1.32036 1 328205 321791 -1.95427 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 214384 204848 -4.44809 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 188553 188098 -0.241311 1 196273 200351 2.07772 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 57581.2 58145.9 0.980702 1 103468 103798 0.318939 Brings out the variance between different specjbb2005 runs. Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,941,377 13,912,183 migrations 1,157,323 1,155,931 faults 382,175 367,139 cache-misses 54,993,823,500 54,240,196,814 sched:sched_move_numa 2,005 1,571 sched:sched_stick_numa 14 9 sched:sched_swap_numa 529 463 migrate:mm_migrate_pages 1,573 703 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 67099 50155 numa_hint_faults_local 58456 45264 numa_hit 240416 239652 numa_huge_pte_updates 18 36 numa_interleave 65 68 numa_local 240339 239576 numa_other 77 76 numa_pages_migrated 1574 680 numa_pte_updates 77182 71146 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,176,453 3,156,720 migrations 30,238 30,354 faults 87,869 97,261 cache-misses 12,544,479,391 12,400,026,826 sched:sched_move_numa 23 4 sched:sched_stick_numa 0 0 sched:sched_swap_numa 6 1 migrate:mm_migrate_pages 10 20 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 236 272 numa_hint_faults_local 201 186 numa_hit 72293 71362 numa_huge_pte_updates 0 0 numa_interleave 26 23 numa_local 72233 71299 numa_other 60 63 numa_pages_migrated 8 2 numa_pte_updates 0 0 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,478,820 8,606,824 migrations 171,323 155,352 faults 307,499 301,409 cache-misses 240,353,599 157,759,224 sched:sched_move_numa 214 168 sched:sched_stick_numa 0 0 sched:sched_swap_numa 4 3 migrate:mm_migrate_pages 89 125 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 5301 4650 numa_hint_faults_local 4745 3946 numa_hit 92943 90489 numa_huge_pte_updates 0 0 numa_interleave 899 892 numa_local 92345 90034 numa_other 598 455 numa_pages_migrated 88 124 numa_pte_updates 5505 4818 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,066,172 2,113,167 migrations 11,076 10,533 faults 149,544 142,727 cache-misses 10,398,067 5,594,192 sched:sched_move_numa 43 10 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 6 6 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 3552 744 numa_hint_faults_local 3347 584 numa_hit 25611 25551 numa_huge_pte_updates 0 0 numa_interleave 213 263 numa_local 25583 25302 numa_other 28 249 numa_pages_migrated 6 6 numa_pte_updates 3535 744 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 99,358,136 101,227,352 migrations 4,041,607 4,151,829 faults 749,653 745,233 cache-misses 225,562,543,251 224,669,561,766 sched:sched_move_numa 771 617 sched:sched_stick_numa 14 2 sched:sched_swap_numa 204 187 migrate:mm_migrate_pages 1,180 316 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 27409 24195 numa_hint_faults_local 20677 21639 numa_hit 239988 238331 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 239983 238331 numa_other 5 0 numa_pages_migrated 1016 204 numa_pte_updates 27916 24561 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 60,899,307 62,738,978 migrations 544,668 562,702 faults 270,834 228,465 cache-misses 74,543,455,635 75,778,067,952 sched:sched_move_numa 735 648 sched:sched_stick_numa 25 13 sched:sched_swap_numa 174 137 migrate:mm_migrate_pages 816 733 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 11059 10281 numa_hint_faults_local 4733 3242 numa_hit 41384 36338 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 41383 36338 numa_other 1 0 numa_pages_migrated 815 706 numa_pte_updates 11323 10176 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02sched/numa: Stop multiple tasks from moving to the CPU at the same timeSrikar Dronamraju
Task migration under NUMA balancing can happen in parallel. More than one task might choose to migrate to the same CPU at the same time. This can result in: - During task swap, choosing a task that was not part of the evaluation. - During task swap, task which just got moved into its preferred node, moving to a completely different node. - During task swap, task failing to move to the preferred node, will have to wait an extra interval for the next migrate opportunity. - During task movement, multiple task movements can cause load imbalance. This problem is more likely if there are more cores per node or more nodes in the system. Use a per run-queue variable to check if NUMA-balance is active on the run-queue. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 200194 203353 1.57797 1 311331 328205 5.41995 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 1 197654 214384 8.46429 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 192605 188553 -2.10379 1 213402 196273 -8.02664 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 52227.1 57581.2 10.2516 1 102529 103468 0.915838 There is a regression on power 9 box. If we look at the details, that box has a sudden jump in cache-misses with this patch. All other parameters seem to be pointing towards NUMA consolidation. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,345,784 13,941,377 migrations 1,127,820 1,157,323 faults 374,736 382,175 cache-misses 55,132,054,603 54,993,823,500 sched:sched_move_numa 1,923 2,005 sched:sched_stick_numa 52 14 sched:sched_swap_numa 595 529 migrate:mm_migrate_pages 1,932 1,573 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 60605 67099 numa_hint_faults_local 51804 58456 numa_hit 239945 240416 numa_huge_pte_updates 14 18 numa_interleave 60 65 numa_local 239865 240339 numa_other 80 77 numa_pages_migrated 1931 1574 numa_pte_updates 67823 77182 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,016,467 3,176,453 migrations 37,326 30,238 faults 115,342 87,869 cache-misses 11,692,155,554 12,544,479,391 sched:sched_move_numa 965 23 sched:sched_stick_numa 8 0 sched:sched_swap_numa 35 6 migrate:mm_migrate_pages 1,168 10 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 16286 236 numa_hint_faults_local 11863 201 numa_hit 112482 72293 numa_huge_pte_updates 33 0 numa_interleave 20 26 numa_local 112419 72233 numa_other 63 60 numa_pages_migrated 1144 8 numa_pte_updates 32859 0 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,629,724 8,478,820 migrations 221,052 171,323 faults 308,661 307,499 cache-misses 135,574,913 240,353,599 sched:sched_move_numa 147 214 sched:sched_stick_numa 0 0 sched:sched_swap_numa 2 4 migrate:mm_migrate_pages 64 89 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 11481 5301 numa_hint_faults_local 10968 4745 numa_hit 89773 92943 numa_huge_pte_updates 0 0 numa_interleave 1116 899 numa_local 89220 92345 numa_other 553 598 numa_pages_migrated 62 88 numa_pte_updates 11694 5505 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,272,887 2,066,172 migrations 12,206 11,076 faults 163,704 149,544 cache-misses 4,801,186 10,398,067 sched:sched_move_numa 44 43 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 17 6 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 2261 3552 numa_hint_faults_local 1993 3347 numa_hit 25726 25611 numa_huge_pte_updates 0 0 numa_interleave 239 213 numa_local 25498 25583 numa_other 228 28 numa_pages_migrated 17 6 numa_pte_updates 2266 3535 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 117,980,962 99,358,136 migrations 3,950,220 4,041,607 faults 736,979 749,653 cache-misses 224,976,072,879 225,562,543,251 sched:sched_move_numa 504 771 sched:sched_stick_numa 50 14 sched:sched_swap_numa 239 204 migrate:mm_migrate_pages 1,260 1,180 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 18293 27409 numa_hint_faults_local 11969 20677 numa_hit 240854 239988 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 240851 239983 numa_other 3 5 numa_pages_migrated 1190 1016 numa_pte_updates 18106 27916 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 61,053,158 60,899,307 migrations 551,586 544,668 faults 244,174 270,834 cache-misses 74,326,766,973 74,543,455,635 sched:sched_move_numa 344 735 sched:sched_stick_numa 24 25 sched:sched_swap_numa 140 174 migrate:mm_migrate_pages 568 816 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 6461 11059 numa_hint_faults_local 2283 4733 numa_hit 35661 41384 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 35661 41383 numa_other 0 1 numa_pages_migrated 568 815 numa_pte_updates 6518 11323 Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf eventsNatarajan, Janakarajan
In Family 17h, some L3 Cache Performance events require the ThreadMask and SliceMask to be set. For other events, these fields do not affect the count either way. Set ThreadMask and SliceMask to 0xFF and 0xF respectively. Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H . Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Suravee <Suravee.Suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/Message-ID: Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02perf/x86/intel/uncore: Fix PCI BDF address of M3UPI on SKXKan Liang
The counters on M3UPI Link 0 and Link 3 don't count properly, and writing 0 to these counters may causes system crash on some machines. The PCI BDF addresses of the M3UPI in the current code are incorrect. The correct addresses should be: D18:F1 0x204D D18:F2 0x204E D18:F5 0x204D Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: cd34cd97b7b4 ("perf/x86/intel/uncore: Add Skylake server uncore support") Link: http://lkml.kernel.org/r/1537538826-55489-1-git-send-email-kan.liang@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02perf/ring_buffer: Prevent concurent ring buffer accessJiri Olsa
Some of the scheduling tracepoints allow the perf_tp_event code to write to ring buffer under different cpu than the code is running on. This results in corrupted ring buffer data demonstrated in following perf commands: # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging # Running 'sched/messaging' benchmark: # 20 sender and receiver processes per group # 10 groups == 400 processes run Total time: 0.383 [sec] [ perf record: Woken up 8 times to write data ] 0x42b890 [0]: failed to process type: -1765585640 [ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ] # perf report --stdio 0x42b890 [0]: failed to process type: -1765585640 The reason for the corruption are some of the scheduling tracepoints, that have __perf_task dfined and thus allow to store data to another cpu ring buffer: sched_waking sched_wakeup sched_wakeup_new sched_stat_wait sched_stat_sleep sched_stat_iowait sched_stat_blocked The perf_tp_event function first store samples for current cpu related events defined for tracepoint: hlist_for_each_entry_rcu(event, head, hlist_entry) perf_swevent_event(event, count, &data, regs); And then iterates events of the 'task' and store the sample for any task's event that passes tracepoint checks: ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]); list_for_each_entry_rcu(event, &ctx->event_list, event_entry) { if (event->attr.type != PERF_TYPE_TRACEPOINT) continue; if (event->attr.config != entry->type) continue; perf_swevent_event(event, count, &data, regs); } Above code can race with same code running on another cpu, ending up with 2 cpus trying to store under the same ring buffer, which is specifically not allowed. This patch prevents the problem, by allowing only events with the same current cpu to receive the event. NOTE: this requires the use of (per-task-)per-cpu buffers for this feature to work; perf-record does this. Signed-off-by: Jiri Olsa <jolsa@kernel.org> [peterz: small edits to Changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrew Vagin <avagin@openvz.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: e6dab5ffab59 ("perf/trace: Add ability to set a target task for events") Link: http://lkml.kernel.org/r/20180923161343.GB15054@krava Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02perf/x86/intel/uncore: Use boot_cpu_data.phys_proc_id instead of hardcorded ↵Masayoshi Mizuma
physical package ID 0 Physical package id 0 doesn't always exist, we should use boot_cpu_data.phys_proc_id here. Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masayoshi Mizuma <msys.mizuma@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/20180910144750.6782-1-msys.mizuma@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02perf/core: Fix perf_pmu_unregister() lockingPeter Zijlstra
When we unregister a PMU, we fail to serialize the @pmu_idr properly. Fix that by doing the entire thing under pmu_lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 2e80a82a49c4 ("perf: Dynamic pmu types") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-02selftests/x86: Add clock_gettime() tests to test_vdsoAndy Lutomirski
Now that the vDSO implementation of clock_gettime() is getting reworked, add a selftest for it. This tests that its output is consistent with the syscall version. This is marked for stable to serve as a test for commit 715bd9d12f84 ("x86/vdso: Fix asm constraints on vDSO syscall fallbacks") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/082399674de2619b2befd8c0dde49b260605b126.1538422295.git.luto@kernel.org
2018-10-01r8169: fix network stalls due to missing bit TXCFG_AUTO_FIFOHeiner Kallweit
Some of the chip-specific hw_start functions set bit TXCFG_AUTO_FIFO in register TxConfig. The original patch changed the order of some calls resulting in these changes being overwritten by rtl_set_tx_config_registers() in rtl_hw_start(). This eventually resulted in network stalls especially under high load. Analyzing the chip-specific hw_start functions all chip version from 34, with the exception of version 39, need this bit set. This patch moves setting this bit to rtl_set_tx_config_registers(). Fixes: 4fd48c4ac0a0 ("r8169: move common initializations to tp->hw_start") Reported-by: Ortwin Glück <odi@odi.ch> Reported-by: David Arendt <admin@prnet.org> Root-caused-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Tested-by: Tony Atkinson <tatkinson@linux.com> Tested-by: David Arendt <admin@prnet.org> Tested-by: Ortwin Glück <odi@odi.ch> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-02x86/vdso: Fix asm constraints on vDSO syscall fallbacksAndy Lutomirski
The syscall fallbacks in the vDSO have incorrect asm constraints. They are not marked as writing to their outputs -- instead, they are marked as clobbering "memory", which is useless. In particular, gcc is smart enough to know that the timespec parameter hasn't escaped, so a memory clobber doesn't clobber it. And passing a pointer as an asm *input* does not tell gcc that the pointed-to value is changed. Add in the fact that the asm instructions weren't volatile, and gcc was free to omit them entirely unless their sole output (the return value) is used. Which it is (phew!), but that stops happening with some upcoming patches. As a trivial example, the following code: void test_fallback(struct timespec *ts) { vdso_fallback_gettime(CLOCK_MONOTONIC, ts); } compiles to: 00000000000000c0 <test_fallback>: c0: c3 retq To add insult to injury, the RCX and R11 clobbers on 64-bit builds were missing. The "memory" clobber is also unnecessary -- no ordering with respect to other memory operations is needed, but that's going to be fixed in a separate not-for-stable patch. Fixes: 2aae950b21e4 ("x86_64: Add vDSO for x86-64 with gettimeofday/clock_gettime/getcpu") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/2c0231690551989d2fafa60ed0e7b5cc8b403908.1538422295.git.luto@kernel.org
2018-10-01Merge branch 'tun-races'David S. Miller
Eric Dumazet says: ==================== tun: address two syzbot reports Small changes addressing races discovered by syzbot. First patch is a cleanup. Second patch moves a mutex init sooner. Third patch makes sure each tfile gets its own napi enable flags. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01tun: napi flags belong to tfileEric Dumazet
Since tun->flags might be shared by multiple tfile structures, it is better to make sure tun_get_user() is using the flags for the current tfile. Presence of the READ_ONCE() in tun_napi_frags_enabled() gave a hint of what could happen, but we need something stronger to please syzbot. kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 13647 Comm: syz-executor5 Not tainted 4.19.0-rc5+ #59 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427 Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4 RSP: 0018:ffff8801c400f410 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8618d325 RDX: 000000000000001a RSI: ffffffff86189f97 RDI: 00000000000000d0 RBP: ffff8801c400f608 R08: ffff8801c8fb4300 R09: 0000000000000000 R10: ffffed0038801ed7 R11: 0000000000000003 R12: ffff8801d327d358 R13: 0000000000000000 R14: ffff8801c16dd8c0 R15: 0000000000000004 FS: 00007fe003615700(0000) GS:ffff8801dac00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe1f3c43db8 CR3: 00000001bebb2000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: napi_gro_frags+0x3f4/0xc90 net/core/dev.c:5715 tun_get_user+0x31d5/0x42a0 drivers/net/tun.c:1922 tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1967 call_write_iter include/linux/fs.h:1808 [inline] new_sync_write fs/read_write.c:474 [inline] __vfs_write+0x6b8/0x9f0 fs/read_write.c:487 vfs_write+0x1fc/0x560 fs/read_write.c:549 ksys_write+0x101/0x260 fs/read_write.c:598 __do_sys_write fs/read_write.c:610 [inline] __se_sys_write fs/read_write.c:607 [inline] __x64_sys_write+0x73/0xb0 fs/read_write.c:607 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x457579 Code: 1d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fe003614c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457579 RDX: 0000000000000012 RSI: 0000000020000000 RDI: 000000000000000a RBP: 000000000072c040 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe0036156d4 R13: 00000000004c5574 R14: 00000000004d8e98 R15: 00000000ffffffff Modules linked in: RIP: 0010:dev_gro_receive+0x132/0x2720 net/core/dev.c:5427 Code: 48 c1 ea 03 80 3c 02 00 0f 85 6e 20 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 6e 10 49 8d bd d0 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 59 20 00 00 4d 8b a5 d0 00 00 00 31 ff 41 81 e4 RSP: 0018:ffff8801c400f410 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8618d325 RDX: 000000000000001a RSI: ffffffff86189f97 RDI: 00000000000000d0 RBP: ffff8801c400f608 R08: ffff8801c8fb4300 R09: 0000000000000000 R10: ffffed0038801ed7 R11: 0000000000000003 R12: ffff8801d327d358 R13: 0000000000000000 R14: ffff8801c16dd8c0 R15: 0000000000000004 FS: 00007fe003615700(0000) GS:ffff8801dac00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe1f3c43db8 CR3: 00000001bebb2000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01tun: initialize napi_mutex unconditionallyEric Dumazet
This is the first part to fix following syzbot report : console output: https://syzkaller.appspot.com/x/log.txt?x=145378e6400000 kernel config: https://syzkaller.appspot.com/x/.config?x=443816db871edd66 dashboard link: https://syzkaller.appspot.com/bug?extid=e662df0ac1d753b57e80 Following patch is fixing the race condition, but it seems safer to initialize this mutex at tfile creation anyway. Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+e662df0ac1d753b57e80@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01tun: remove unused parametersEric Dumazet
tun_napi_disable() and tun_napi_del() do not need a pointer to the tun_struct Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01bond: take rcu lock in netpoll_send_skb_on_devDave Jones
The bonding driver lacks the rcu lock when it calls down into netdev_lower_get_next_private_rcu from bond_poll_controller, which results in a trace like: WARNING: CPU: 2 PID: 179 at net/core/dev.c:6567 netdev_lower_get_next_private_rcu+0x34/0x40 CPU: 2 PID: 179 Comm: kworker/u16:15 Not tainted 4.19.0-rc5-backup+ #1 Workqueue: bond0 bond_mii_monitor RIP: 0010:netdev_lower_get_next_private_rcu+0x34/0x40 Code: 48 89 fb e8 fe 29 63 ff 85 c0 74 1e 48 8b 45 00 48 81 c3 c0 00 00 00 48 8b 00 48 39 d8 74 0f 48 89 45 00 48 8b 40 f8 5b 5d c3 <0f> 0b eb de 31 c0 eb f5 0f 1f 40 00 0f 1f 44 00 00 48 8> RSP: 0018:ffffc9000087fa68 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff880429614560 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 00000000ffffffff RDI: ffffffffa184ada0 RBP: ffffc9000087fa80 R08: 0000000000000001 R09: 0000000000000000 R10: ffffc9000087f9f0 R11: ffff880429798040 R12: ffff8804289d5980 R13: ffffffffa1511f60 R14: 00000000000000c8 R15: 00000000ffffffff FS: 0000000000000000(0000) GS:ffff88042f880000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f4b78fce180 CR3: 000000018180f006 CR4: 00000000001606e0 Call Trace: bond_poll_controller+0x52/0x170 netpoll_poll_dev+0x79/0x290 netpoll_send_skb_on_dev+0x158/0x2c0 netpoll_send_udp+0x2d5/0x430 write_ext_msg+0x1e0/0x210 console_unlock+0x3c4/0x630 vprintk_emit+0xfa/0x2f0 printk+0x52/0x6e ? __netdev_printk+0x12b/0x220 netdev_info+0x64/0x80 ? bond_3ad_set_carrier+0xe9/0x180 bond_select_active_slave+0x1fc/0x310 bond_mii_monitor+0x709/0x9b0 process_one_work+0x221/0x5e0 worker_thread+0x4f/0x3b0 kthread+0x100/0x140 ? process_one_work+0x5e0/0x5e0 ? kthread_delayed_work_timer_fn+0x90/0x90 ret_from_fork+0x24/0x30 We're also doing rcu dereferences a layer up in netpoll_send_skb_on_dev before we call down into netpoll_poll_dev, so just take the lock there. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01rtnetlink: Fail dump if target netnsid is invalidDavid Ahern
Link dumps can return results from a target namespace. If the namespace id is invalid, then the dump request should fail if get_target_net fails rather than continuing with a dump of the current namespace. Fixes: 79e1ad148c844 ("rtnetlink: use netnsid to query interface") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01Revert "openvswitch: Fix template leak in error cases."Flavio Leitner
This reverts commit 90c7afc96cbbd77f44094b5b651261968e97de67. When the commit was merged, the code used nf_ct_put() to free the entry, but later on commit 76644232e612 ("openvswitch: Free tmpl with tmpl_free.") replaced that with nf_ct_tmpl_free which is a more appropriate. Now the original problem is removed. Then 44d6e2f27328 ("net: Replace NF_CT_ASSERT() with WARN_ON().") replaced a debug assert with a WARN_ON() which is trigged now. Signed-off-by: Flavio Leitner <fbl@redhat.com> Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01Merge branch 'for-upstream' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Johan Hedberg says: ==================== pull request: bluetooth 2018-09-27 Here's one more Bluetooth fix for 4.19, fixing the handling of an attempt to unpair a device while pairing is in progress. Let me know if there are any issues pulling. Thanks. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01tipc: ignore STATE_MSG on wrong link sessionLUU Duc Canh
The initial session number when a link is created is based on a random value, taken from struct tipc_net->random. It is then incremented for each link reset to avoid mixing protocol messages from different link sessions. However, when a bearer is reset all its links are deleted, and will later be re-created using the same random value as the first time. This means that if the link never went down between creation and deletion we will still sometimes have two subsequent sessions with the same session number. In virtual environments with potentially long transmission times this has turned out to be a real problem. We now fix this by randomizing the session number each time a link is created. With a session number size of 16 bits this gives a risk of session collision of 1/64k. To reduce this further, we also introduce a sanity check on the very first STATE message arriving at a link. If this has an acknowledge value differing from 0, which is logically impossible, we ignore the message. The final risk for session collision is hence reduced to 1/4G, which should be sufficient. Signed-off-by: LUU Duc Canh <canh.d.luu@dektech.com.au> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01net: sched: act_ipt: check for underflow in __tcf_ipt_init()Dan Carpenter
If "td->u.target_size" is larger than sizeof(struct xt_entry_target) we return -EINVAL. But we don't check whether it's smaller than sizeof(struct xt_entry_target) and that could lead to an out of bounds read. Fixes: 7ba699c604ab ("[NET_SCHED]: Convert actions from rtnetlink to new netlink API") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2018-10-01 1) Validate address prefix lengths in the xfrm selector, otherwise we may hit undefined behaviour in the address matching functions if the prefix is too big for the given address family. 2) Fix skb leak on local message size errors. From Thadeu Lima de Souza Cascardo. 3) We currently reset the transport header back to the network header after a transport mode transformation is applied. This leads to an incorrect transport header when multiple transport mode transformations are applied. Reset the transport header only after all transformations are already applied to fix this. From Sowmini Varadhan. 4) We only support one offloaded xfrm, so reset crypto_done after the first transformation in xfrm_input(). Otherwise we may call the wrong input method for subsequent transformations. From Sowmini Varadhan. 5) Fix NULL pointer dereference when skb_dst_force clears the dst_entry. skb_dst_force does not really force a dst refcount anymore, it might clear it instead. xfrm code did not expect this, add a check to not dereference skb_dst() if it was cleared by skb_dst_force. 6) Validate xfrm template mode, otherwise we can get a stack-out-of-bounds read in xfrm_state_find. From Sean Tranchetti. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01usb: xhci-mtk: resume USB3 roothub firstChunfeng Yun
Give USB3 devices a better chance to enumerate at USB3 speeds if they are connected to a suspended host. Porting from "671ffdff5b13 xhci: resume USB 3 roothub first" Cc: <stable@vger.kernel.org> Signed-off-by: Chunfeng Yun <chunfeng.yun@mediatek.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-01xhci: Add missing CAS workaround for Intel Sunrise Point xHCIMathias Nyman
The workaround for missing CAS bit is also needed for xHC on Intel sunrisepoint PCH. For more details see: Intel 100/c230 series PCH specification update Doc #332692-006 Errata #8 Cc: <stable@vger.kernel.org> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-01usb: cdc_acm: Do not leak URB buffersRomain Izard
When the ACM TTY port is disconnected, the URBs it uses must be killed, and then the buffers must be freed. Unfortunately a previous refactor removed the code freeing the buffers because it looked extremely similar to the code killing the URBs. As a result, there were many new leaks for each plug/unplug cycle of a CDC-ACM device, that were detected by kmemleak. Restore the missing code, and the memory leak is removed. Fixes: ba8c931ded8d ("cdc-acm: refactor killing urbs") Signed-off-by: Romain Izard <romain.izard.pro@gmail.com> Acked-by: Oliver Neukum <oneukum@suse.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-01Merge tag 'usb-serial-4.19-rc7' of ↵Greg Kroah-Hartman
https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus Johan writes: USB-serial fixes for v4.19-rc7 Here are some device-id patches for 4.19-rc7. Some Quectel modems have a vendor command which can be used to disable certain interfaces in their configurations, but unlike some other modems this also causes the interface numbers to change. These patches allow us to support all such interface permutations at least for the Quectel EP06. All have been in linux-next with no reported issues. Signed-off-by: Johan Hovold <johan@kernel.org> * tag 'usb-serial-4.19-rc7' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial: USB: serial: simple: add Motorola Tetra MTP6550 id USB: serial: option: add two-endpoints device-id flag USB: serial: option: improve Quectel EP06 detection
2018-10-01Merge tag 'arm64-fixes' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Will writes: "Late arm64 fixes - Fix handling of young contiguous ptes for hugetlb mappings - Fix livelock when taking access faults on contiguous hugetlb mappings - Tighten up register accesses via KVM SET_ONE_REG ioctl()s" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: KVM: Sanitize PSTATE.M when being set from userspace arm64: KVM: Tighten guest core register access from userspace arm64: hugetlb: Avoid unnecessary clearing in huge_ptep_set_access_flags arm64: hugetlb: Fix handling of young ptes
2018-10-01Merge tag 'armsoc-fixes' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Olof writes: "ARM: SoC fixes A handful of fixes that have been coming in the last couple of weeks: - Freescale fixes for on-chip accellerators - A DT fix for stm32 to avoid fallback to non-DMA SPI mode - Fixes for badly specified interrupts on BCM63xx SoCs - Allwinner A64 HDMI was incorrectly specified as fully compatble with R40 - Drive strength fix for SAMA5D2 NAND pins on one board" * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: ARM: dts: stm32: update SPI6 dmas property on stm32mp157c soc: fsl: qe: Fix copy/paste bug in ucc_get_tdm_sync_shift() soc: fsl: qbman: qman: avoid allocating from non existing gen_pool ARM: dts: BCM63xx: Fix incorrect interrupt specifiers MAINTAINERS: update the Annapurna Labs maintainer email ARM: dts: sun8i: drop A64 HDMI PHY fallback compatible from R40 DT ARM: dts: at91: sama5d2_ptc_ek: fix nand pinctrl
2018-10-01Merge tag 'pstore-v4.19-rc7' of ↵Greg Kroah-Hartman
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Kees writes: "Pstore fixes for v4.19-rc7 - Fix failure-path memory leak in ramoops_init (nixiaoming)" * tag 'pstore-v4.19-rc7' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: Fix failure-path memory leak in ramoops_init
2018-10-02lib/xz: Put CRC32_POLY_LE in xz_private.hJoel Stanley
This fixes a regression introduced by faa16bc404d72a5 ("lib: Use existing define with polynomial"). The cleanup added a dependency on include/linux, which broke the PowerPC boot wrapper/decompresser when KERNEL_XZ is enabled: BOOTCC arch/powerpc/boot/decompress.o In file included from arch/powerpc/boot/../../../lib/decompress_unxz.c:233, from arch/powerpc/boot/decompress.c:42: arch/powerpc/boot/../../../lib/xz/xz_crc32.c:18:10: fatal error: linux/crc32poly.h: No such file or directory #include <linux/crc32poly.h> ^~~~~~~~~~~~~~~~~~~ The powerpc decompresser is a hairy corner of the kernel. Even while building a 64-bit kernel it needs to build a 32-bit binary and therefore avoid including files from include/linux. This allows users of the xz library to avoid including headers from 'include/linux/' while still achieving the cleanup of the magic number. Fixes: faa16bc404d72a5 ("lib: Use existing define with polynomial") Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: kbuild test robot <lkp@intel.com> Suggested-by: Christophe LEROY <christophe.leroy@c-s.fr> Signed-off-by: Joel Stanley <joel@jms.id.au> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-01tcp/dccp: fix lockdep issue when SYN is backloggedEric Dumazet
In normal SYN processing, packets are handled without listener lock and in RCU protected ingress path. But syzkaller is known to be able to trick us and SYN packets might be processed in process context, after being queued into socket backlog. In commit 06f877d613be ("tcp/dccp: fix other lockdep splats accessing ireq_opt") I made a very stupid fix, that happened to work mostly because of the regular path being RCU protected. Really the thing protecting ireq->ireq_opt is RCU read lock, and the pseudo request refcnt is not relevant. This patch extends what I did in commit 449809a66c1d ("tcp/dccp: block BH for SYN processing") by adding an extra rcu_read_{lock|unlock} pair in the paths that might be taken when processing SYN from socket backlog (thus possibly in process context) Fixes: 06f877d613be ("tcp/dccp: fix other lockdep splats accessing ireq_opt") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree: 1) Skip ip_sabotage_in() for packet making into the VRF driver, otherwise packets are dropped, from David Ahern. 2) Clang compilation warning uncovering typo in the nft_validate_register_store() call from nft_osf, from Stefan Agner. 3) Double sizeof netlink message length calculations in ctnetlink, from zhong jiang. 4) Missing rb_erase() on batch full in rbtree garbage collector, from Taehee Yoo. 5) Calm down compilation warning in nf_hook(), from Florian Westphal. 6) Missing check for non-null sk in xt_socket before validating netns procedence, from Flavio Leitner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-01PCI: mvebu: Fix PCI I/O mapping creation sequenceThomas Petazzoni
Commit ee1604381a371 ("PCI: mvebu: Only remap I/O space if configured") had the side effect that the PCI I/O mapping was created much earlier than before, at a point where the probe() of the driver could still fail. This is for example a problem if one gets an -EPROBE_DEFER at some point during probe(), after pci_ioremap_io() has been called. Indeed, there is currently no function to undo what pci_ioremap_io() did, and switching to pci_remap_iospace() is not an option in pci-mvebu due to the need for special memory attributes on Armada 38x. Reverting ee1604381a371 ("PCI: mvebu: Only remap I/O space if configured") would be a possibility, but it would require also reverting 42342073e38b5 ("PCI: mvebu: Convert to use pci_host_bridge directly"). So instead, we use an open-coded version of pci_host_probe() that creates the PCI I/O mapping at a point where we are guaranteed not to fail anymore. Fixes: ee1604381a371 ("PCI: mvebu: Only remap I/O space if configured") Reported-by: Jan Kundrát <jan.kundrat@cesnet.cz> Tested-by: Jan Kundrát <jan.kundrat@cesnet.cz> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
2018-10-01net/mlx5e: Set vlan masks for all offloaded TC rulesJianbo Liu
In flow steering, if asked to, the hardware matches on the first ethertype which is not vlan. It's possible to set a rule as follows, which is meant to match on untagged packet, but will match on a vlan packet: tc filter add dev eth0 parent ffff: protocol ip flower ... To avoid this for packets with single tag, we set vlan masks to tell hardware to check the tags for every matched packet. Fixes: 095b6cfd69ce ('net/mlx5e: Add TC vlan match parsing') Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-01net/mlx5: E-Switch, Fix out of bound access when setting vport rateEran Ben Elisha
The code that deals with eswitch vport bw guarantee was going beyond the eswitch vport array limit, fix that. This was pointed out by the kernel address sanitizer (KASAN). The error from KASAN log: [2018-09-15 15:04:45] BUG: KASAN: slab-out-of-bounds in mlx5_eswitch_set_vport_rate+0x8c1/0xae0 [mlx5_core] Fixes: c9497c98901c ("net/mlx5: Add support for setting VF min rate") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-01net/mlx5e: Avoid unbounded peer devices when unpairing TC hairpin rulesAlaa Hleihel
If the peer device was already unbound, then do not attempt to modify it's resources, otherwise we will crash on dereferencing non-existing device. Fixes: 5c65c564c962 ("net/mlx5e: Support offloading TC NIC hairpin flows") Signed-off-by: Alaa Hleihel <alaa@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-01drm/i915: Avoid compiler warning for maybe unused gu_misc_iirChris Wilson
/kisskb/src/drivers/gpu/drm/i915/i915_irq.c: warning: 'gu_misc_iir' may be used uninitialized in this function [-Wuninitialized]: => 3120:10 Silence the compiler warning by ensuring that the local variable is initialised and removing the guard that is confusing the older gcc. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: df0d28c185ad ("drm/i915/icl: GSE interrupt moves from DE_MISC to GU_MISC") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Paulo Zanoni <paulo.r.zanoni@intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180926104718.17462-1-chris@chris-wilson.co.uk (cherry picked from commit 7a90938332d80faf973fbcffdf6e674e7b8f0914) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2018-10-01drm/i915: Do not redefine the has_csr parameter.Anusha Srivatsa
Let us reuse the already defined has_csr check and not redefine it. The main difference is that in effect this will flip .has_csr to 1 (via GEN9_FEATURES which GEN11_FEATURES pulls in). Suggested-by: Imre Deak <imre.deak@intel.com> Cc: Imre Deak <imre.deak@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Anusha Srivatsa <anusha.srivatsa@intel.com> Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=107382 Reviewed-by: Imre Deak <imre.deak@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/1534527210-16841-1-git-send-email-anusha.srivatsa@intel.com (cherry picked from commit da4468a1aa75457e6134127b19761b7ba62ce945) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2018-10-01KVM: x86: fix L1TF's MMIO GFN calculationSean Christopherson
One defense against L1TF in KVM is to always set the upper five bits of the *legal* physical address in the SPTEs for non-present and reserved SPTEs, e.g. MMIO SPTEs. In the MMIO case, the GFN of the MMIO SPTE may overlap with the upper five bits that are being usurped to defend against L1TF. To preserve the GFN, the bits of the GFN that overlap with the repurposed bits are shifted left into the reserved bits, i.e. the GFN in the SPTE will be split into high and low parts. When retrieving the GFN from the MMIO SPTE, e.g. to check for an MMIO access, get_mmio_spte_gfn() unshifts the affected bits and restores the original GFN for comparison. Unfortunately, get_mmio_spte_gfn() neglects to mask off the reserved bits in the SPTE that were used to store the upper chunk of the GFN. As a result, KVM fails to detect MMIO accesses whose GPA overlaps the repurprosed bits, which in turn causes guest panics and hangs. Fix the bug by generating a mask that covers the lower chunk of the GFN, i.e. the bits that aren't shifted by the L1TF mitigation. The alternative approach would be to explicitly zero the five reserved bits that are used to store the upper chunk of the GFN, but that requires additional run-time computation and makes an already-ugly bit of code even more inscrutable. I considered adding a WARN_ON_ONCE(low_phys_bits-1 <= PAGE_SHIFT) to warn if GENMASK_ULL() generated a nonsensical value, but that seemed silly since that would mean a system that supports VMX has less than 18 bits of physical address space... Reported-by: Sakari Ailus <sakari.ailus@iki.fi> Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs") Cc: Junaid Shahid <junaids@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Reviewed-by: Junaid Shahid <junaids@google.com> Tested-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01tools/kvm_stat: cut down decimal places in update interval dialogStefan Raspl
We currently display the default number of decimal places for floats in _show_set_update_interval(), which is quite pointless. Cutting down to a single decimal place. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGSLiran Alon
L2 IA32_BNDCFGS should be updated with vmcs12->guest_bndcfgs only when VM_ENTRY_LOAD_BNDCFGS is specified in vmcs12->vm_entry_controls. Otherwise, L2 IA32_BNDCFGS should be set to vmcs01->guest_bndcfgs which is L1 IA32_BNDCFGS. Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01KVM: x86: Do not use kvm_x86_ops->mpx_supported() directlyLiran Alon
Commit a87036add092 ("KVM: x86: disable MPX if host did not enable MPX XSAVE features") introduced kvm_mpx_supported() to return true iff MPX is enabled in the host. However, that commit seems to have missed replacing some calls to kvm_x86_ops->mpx_supported() to kvm_mpx_supported(). Complete original commit by replacing remaining calls to kvm_mpx_supported(). Fixes: a87036add092 ("KVM: x86: disable MPX if host did not enable MPX XSAVE features") Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabledLiran Alon
Before this commit, KVM exposes MPX VMX controls to L1 guest only based on if KVM and host processor supports MPX virtualization. However, these controls should be exposed to guest only in case guest vCPU supports MPX. Without this change, a L1 guest running with kernel which don't have commit 691bd4340bef ("kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS") asserts in QEMU on the following: qemu-kvm: error: failed to set MSR 0xd90 to 0x0 qemu-kvm: .../qemu-2.10.0/target/i386/kvm.c:1801 kvm_put_msrs: Assertion 'ret == cpu->kvm_msr_buf->nmsrs failed' This is because L1 KVM kvm_init_msr_list() will see that vmx_mpx_supported() (As it only checks MPX VMX controls support) and therefore KVM_GET_MSR_INDEX_LIST IOCTL will include MSR_IA32_BNDCFGS. However, later when L1 will attempt to set this MSR via KVM_SET_MSRS IOCTL, it will fail because !guest_cpuid_has_mpx(vcpu). Therefore, fix the issue by exposing MPX VMX controls to L1 guest only when vCPU supports MPX. Fixes: 36be0b9deb23 ("KVM: x86: Add nested virtualization support for MPX") Reported-by: Eyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-01arm64: KVM: Sanitize PSTATE.M when being set from userspaceMarc Zyngier
Not all execution modes are valid for a guest, and some of them depend on what the HW actually supports. Let's verify that what userspace provides is compatible with both the VM settings and the HW capabilities. Cc: <stable@vger.kernel.org> Fixes: 0d854a60b1d7 ("arm64: KVM: enable initialization of a 32bit vcpu") Reviewed-by: Christoffer Dall <christoffer.dall@arm.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-10-01arm64: KVM: Tighten guest core register access from userspaceDave Martin
We currently allow userspace to access the core register file in about any possible way, including straddling multiple registers and doing unaligned accesses. This is not the expected use of the ABI, and nobody is actually using it that way. Let's tighten it by explicitly checking the size and alignment for each field of the register file. Cc: <stable@vger.kernel.org> Fixes: 2f4a07c5f9fe ("arm64: KVM: guest one-reg interface") Reviewed-by: Christoffer Dall <christoffer.dall@arm.com> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Dave Martin <Dave.Martin@arm.com> [maz: rewrote Dave's initial patch to be more easily backported] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>