Age | Commit message (Collapse) | Author |
|
A CFS (SCHED_OTHER, SCHED_BATCH or SCHED_IDLE policy) task's
se->runnable_weight must always be in sync with its se->load.weight.
se->runnable_weight is set to se->load.weight when the task is
forked (init_entity_runnable_average()) or reniced (reweight_entity()).
There are two cases in set_load_weight() which since they currently only
set se->load.weight could lead to a situation in which se->load.weight
is different to se->runnable_weight for a CFS task:
(1) A task switches to SCHED_IDLE.
(2) A SCHED_FIFO, SCHED_RR or SCHED_DEADLINE task which has been reniced
(during which only its static priority gets set) switches to
SCHED_OTHER or SCHED_BATCH.
Set se->runnable_weight to se->load.weight in these two cases to prevent
this. This eliminates the need to explicitly set it to se->load.weight
during PELT updates in the CFS scheduler fastpath.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20180803140538.1178-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
LB_BIAS allows the adjustment on how conservative load should be
balanced.
The rq->cpu_load[idx] array is used for this functionality. It contains
weighted CPU load decayed average values over different intervals
(idx = 1..4). Idx = 0 is the weighted CPU load itself.
The values are updated during scheduler_tick, before idle balance and at
nohz exit.
There are 5 different types of idx's per sched domain (sd). Each of them
is used to index into the rq->cpu_load[idx] array in a specific scenario
(busy, idle and newidle for load balancing, forkexec for wake-up
slow-path load balancing and wake for affine wakeup based on weight).
Only the sd idx's for busy and idle load balancing are set to 2,3 or 1,2
respectively. All the other sd idx's are set to 0.
Conservative load balancing is achieved for sd idx's >= 1 by using the
min/max (source_load()/target_load()) value between the current weighted
CPU load and the rq->cpu_load[sd idx -1] for the busiest(idlest)/local
CPU load in load balancing or vice versa in the wake-up slow-path load
balancing.
There is no conservative balancing for sd idx = 0 since only current
weighted CPU load is used in this case.
It is very likely that LB_BIAS' influence on load balancing can be
neglected (see test results below). This is further supported by:
(1) Weighted CPU load today is by itself a decayed average value (PELT)
(cfs_rq->avg->runnable_load_avg) and not the instantaneous load
(rq->load.weight) it was when LB_BIAS was introduced.
(2) Sd imbalance_pct is used for CPU_NEWLY_IDLE and CPU_NOT_IDLE (relate
to sd's newidle and busy idx) in find_busiest_group() when comparing
busiest and local avg load to make load balancing even more
conservative.
(3) The sd forkexec and newidle idx are always set to 0 so there is no
adjustment on how conservatively load balancing is done here.
(4) Affine wakeup based on weight (wake_affine_weight()) will not be
impacted since the sd wake idx is always set to 0.
Let's disable LB_BIAS by default for a few kernel releases to make sure
that no workload and no scheduler topology is affected. The benefit of
being able to remove the LB_BIAS dependency from source_load() and
target_load() is that the entire rq->cpu_load[idx] code could be removed
in this case.
It is really hard to say if there is no regression w/o testing this with
a lot of different workloads on a lot of different platforms, especially
NUMA machines.
The following 104 LKP (Linux Kernel Performance) tests were run by the
0-Day guys mostly on multi-socket hosts with a larger number of logical
cpus (88, 192).
The base for the test was commit b3dae109fa89 ("sched/swait: Rename to
exclusive") (tip/sched/core v4.18-rc1).
Only 2 out of the 104 tests had a significant change in one of the
metrics (fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance +7%
files_per_sec, unixbench/300s-100%-syscall-performance -11% score).
Tests which showed a change in one of the metrics are marked with a '*'
and this change is listed as well.
(a) lkp-bdw-ep3:
88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 64G
dd-write/10m-1HDD-cfq-btrfs-100dd-performance
fsmark/1x-1t-1HDD-xfs-nfsv4-4M-60G-NoSync-performance
* fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance
7.50 7% 8.00 ± 6% fsmark.files_per_sec
fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-fsyncBeforeClose-performance
fsmark/1x-1t-1HDD-btrfs-4M-60G-NoSync-performance
fsmark/1x-1t-1HDD-btrfs-4M-60G-fsyncBeforeClose-performance
kbuild/300s-50%-vmlinux_prereq-performance
kbuild/300s-200%-vmlinux_prereq-performance
kbuild/300s-50%-vmlinux_prereq-performance-1HDD-ext4
kbuild/300s-200%-vmlinux_prereq-performance-1HDD-ext4
(b) lkp-skl-4sp1:
192 threads Intel(R) Xeon(R) Platinum 8160 768G
dbench/100%-performance
ebizzy/200%-100x-10s-performance
hackbench/1600%-process-pipe-performance
iperf/300s-cs-localhost-tcp-performance
iperf/300s-cs-localhost-udp-performance
perf-bench-numa-mem/2t-300M-performance
perf-bench-sched-pipe/10000000ops-process-performance
perf-bench-sched-pipe/10000000ops-threads-performance
schbench/2-16-300-30000-30000-performance
tbench/100%-cs-localhost-performance
(c) lkp-bdw-ep6:
88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 128G
stress-ng/100%-60s-pipe-performance
unixbench/300s-1-whetstone-double-performance
unixbench/300s-1-shell1-performance
unixbench/300s-1-shell8-performance
unixbench/300s-1-pipe-performance
* unixbench/300s-1-context1-performance
312 315 unixbench.score
unixbench/300s-1-spawn-performance
unixbench/300s-1-syscall-performance
unixbench/300s-1-dhry2reg-performance
unixbench/300s-1-fstime-performance
unixbench/300s-1-fsbuffer-performance
unixbench/300s-1-fsdisk-performance
unixbench/300s-100%-whetstone-double-performance
unixbench/300s-100%-shell1-performance
unixbench/300s-100%-shell8-performance
unixbench/300s-100%-pipe-performance
unixbench/300s-100%-context1-performance
unixbench/300s-100%-spawn-performance
* unixbench/300s-100%-syscall-performance
3571 ± 3% -11% 3183 ± 4% unixbench.score
unixbench/300s-100%-dhry2reg-performance
unixbench/300s-100%-fstime-performance
unixbench/300s-100%-fsbuffer-performance
unixbench/300s-100%-fsdisk-performance
unixbench/300s-1-execl-performance
unixbench/300s-100%-execl-performance
* will-it-scale/brk1-performance
365004 360387 will-it-scale.per_thread_ops
* will-it-scale/dup1-performance
432401 437596 will-it-scale.per_thread_ops
will-it-scale/eventfd1-performance
will-it-scale/futex1-performance
will-it-scale/futex2-performance
will-it-scale/futex3-performance
will-it-scale/futex4-performance
will-it-scale/getppid1-performance
will-it-scale/lock1-performance
will-it-scale/lseek1-performance
will-it-scale/lseek2-performance
* will-it-scale/malloc1-performance
47025 45817 will-it-scale.per_thread_ops
77499 76529 will-it-scale.per_process_ops
will-it-scale/malloc2-performance
* will-it-scale/mmap1-performance
123399 120815 will-it-scale.per_thread_ops
152219 149833 will-it-scale.per_process_ops
* will-it-scale/mmap2-performance
107327 104714 will-it-scale.per_thread_ops
136405 133765 will-it-scale.per_process_ops
will-it-scale/open1-performance
* will-it-scale/open2-performance
171570 168805 will-it-scale.per_thread_ops
532644 526202 will-it-scale.per_process_ops
will-it-scale/page_fault1-performance
will-it-scale/page_fault2-performance
will-it-scale/page_fault3-performance
will-it-scale/pipe1-performance
will-it-scale/poll1-performance
* will-it-scale/poll2-performance
176134 172848 will-it-scale.per_thread_ops
281361 275053 will-it-scale.per_process_ops
will-it-scale/posix_semaphore1-performance
will-it-scale/pread1-performance
will-it-scale/pread2-performance
will-it-scale/pread3-performance
will-it-scale/pthread_mutex1-performance
will-it-scale/pthread_mutex2-performance
will-it-scale/pwrite1-performance
will-it-scale/pwrite2-performance
will-it-scale/pwrite3-performance
* will-it-scale/read1-performance
1190563 1174833 will-it-scale.per_thread_ops
* will-it-scale/read2-performance
1105369 1080427 will-it-scale.per_thread_ops
will-it-scale/readseek1-performance
* will-it-scale/readseek2-performance
261818 259040 will-it-scale.per_thread_ops
will-it-scale/readseek3-performance
* will-it-scale/sched_yield-performance
2408059 2382034 will-it-scale.per_thread_ops
will-it-scale/signal1-performance
will-it-scale/unix1-performance
will-it-scale/unlink1-performance
will-it-scale/unlink2-performance
* will-it-scale/write1-performance
976701 961588 will-it-scale.per_thread_ops
* will-it-scale/writeseek1-performance
831898 822448 will-it-scale.per_thread_ops
* will-it-scale/writeseek2-performance
228248 225065 will-it-scale.per_thread_ops
* will-it-scale/writeseek3-performance
226670 224058 will-it-scale.per_thread_ops
will-it-scale/context_switch1-performance
aim7/performance-fork_test-2000
* aim7/performance-brk_test-3000
74869 76676 aim7.jobs-per-min
aim7/performance-disk_cp-3000
aim7/performance-disk_rd-3000
aim7/performance-sieve-3000
aim7/performance-page_test-3000
aim7/performance-creat-clo-3000
aim7/performance-mem_rtns_1-8000
aim7/performance-disk_wrt-8000
aim7/performance-pipe_cpy-8000
aim7/performance-ram_copy-8000
(d) lkp-avoton3:
8 threads Intel(R) Atom(TM) CPU C2750 @ 2.40GHz 16G
netperf/ipv4-900s-200%-cs-localhost-TCP_STREAM-performance
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Li Zhijian <zhijianx.li@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180809135753.21077-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Create a config for enabling irq load tracking in the scheduler.
irq load tracking is useful only when irq or paravirtual time is
accounted but it's only possible with SMP for now.
Also use __maybe_unused to remove the compilation warning in
update_rq_clock_task() that has been introduced by:
2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
Suggested-by: Ingo Molnar <mingo@redhat.com>
Reported-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: dou_liyang@163.com
Fixes: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
If NUMA improvement from the task migration is going to be very
minimal, then avoid task migration.
Specjbb2005 results (8 warehouses)
Higher bops are better
2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 198512 205910 3.72673
1 313559 318491 1.57291
2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
8 74761.9 74935.9 0.232739
1 214874 226796 5.54837
2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 180536 189780 5.12031
1 210281 205695 -2.18089
4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 56511.4 60370 6.828
1 104899 108100 3.05151
1/7 cases is regressing, if we look at events migrate_pages seem
to vary the most especially in the regressing case. Also some
amount of variance is expected between different runs of
Specjbb2005.
Some events stats before and after applying the patch.
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,818,546 13,801,554
migrations 1,149,960 1,151,541
faults 385,583 433,246
cache-misses 55,259,546,768 55,168,691,835
sched:sched_move_numa 2,257 2,551
sched:sched_stick_numa 9 24
sched:sched_swap_numa 512 904
migrate:mm_migrate_pages 2,225 1,571
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 72692 113682
numa_hint_faults_local 62270 102163
numa_hit 238762 240181
numa_huge_pte_updates 48 36
numa_interleave 75 64
numa_local 238676 240103
numa_other 86 78
numa_pages_migrated 2225 1564
numa_pte_updates 98557 134080
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,173,490 3,079,150
migrations 36,966 31,455
faults 108,776 99,081
cache-misses 12,200,075,320 11,588,126,740
sched:sched_move_numa 1,264 1
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 899 36
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 21109 430
numa_hint_faults_local 17120 77
numa_hit 72934 71277
numa_huge_pte_updates 42 0
numa_interleave 33 22
numa_local 72866 71218
numa_other 68 59
numa_pages_migrated 915 23
numa_pte_updates 42326 0
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,312,022 8,707,565
migrations 231,705 171,342
faults 310,242 310,820
cache-misses 402,324,573 136,115,400
sched:sched_move_numa 193 215
sched:sched_stick_numa 0 6
sched:sched_swap_numa 3 24
migrate:mm_migrate_pages 93 162
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 11838 8985
numa_hint_faults_local 11216 8154
numa_hit 90689 93819
numa_huge_pte_updates 0 0
numa_interleave 1579 882
numa_local 89634 93496
numa_other 1055 323
numa_pages_migrated 92 169
numa_pte_updates 12109 9217
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,170,481 2,152,072
migrations 10,126 10,704
faults 160,962 164,376
cache-misses 10,834,845 3,818,437
sched:sched_move_numa 10 16
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 7
migrate:mm_migrate_pages 2 199
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 403 2248
numa_hint_faults_local 358 1666
numa_hit 25898 25704
numa_huge_pte_updates 0 0
numa_interleave 207 200
numa_local 25860 25679
numa_other 38 25
numa_pages_migrated 2 197
numa_pte_updates 400 2234
perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 110,339,633 93,330,595
migrations 4,139,812 4,122,061
faults 863,622 865,979
cache-misses 231,838,045,660 225,395,083,479
sched:sched_move_numa 2,196 2,372
sched:sched_stick_numa 33 24
sched:sched_swap_numa 544 769
migrate:mm_migrate_pages 2,469 1,677
vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 85748 91638
numa_hint_faults_local 66831 78096
numa_hit 242213 242225
numa_huge_pte_updates 0 0
numa_interleave 0 2
numa_local 242211 242219
numa_other 2 6
numa_pages_migrated 2376 1515
numa_pte_updates 86233 92274
perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 59,331,057 51,487,271
migrations 552,019 537,170
faults 266,586 256,921
cache-misses 73,796,312,990 70,073,831,187
sched:sched_move_numa 981 576
sched:sched_stick_numa 54 24
sched:sched_swap_numa 286 327
migrate:mm_migrate_pages 713 726
vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 14807 12000
numa_hint_faults_local 5738 5024
numa_hit 36230 36470
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 36228 36465
numa_other 2 5
numa_pages_migrated 703 726
numa_pte_updates 14742 11930
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-7-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
migrate_task_rq_fair() resets the scan rate for NUMA balancing on every
cross-node migration. In the event of excessive load balancing due to
saturation, this may result in the scan rate being pegged at maximum and
further overloading the machine.
This patch only resets the scan if NUMA balancing is active, a preferred
node has been selected and the task is being migrated from the preferred
node as these are the most harmful. For example, a migration to the preferred
node does not justify a faster scan rate. Similarly, a migration between two
nodes that are not preferred is probably bouncing due to over-saturation of
the machine. In that case, scanning faster and trapping more NUMA faults
will further overload the machine.
Specjbb2005 results (8 warehouses)
Higher bops are better
2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 203370 205332 0.964744
1 328431 319785 -2.63252
2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 206070 206585 0.249915
2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 188386 189162 0.41192
1 201566 213760 6.04963
4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 59157.4 58736.8 -0.710985
1 105495 105419 -0.0720413
Some events stats before and after applying the patch.
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,825,492 14,285,708
migrations 1,152,509 1,180,621
faults 371,948 339,114
cache-misses 55,654,206,041 55,205,631,894
sched:sched_move_numa 1,856 843
sched:sched_stick_numa 4 6
sched:sched_swap_numa 428 219
migrate:mm_migrate_pages 898 365
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 57146 26907
numa_hint_faults_local 51612 24279
numa_hit 238164 239771
numa_huge_pte_updates 16 0
numa_interleave 63 68
numa_local 238085 239688
numa_other 79 83
numa_pages_migrated 883 363
numa_pte_updates 67540 27415
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,288,525 3,202,779
migrations 38,652 37,186
faults 111,678 106,076
cache-misses 12,111,197,376 12,024,873,744
sched:sched_move_numa 900 931
sched:sched_stick_numa 0 0
sched:sched_swap_numa 5 1
migrate:mm_migrate_pages 714 637
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 18572 17409
numa_hint_faults_local 14850 14367
numa_hit 73197 73953
numa_huge_pte_updates 11 20
numa_interleave 25 25
numa_local 73138 73892
numa_other 59 61
numa_pages_migrated 712 668
numa_pte_updates 24021 27276
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,451,543 8,474,013
migrations 202,804 254,934
faults 310,024 320,506
cache-misses 253,522,507 110,580,458
sched:sched_move_numa 213 725
sched:sched_stick_numa 0 0
sched:sched_swap_numa 2 7
migrate:mm_migrate_pages 88 145
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 11830 22797
numa_hint_faults_local 11301 21539
numa_hit 90038 89308
numa_huge_pte_updates 0 0
numa_interleave 855 865
numa_local 89796 88955
numa_other 242 353
numa_pages_migrated 88 149
numa_pte_updates 12039 22930
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,049,153 2,195,628
migrations 11,405 11,179
faults 162,309 149,656
cache-misses 7,203,343 8,117,515
sched:sched_move_numa 22 49
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 1 5
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 1693 3577
numa_hint_faults_local 1669 3476
numa_hit 25177 26142
numa_huge_pte_updates 0 0
numa_interleave 194 358
numa_local 24993 26042
numa_other 184 100
numa_pages_migrated 1 5
numa_pte_updates 1577 3587
perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 94,515,937 100,602,296
migrations 4,203,554 4,135,630
faults 832,697 789,256
cache-misses 226,248,698,331 226,160,621,058
sched:sched_move_numa 1,730 1,366
sched:sched_stick_numa 14 16
sched:sched_swap_numa 432 374
migrate:mm_migrate_pages 1,398 1,350
vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 80079 47857
numa_hint_faults_local 68620 39768
numa_hit 241187 240165
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 241186 240165
numa_other 1 0
numa_pages_migrated 1347 1224
numa_pte_updates 80729 48354
perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 63,704,961 58,515,496
migrations 573,404 564,845
faults 230,878 245,807
cache-misses 76,568,222,781 73,603,757,976
sched:sched_move_numa 509 996
sched:sched_stick_numa 31 10
sched:sched_swap_numa 182 193
migrate:mm_migrate_pages 541 646
vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 8501 13422
numa_hint_faults_local 2960 5619
numa_hit 35526 36118
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 35526 36116
numa_other 0 2
numa_pages_migrated 539 616
numa_pte_updates 8433 13374
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-5-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Currently task scan rate is reset when NUMA balancer migrates the task
to a different node. If NUMA balancer initiates a swap, reset is only
applicable to the task that initiates the swap. Similarly no scan rate
reset is done if the task is migrated across nodes by traditional load
balancer.
Instead move the scan reset to the migrate_task_rq. This ensures the
task moved out of its preferred node, either gets back to its preferred
node quickly or finds a new preferred node. Doing so, would be fair to
all tasks migrating across nodes.
Specjbb2005 results (8 warehouses)
Higher bops are better
2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 200668 203370 1.3465
1 321791 328431 2.06345
2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 204848 206070 0.59654
2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 188098 188386 0.153112
1 200351 201566 0.606436
4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 58145.9 59157.4 1.73959
1 103798 105495 1.63491
Some events stats before and after applying the patch.
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,912,183 13,825,492
migrations 1,155,931 1,152,509
faults 367,139 371,948
cache-misses 54,240,196,814 55,654,206,041
sched:sched_move_numa 1,571 1,856
sched:sched_stick_numa 9 4
sched:sched_swap_numa 463 428
migrate:mm_migrate_pages 703 898
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 50155 57146
numa_hint_faults_local 45264 51612
numa_hit 239652 238164
numa_huge_pte_updates 36 16
numa_interleave 68 63
numa_local 239576 238085
numa_other 76 79
numa_pages_migrated 680 883
numa_pte_updates 71146 67540
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,156,720 3,288,525
migrations 30,354 38,652
faults 97,261 111,678
cache-misses 12,400,026,826 12,111,197,376
sched:sched_move_numa 4 900
sched:sched_stick_numa 0 0
sched:sched_swap_numa 1 5
migrate:mm_migrate_pages 20 714
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 272 18572
numa_hint_faults_local 186 14850
numa_hit 71362 73197
numa_huge_pte_updates 0 11
numa_interleave 23 25
numa_local 71299 73138
numa_other 63 59
numa_pages_migrated 2 712
numa_pte_updates 0 24021
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,606,824 8,451,543
migrations 155,352 202,804
faults 301,409 310,024
cache-misses 157,759,224 253,522,507
sched:sched_move_numa 168 213
sched:sched_stick_numa 0 0
sched:sched_swap_numa 3 2
migrate:mm_migrate_pages 125 88
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 4650 11830
numa_hint_faults_local 3946 11301
numa_hit 90489 90038
numa_huge_pte_updates 0 0
numa_interleave 892 855
numa_local 90034 89796
numa_other 455 242
numa_pages_migrated 124 88
numa_pte_updates 4818 12039
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,113,167 2,049,153
migrations 10,533 11,405
faults 142,727 162,309
cache-misses 5,594,192 7,203,343
sched:sched_move_numa 10 22
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 6 1
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 744 1693
numa_hint_faults_local 584 1669
numa_hit 25551 25177
numa_huge_pte_updates 0 0
numa_interleave 263 194
numa_local 25302 24993
numa_other 249 184
numa_pages_migrated 6 1
numa_pte_updates 744 1577
perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 101,227,352 94,515,937
migrations 4,151,829 4,203,554
faults 745,233 832,697
cache-misses 224,669,561,766 226,248,698,331
sched:sched_move_numa 617 1,730
sched:sched_stick_numa 2 14
sched:sched_swap_numa 187 432
migrate:mm_migrate_pages 316 1,398
vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 24195 80079
numa_hint_faults_local 21639 68620
numa_hit 238331 241187
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 238331 241186
numa_other 0 1
numa_pages_migrated 204 1347
numa_pte_updates 24561 80729
perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 62,738,978 63,704,961
migrations 562,702 573,404
faults 228,465 230,878
cache-misses 75,778,067,952 76,568,222,781
sched:sched_move_numa 648 509
sched:sched_stick_numa 13 31
sched:sched_swap_numa 137 182
migrate:mm_migrate_pages 733 541
vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 10281 8501
numa_hint_faults_local 3242 2960
numa_hit 36338 35526
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 36338 35526
numa_other 0 0
numa_pages_migrated 706 539
numa_pte_updates 10176 8433
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This additional parameter (new_cpu) is used later for identifying if
task migration is across nodes.
No functional change.
Specjbb2005 results (8 warehouses)
Higher bops are better
2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 203353 200668 -1.32036
1 328205 321791 -1.95427
2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 214384 204848 -4.44809
2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 188553 188098 -0.241311
1 196273 200351 2.07772
4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 57581.2 58145.9 0.980702
1 103468 103798 0.318939
Brings out the variance between different specjbb2005 runs.
Some events stats before and after applying the patch.
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,941,377 13,912,183
migrations 1,157,323 1,155,931
faults 382,175 367,139
cache-misses 54,993,823,500 54,240,196,814
sched:sched_move_numa 2,005 1,571
sched:sched_stick_numa 14 9
sched:sched_swap_numa 529 463
migrate:mm_migrate_pages 1,573 703
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 67099 50155
numa_hint_faults_local 58456 45264
numa_hit 240416 239652
numa_huge_pte_updates 18 36
numa_interleave 65 68
numa_local 240339 239576
numa_other 77 76
numa_pages_migrated 1574 680
numa_pte_updates 77182 71146
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,176,453 3,156,720
migrations 30,238 30,354
faults 87,869 97,261
cache-misses 12,544,479,391 12,400,026,826
sched:sched_move_numa 23 4
sched:sched_stick_numa 0 0
sched:sched_swap_numa 6 1
migrate:mm_migrate_pages 10 20
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 236 272
numa_hint_faults_local 201 186
numa_hit 72293 71362
numa_huge_pte_updates 0 0
numa_interleave 26 23
numa_local 72233 71299
numa_other 60 63
numa_pages_migrated 8 2
numa_pte_updates 0 0
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,478,820 8,606,824
migrations 171,323 155,352
faults 307,499 301,409
cache-misses 240,353,599 157,759,224
sched:sched_move_numa 214 168
sched:sched_stick_numa 0 0
sched:sched_swap_numa 4 3
migrate:mm_migrate_pages 89 125
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 5301 4650
numa_hint_faults_local 4745 3946
numa_hit 92943 90489
numa_huge_pte_updates 0 0
numa_interleave 899 892
numa_local 92345 90034
numa_other 598 455
numa_pages_migrated 88 124
numa_pte_updates 5505 4818
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,066,172 2,113,167
migrations 11,076 10,533
faults 149,544 142,727
cache-misses 10,398,067 5,594,192
sched:sched_move_numa 43 10
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 6 6
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 3552 744
numa_hint_faults_local 3347 584
numa_hit 25611 25551
numa_huge_pte_updates 0 0
numa_interleave 213 263
numa_local 25583 25302
numa_other 28 249
numa_pages_migrated 6 6
numa_pte_updates 3535 744
perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 99,358,136 101,227,352
migrations 4,041,607 4,151,829
faults 749,653 745,233
cache-misses 225,562,543,251 224,669,561,766
sched:sched_move_numa 771 617
sched:sched_stick_numa 14 2
sched:sched_swap_numa 204 187
migrate:mm_migrate_pages 1,180 316
vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 27409 24195
numa_hint_faults_local 20677 21639
numa_hit 239988 238331
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 239983 238331
numa_other 5 0
numa_pages_migrated 1016 204
numa_pte_updates 27916 24561
perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 60,899,307 62,738,978
migrations 544,668 562,702
faults 270,834 228,465
cache-misses 74,543,455,635 75,778,067,952
sched:sched_move_numa 735 648
sched:sched_stick_numa 25 13
sched:sched_swap_numa 174 137
migrate:mm_migrate_pages 816 733
vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 11059 10281
numa_hint_faults_local 4733 3242
numa_hit 41384 36338
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 41383 36338
numa_other 1 0
numa_pages_migrated 815 706
numa_pte_updates 11323 10176
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Task migration under NUMA balancing can happen in parallel. More than
one task might choose to migrate to the same CPU at the same time. This
can result in:
- During task swap, choosing a task that was not part of the evaluation.
- During task swap, task which just got moved into its preferred node,
moving to a completely different node.
- During task swap, task failing to move to the preferred node, will have
to wait an extra interval for the next migrate opportunity.
- During task movement, multiple task movements can cause load imbalance.
This problem is more likely if there are more cores per node or more
nodes in the system.
Use a per run-queue variable to check if NUMA-balance is active on the
run-queue.
Specjbb2005 results (8 warehouses)
Higher bops are better
2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 200194 203353 1.57797
1 311331 328205 5.41995
2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 197654 214384 8.46429
2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 192605 188553 -2.10379
1 213402 196273 -8.02664
4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 52227.1 57581.2 10.2516
1 102529 103468 0.915838
There is a regression on power 9 box. If we look at the details,
that box has a sudden jump in cache-misses with this patch.
All other parameters seem to be pointing towards NUMA
consolidation.
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,345,784 13,941,377
migrations 1,127,820 1,157,323
faults 374,736 382,175
cache-misses 55,132,054,603 54,993,823,500
sched:sched_move_numa 1,923 2,005
sched:sched_stick_numa 52 14
sched:sched_swap_numa 595 529
migrate:mm_migrate_pages 1,932 1,573
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 60605 67099
numa_hint_faults_local 51804 58456
numa_hit 239945 240416
numa_huge_pte_updates 14 18
numa_interleave 60 65
numa_local 239865 240339
numa_other 80 77
numa_pages_migrated 1931 1574
numa_pte_updates 67823 77182
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,016,467 3,176,453
migrations 37,326 30,238
faults 115,342 87,869
cache-misses 11,692,155,554 12,544,479,391
sched:sched_move_numa 965 23
sched:sched_stick_numa 8 0
sched:sched_swap_numa 35 6
migrate:mm_migrate_pages 1,168 10
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 16286 236
numa_hint_faults_local 11863 201
numa_hit 112482 72293
numa_huge_pte_updates 33 0
numa_interleave 20 26
numa_local 112419 72233
numa_other 63 60
numa_pages_migrated 1144 8
numa_pte_updates 32859 0
perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,629,724 8,478,820
migrations 221,052 171,323
faults 308,661 307,499
cache-misses 135,574,913 240,353,599
sched:sched_move_numa 147 214
sched:sched_stick_numa 0 0
sched:sched_swap_numa 2 4
migrate:mm_migrate_pages 64 89
vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 11481 5301
numa_hint_faults_local 10968 4745
numa_hit 89773 92943
numa_huge_pte_updates 0 0
numa_interleave 1116 899
numa_local 89220 92345
numa_other 553 598
numa_pages_migrated 62 88
numa_pte_updates 11694 5505
perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,272,887 2,066,172
migrations 12,206 11,076
faults 163,704 149,544
cache-misses 4,801,186 10,398,067
sched:sched_move_numa 44 43
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 17 6
vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 2261 3552
numa_hint_faults_local 1993 3347
numa_hit 25726 25611
numa_huge_pte_updates 0 0
numa_interleave 239 213
numa_local 25498 25583
numa_other 228 28
numa_pages_migrated 17 6
numa_pte_updates 2266 3535
perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 117,980,962 99,358,136
migrations 3,950,220 4,041,607
faults 736,979 749,653
cache-misses 224,976,072,879 225,562,543,251
sched:sched_move_numa 504 771
sched:sched_stick_numa 50 14
sched:sched_swap_numa 239 204
migrate:mm_migrate_pages 1,260 1,180
vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 18293 27409
numa_hint_faults_local 11969 20677
numa_hit 240854 239988
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 240851 239983
numa_other 3 5
numa_pages_migrated 1190 1016
numa_pte_updates 18106 27916
perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 61,053,158 60,899,307
migrations 551,586 544,668
faults 244,174 270,834
cache-misses 74,326,766,973 74,543,455,635
sched:sched_move_numa 344 735
sched:sched_stick_numa 24 25
sched:sched_swap_numa 140 174
migrate:mm_migrate_pages 568 816
vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 6461 11059
numa_hint_faults_local 2283 4733
numa_hit 35661 41384
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 35661 41383
numa_other 0 1
numa_pages_migrated 568 815
numa_pte_updates 6518 11323
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Some of the scheduling tracepoints allow the perf_tp_event
code to write to ring buffer under different cpu than the
code is running on.
This results in corrupted ring buffer data demonstrated in
following perf commands:
# perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.383 [sec]
[ perf record: Woken up 8 times to write data ]
0x42b890 [0]: failed to process type: -1765585640
[ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ]
# perf report --stdio
0x42b890 [0]: failed to process type: -1765585640
The reason for the corruption are some of the scheduling tracepoints,
that have __perf_task dfined and thus allow to store data to another
cpu ring buffer:
sched_waking
sched_wakeup
sched_wakeup_new
sched_stat_wait
sched_stat_sleep
sched_stat_iowait
sched_stat_blocked
The perf_tp_event function first store samples for current cpu
related events defined for tracepoint:
hlist_for_each_entry_rcu(event, head, hlist_entry)
perf_swevent_event(event, count, &data, regs);
And then iterates events of the 'task' and store the sample
for any task's event that passes tracepoint checks:
ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
if (event->attr.type != PERF_TYPE_TRACEPOINT)
continue;
if (event->attr.config != entry->type)
continue;
perf_swevent_event(event, count, &data, regs);
}
Above code can race with same code running on another cpu,
ending up with 2 cpus trying to store under the same ring
buffer, which is specifically not allowed.
This patch prevents the problem, by allowing only events with the same
current cpu to receive the event.
NOTE: this requires the use of (per-task-)per-cpu buffers for this
feature to work; perf-record does this.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
[peterz: small edits to Changelog]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: e6dab5ffab59 ("perf/trace: Add ability to set a target task for events")
Link: http://lkml.kernel.org/r/20180923161343.GB15054@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
When we unregister a PMU, we fail to serialize the @pmu_idr properly.
Fix that by doing the entire thing under pmu_lock.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 2e80a82a49c4 ("perf: Dynamic pmu types")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Commit 19483677684b ("jump_label: Annotate entries that operate on
__init code earlier") refactored the code that manages runtime
patching of jump labels in modules that are tied to static keys
defined in other modules or in the core kernel.
In the latter case, we may iterate over the static_key_mod linked
list until we hit the entry for the core kernel, whose 'mod' field
will be NULL, and attempt to dereference it to get at its 'state'
member.
So let's add a non-NULL check: this forces the 'init' argument of
__jump_label_update() to false for static keys that are defined in
the core kernel, which is appropriate given that __init annotated
jump_label entries in the core kernel should no longer be active
at this point (i.e., when loading modules).
Fixes: 19483677684b ("jump_label: Annotate entries that operate on ...")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20181001081324.11553-1-ard.biesheuvel@linaro.org
|
|
Previously, on typical consumer laptops, pressing a key on the keyboard
when the system is in suspend would cause it to wake up (default or
unconditional behaviour). This happens because the EC generates a SCI
interrupt in this scenario.
That is no longer true on modern laptops based on Intel WhiskeyLake,
including Acer Swift SF314-55G, Asus UX333FA, Asus UX433FN and Asus
UX533FD. We confirmed with Asus EC engineers that the "Modern Standby"
design has been modified so that the EC no longer generates a SCI
in this case; the keyboard controller itself should be used for wakeup.
In order to retain the standard behaviour of being able to use the
keyboard to wake up the system, enable serio wakeups by default on
platforms that are using s2idle.
Link: https://lkml.kernel.org/r/CAB4CAwfQ0mPMqCLp95TVjw4J0r5zKPWkSvvkK4cpZUGE--w8bQ@mail.gmail.com
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Daniel Drake <drake@endlessm.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
Merge -rc6 in, for two reasons:
1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
they aren't in the 4.20 branch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* tag 'v4.19-rc6': (780 commits)
Linux 4.19-rc6
MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
cpufreq: qcom-kryo: Fix section annotations
perf/core: Add sanity check to deal with pinned event failure
xen/blkfront: correct purging of persistent grants
Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
selftests/powerpc: Fix Makefiles for headers_install change
blk-mq: I/O and timer unplugs are inverted in blktrace
dax: Fix deadlock in dax_lock_mapping_entry()
x86/boot: Fix kexec booting failure in the SEV bit detection code
bcache: add separate workqueue for journal_write to avoid deadlock
drm/amd/display: Fix Edid emulation for linux
drm/amd/display: Fix Vega10 lightup on S3 resume
drm/amdgpu: Fix vce work queue was not cancelled when suspend
Revert "drm/panel: Add device_link from panel device to DRM device"
xen/blkfront: When purging persistent grants, keep them in the buffer
clocksource/drivers/timer-atmel-pit: Properly handle error cases
block: fix deadline elevator drain for zoned block devices
ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This way an architecture with less than 4G of RAM can support dma_mask
smaller than 32-bit without a ZONE_DMA. Apparently that is a common
case on powerpc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
|
|
Instead of rejecting devices with a too small bus_dma_mask we can handle
by taking the bus dma_mask into account for allocations and bounce
buffering decisions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
We need to take the DMA offset and encryption bit into account when
selecting a zone. User the opportunity to factor out the zone
selection into a helper for reuse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
|
|
This is somewhat modelled after the powerpc version, and differs from
the legacy fallback in use fls64 instead of pointlessly splitting up the
address into low and high dwords and in that it takes (__)phys_to_dma
into account.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
|
|
Explicitly forbid creating map of per-cpu cgroup local storages.
This behavior matches the behavior of shared cgroup storages.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
This commit introduced per-cpu cgroup local storage.
Per-cpu cgroup local storage is very similar to simple cgroup storage
(let's call it shared), except all the data is per-cpu.
The main goal of per-cpu variant is to implement super fast
counters (e.g. packet counters), which don't require neither
lookups, neither atomic operations.
>From userspace's point of view, accessing a per-cpu cgroup storage
is similar to other per-cpu map types (e.g. per-cpu hashmaps and
arrays).
Writing to a per-cpu cgroup storage is not atomic, but is performed
by copying longs, so some minimal atomicity is here, exactly
as with other per-cpu maps.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
To simplify the following introduction of per-cpu cgroup storage,
let's rework a bit a mechanism of passing a pointer to a cgroup
storage into the bpf_get_local_storage(). Let's save a pointer
to the corresponding bpf_cgroup_storage structure, instead of
a pointer to the actual buffer.
It will help us to handle per-cpu storage later, which has
a different way of accessing to the actual data.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
In order to introduce per-cpu cgroup storage, let's generalize
bpf cgroup core to support multiple cgroup storage types.
Potentially, per-node cgroup storage can be added later.
This commit is mostly a formal change that replaces
cgroup_storage pointer with a array of cgroup_storage pointers.
It doesn't actually introduce a new storage type,
it will be done later.
Each bpf program is now able to have one cgroup storage of each type.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
The sigaltstack(2) system call fails with -ENOMEM if the new alternative
signal stack is found to be smaller than SIGMINSTKSZ. On architectures
such as arm64, where the native value for SIGMINSTKSZ is larger than
the compat value, this can result in an unexpected error being reported
to a compat task. See, for example:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904385
This patch fixes the problem by extending do_sigaltstack to take the
minimum signal stack size as an additional parameter, allowing the
native and compat system call entry code to pass in their respective
values. COMPAT_SIGMINSTKSZ is just defined as SIGMINSTKSZ if it has not
been defined by the architecture.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Steve McIntyre <steve.mcintyre@arm.com>
Tested-by: Steve McIntyre <93sam@debian.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
On a DT based system, we use the of_node full name to name the
corresponding irq domain. We expect that name to be unique, so so that
domains with the same base name won't clash (this happens on multi-node
topologies, for example).
Since a7e4cfb0a7ca ("of/fdt: only store the device node basename in
full_name"), of_node_full_name() lies and only returns the basename. This
breaks the above requirement, and we end-up with only a subset of the
domains in /sys/kernel/debug/irq/domains.
Let's reinstate the feature by using the fancy new %pOF format specifier,
which happens to do the right thing.
Fixes: a7e4cfb0a7ca ("of/fdt: only store the device node basename in full_name")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20181001100522.180054-3-marc.zyngier@arm.com
|
|
When removing a debugfs file for a given irq domain, we fail to clear the
corresponding field, meaning that the corresponding domain won't be created
again if we need to do so.
It turns out that this is exactly what irq_domain_update_bus_token does
(delete old file, update domain name, recreate file).
This doesn't have any impact other than making debug more difficult, but we
do value ease of debugging... So clear the debugfs_file field.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20181001100522.180054-2-marc.zyngier@arm.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Thomas writes:
"A single fix for a missing sanity check when a pinned event is tried
to be read on the wrong CPU due to a legit event scheduling failure."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Add sanity check to deal with pinned event failure
|
|
It is possible that a failure can occur during the scheduling of a
pinned event. The initial portion of perf_event_read_local() contains
the various error checks an event should pass before it can be
considered valid. Ensure that the potential scheduling failure
of a pinned event is checked for and have a credible error.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: acme@kernel.org
Cc: gavin.hindman@intel.com
Cc: jithu.joseph@intel.com
Cc: dave.hansen@intel.com
Cc: hpa@zytor.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/6486385d1f30336e9973b24c8c65f5079543d3d3.1537377064.git.reinette.chatre@intel.com
|
|
tick_device_is_functional() is called early in tick_broadcast_control(), so
no need to call it again later.
Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@gmail.com
Link: https://lkml.kernel.org/r/1538150608-2599-1-git-send-email-penghao122@sina.com.cn
|
|
cgroup_storage_update_elem() shouldn't accept any flags
argument values except BPF_ANY and BPF_EXIST to guarantee
the backward compatibility, had a new flag value been added.
Fixes: de9cbbaadba5 ("bpf: introduce cgroup storage maps")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Currently, helper bpf_get_current_cgroup_id() is not permitted
for CGROUP_DEVICE type of programs. If the helper is used
in such cases, the verifier will log the following error:
0: (bf) r6 = r1
1: (69) r7 = *(u16 *)(r6 +0)
2: (85) call bpf_get_current_cgroup_id#80
unknown func bpf_get_current_cgroup_id#80
The bpf_get_current_cgroup_id() is useful for CGROUP_DEVICE
type of programs in order to customize action based on cgroup id.
This patch added such a support.
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
The __jump_table sections emitted into the core kernel and into
each module consist of statically initialized references into
other parts of the code, and with the exception of entries that
point into init code, which are defused at post-init time, these
data structures are never modified.
So let's move them into the ro_after_init section, to prevent them
from being corrupted inadvertently by buggy code, or deliberately
by an attacker.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Link: https://lkml.kernel.org/r/20180919065144.25010-9-ard.biesheuvel@linaro.org
|
|
Jump table entries are mostly read-only, with the exception of the
init and module loader code that defuses entries that point into init
code when the code being referred to is freed.
For robustness, it would be better to move these entries into the
ro_after_init section, but clearing the 'code' member of each jump
table entry referring to init code at module load time races with the
module_enable_ro() call that remaps the ro_after_init section read
only, so we'd like to do it earlier.
So given that whether such an entry refers to init code can be decided
much earlier, we can pull this check forward. Since we may still need
the code entry at this point, let's switch to setting a low bit in the
'key' member just like we do to annotate the default state of a jump
table entry.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jessica Yu <jeyu@kernel.org>
Link: https://lkml.kernel.org/r/20180919065144.25010-8-ard.biesheuvel@linaro.org
|
|
To reduce the size taken up by absolute references in jump label
entries themselves and the associated relocation records in the
.init segment, add support for emitting them as relative references
instead.
Note that this requires some extra care in the sorting routine, given
that the offsets change when entries are moved around in the jump_entry
table.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jessica Yu <jeyu@kernel.org>
Link: https://lkml.kernel.org/r/20180919065144.25010-3-ard.biesheuvel@linaro.org
|
|
In preparation of allowing architectures to use relative references
in jump_label entries [which can dramatically reduce the memory
footprint], introduce abstractions for references to the 'code' and
'key' members of struct jump_entry.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jessica Yu <jeyu@kernel.org>
Link: https://lkml.kernel.org/r/20180919065144.25010-2-ard.biesheuvel@linaro.org
|
|
STIBP is a feature provided by certain Intel ucodes / CPUs. This feature
(once enabled) prevents cross-hyperthread control of decisions made by
indirect branch predictors.
Enable this feature if
- the CPU is vulnerable to spectre v2
- the CPU supports SMT and has SMT siblings online
- spectre_v2 mitigation autoselection is enabled (default)
After some previous discussion, this leaves STIBP on all the time, as wrmsr
on crossing kernel boundary is a no-no. This could perhaps later be a bit
more optimized (like disabling it in NOHZ, experiment with disabling it in
idle, etc) if needed.
Note that the synchronization of the mask manipulation via newly added
spec_ctrl_mutex is currently not strictly needed, as the only updater is
already being serialized by cpu_add_remove_lock, but let's make this a
little bit more future-proof.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "WoodhouseDavid" <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: "SchauflerCasey" <casey.schaufler@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438240.15880@cbobk.fhfr.pm
|
|
Currently, IBPB is only issued in cases when switching into a non-dumpable
process, the rationale being to protect such 'important and security
sensitive' processess (such as GPG) from data leaking into a different
userspace process via spectre v2.
This is however completely insufficient to provide proper userspace-to-userpace
spectrev2 protection, as any process can poison branch buffers before being
scheduled out, and the newly scheduled process immediately becomes spectrev2
victim.
In order to minimize the performance impact (for usecases that do require
spectrev2 protection), issue the barrier only in cases when switching between
processess where the victim can't be ptraced by the potential attacker (as in
such cases, the attacker doesn't have to bother with branch buffers at all).
[ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and
PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably
fine-grained ]
Fixes: 18bf3c3ea8 ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch")
Originally-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "WoodhouseDavid" <dwmw@amazon.co.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "SchauflerCasey" <casey.schaufler@intel.com>
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251437340.15880@cbobk.fhfr.pm
|
|
This is the only location on kernel that has wrong spelling
of the container_of() helper. Fix it.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
|
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-09-25
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Allow for RX stack hardening by implementing the kernel's flow
dissector in BPF. Idea was originally presented at netconf 2017 [0].
Quote from merge commit:
[...] Because of the rigorous checks of the BPF verifier, this
provides significant security guarantees. In particular, the BPF
flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate.
It cannot read outside of packet bounds, because all memory accesses
are checked. Also, with BPF the administrator can decide which
protocols to support, reducing potential attack surface. Rarely
encountered protocols can be excluded from dissection and the
program can be updated without kernel recompile or reboot if a
bug is discovered. [...]
Also, a sample flow dissector has been implemented in BPF as part
of this work, from Petar and Willem.
[0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
2) Add support for bpftool to list currently active attachment
points of BPF networking programs providing a quick overview
similar to bpftool's perf subcommand, from Yonghong.
3) Fix a verifier pruning instability bug where a union member
from the register state was not cleared properly leading to
branches not being pruned despite them being valid candidates,
from Alexei.
4) Various smaller fast-path optimizations in XDP's map redirect
code, from Jesper.
5) Enable to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps
in bpftool, from Roman.
6) Remove a duplicate check in libbpf that probes for function
storage, from Taeung.
7) Fix an issue in test_progs by avoid checking for errno since
on success its value should not be checked, from Mauricio.
8) Fix unused variable warning in bpf_getsockopt() helper when
CONFIG_INET is not configured, from Anders.
9) Fix a compilation failure in the BPF sample code's use of
bpf_flow_keys, from Prashant.
10) Minor cleanups in BPF code, from Yue and Zhong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The patch adding the infrastructure failed to actually add the symbol
declaration, oops..
Fixes: faef87723a ("dma-noncoherent: add a arch_sync_dma_for_cpu_all hook")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Paul Burton <paul.burton@mips.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Version bump conflict in batman-adv, take what's in net-next.
iavf conflict, adjustment of netdev_ops in net-next conflicting
with poll controller method removal in net.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
perf test improvements:
- Add watchpoint entry (Ravi Bangoria)
Build fixes:
- Initialize perf_data_file fd field to fix building the CTF (trace format)
converter with with gcc 4.8.4 on Ubuntu 14.04 (Jérémie Galarneau)
- Use -Wno-redundant-decls to build with PYTHON=python3 to
build the python binding, fixing the build in systems such
as Clear Linux (Arnaldo Carvalho de Melo)
Hardware tracing improvements:
- Suppress AUX/OVERWRITE records (Alexander Shishkin)
Infrastructure changes:
- Adopt PTR_ERR_OR_ZERO from the kernel and use it in
the bpf-loader instead of open coded equivalent (Ding Xiang)
- Improve the event ordering code to make it clear and fix
a bug related to freeing of events when using pipe mode
from 'record' to 'inject' (Jiri Olsa)
- Some prep work to facilitate per-cpu threads to write
record data to per-cpu files (Jiri Olsa)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Dave writes:
"Networking fixes:
1) Fix multiqueue handling of coalesce timer in stmmac, from Jose
Abreu.
2) Fix memory corruption in NFC, from Suren Baghdasaryan.
3) Don't write reserved bits in ravb driver, from Kazuya Mizuguchi.
4) SMC bug fixes from Karsten Graul, YueHaibing, and Ursula Braun.
5) Fix TX done race in mvpp2, from Antoine Tenart.
6) ipv6 metrics leak, from Wei Wang.
7) Adjust firmware version requirements in mlxsw, from Petr Machata.
8) Fix autonegotiation on resume in r8169, from Heiner Kallweit.
9) Fixed missing entries when dumping /proc/net/if_inet6, from Jeff
Barnhill.
10) Fix double free in devlink, from Dan Carpenter.
11) Fix ethtool regression from UFO feature removal, from Maciej
Żenczykowski.
12) Fix drivers that have a ndo_poll_controller() that captures the
cpu entirely on loaded hosts by trying to drain all rx and tx
queues, from Eric Dumazet.
13) Fix memory corruption with jumbo frames in aquantia driver, from
Friedemann Gerold."
* gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net: (79 commits)
net: mvneta: fix the remaining Rx descriptor unmapping issues
ip_tunnel: be careful when accessing the inner header
mpls: allow routes on ip6gre devices
net: aquantia: memory corruption on jumbo frames
tun: remove ndo_poll_controller
nfp: remove ndo_poll_controller
bnxt: remove ndo_poll_controller
bnx2x: remove ndo_poll_controller
mlx5: remove ndo_poll_controller
mlx4: remove ndo_poll_controller
i40evf: remove ndo_poll_controller
ice: remove ndo_poll_controller
igb: remove ndo_poll_controller
ixgb: remove ndo_poll_controller
fm10k: remove ndo_poll_controller
ixgbevf: remove ndo_poll_controller
ixgbe: remove ndo_poll_controller
bonding: use netpoll_poll_dev() helper
netpoll: make ndo_poll_controller() optional
rds: Fix build regression.
...
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
We assume to have only one reference counter for one uprobe.
Don't allow user to add multiple trace_uprobe entries having
same inode+offset but different reference counter.
Ex,
# echo "p:sdt_tick/loop2 /home/ravi/tick:0x6e4(0x10036)" > uprobe_events
# echo "p:sdt_tick/loop2_1 /home/ravi/tick:0x6e4(0xfffff)" >> uprobe_events
bash: echo: write error: Invalid argument
# dmesg
trace_kprobe: Reference counter offset mismatch.
There is one exception though:
When user is trying to replace the old entry with the new
one, we allow this if the new entry does not conflict with
any other existing entries.
Link: http://lkml.kernel.org/r/20180820044250.11659-4-ravi.bangoria@linux.ibm.com
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
We assume to have only one reference counter for one uprobe.
Don't allow user to register multiple uprobes having same
inode+offset but different reference counter.
Link: http://lkml.kernel.org/r/20180820044250.11659-3-ravi.bangoria@linux.ibm.com
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc
have these markers embedded in them. These markers are added by developer
at important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. In case the overhead is more, execution of it can be
omitted by runtime if() condition when no one is tracing on the marker:
if (reference_counter > 0) {
Execute marker instructions;
}
Default value of reference counter is 0. Tracer has to increment the
reference counter before tracing on a marker and decrement it when
done with the tracing.
Implement the reference counter logic in core uprobe. User will be
able to use it from trace_uprobe as well as from kernel module. New
trace_uprobe definition with reference counter will now be:
<path>:<offset>[(ref_ctr_offset)]
where ref_ctr_offset is an optional field. For kernel module, new
variant of uprobe_register() has been introduced:
uprobe_register_refctr(inode, offset, ref_ctr_offset, consumer)
No new variant for uprobe_unregister() because it's assumed to have
only one reference counter for one uprobe.
[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
Note: 'reference counter' is called as 'semaphore' in original Dtrace
(or Systemtap, bcc and even in ELF) documentation and code. But the
term 'semaphore' is misleading in this context. This is just a counter
used to hold number of tracers tracing on a marker. This is not really
used for any synchronization. So we are calling it a 'reference counter'
in kernel / perf code.
Link: http://lkml.kernel.org/r/20180820044250.11659-2-ravi.bangoria@linux.ibm.com
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
[Only trace_uprobe.c]
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
By utilizing a temporary variable, we can avoid adding another call to
strchr(). Instead, save the first call to a temp variable, and then use that
variable as the reference to set the event variable.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
The previous patch in this series removed carrying around a pointer to
the css in blkg. However, the blkg association logic still relied on
taking a reference on the css to ensure we wouldn't fail in getting a
reference for the blkg.
Here the implicit dependency on the css is removed. The association
continues to rely on the tryget logic walking up the blkg tree. This
streamlines the three ways that association can happen: normal, swap,
and writeback.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Prior patches ensured that all bios are now associated with some blkg.
This now makes bio->bi_css unnecessary as blkg maintains a reference to
the blkcg already.
This patch removes the field bi_css and transfers corresponding uses to
access via bi_blkg.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It is possible (via shutdown()) for TCP socks to go trough TCP_CLOSE
state via tcp_disconnect() without actually calling tcp_close which
would then call our bpf_tcp_close() callback. Because of this a user
could disconnect a socket then put it in a LISTEN state which would
break our assumptions about sockets always being ESTABLISHED state.
To resolve this rely on the unhash hook, which is called in the
disconnect case, to remove the sock from the sockmap.
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|