Age | Commit message (Collapse) | Author |
|
commit f66c0447cca1281116224d474cdb37d6a18e4b5b upstream.
Set the unoptimized flag after confirming the code is completely
unoptimized. Without this fix, when a kprobe hits the intermediate
modified instruction (the first byte is replaced by an INT3, but
later bytes can still be a jump address operand) while unoptimizing,
it can return to the middle byte of the modified code, which causes
an invalid instruction exception in the kernel.
Usually, this is a rare case, but if we put a probe on the function
call while text patching, it always causes a kernel panic as below:
# echo p text_poke+5 > kprobe_events
# echo 1 > events/kprobes/enable
# echo 0 > events/kprobes/enable
invalid opcode: 0000 [#1] PREEMPT SMP PTI
RIP: 0010:text_poke+0x9/0x50
Call Trace:
arch_unoptimize_kprobe+0x22/0x28
arch_unoptimize_kprobes+0x39/0x87
kprobe_optimizer+0x6e/0x290
process_one_work+0x2a0/0x610
worker_thread+0x28/0x3d0
? process_one_work+0x610/0x610
kthread+0x10d/0x130
? kthread_park+0x80/0x80
ret_from_fork+0x3a/0x50
text_poke() is used for patching the code in optprobes.
This can happen even if we blacklist text_poke() and other functions,
because there is a small time window during which we show the intermediate
code to other CPUs.
[ mingo: Edited the changelog. ]
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
Link: https://lkml.kernel.org/r/157483422375.25881.13508326028469515760.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 60588bfa223ff675b95f866249f90616613fbe31 upstream.
select_idle_cpu() will scan the LLC domain for idle CPUs,
it's always expensive. so the next commit :
1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")
introduces a way to limit how many CPUs we scan.
But it consume some CPUs out of 'nr' that are not allowed
for the task and thus waste our attempts. The function
always return nr_cpumask_bits, and we can't find a CPU
which our task is allowed to run.
Cpumask may be too big, similar to select_idle_core(), use
per_cpu_ptr 'select_idle_mask' to prevent stack overflow.
Fixes: 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 78041c0c9e935d9ce4086feeff6c569ed88ddfd4 upstream.
The tracing seftests checks various aspects of the tracing infrastructure,
and one is filtering. If trace_printk() is active during a self test, it can
cause the filtering to fail, which will disable that part of the trace.
To keep the selftests from failing because of trace_printk() calls,
trace_printk() checks the variable tracing_selftest_running, and if set, it
does not write to the tracing buffer.
As some tracers were registered earlier in boot, the selftest they triggered
would fail because not all the infrastructure was set up for the full
selftest. Thus, some of the tests were post poned to when their
infrastructure was ready (namely file system code). The postpone code did
not set the tracing_seftest_running variable, and could fail if a
trace_printk() was added and executed during their run.
Cc: stable@vger.kernel.org
Fixes: 9afecfbb95198 ("tracing: Postpone tracer start-up tests till the system is more robust")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 756125289285f6e55a03861bf4b6257aa3d19a93 upstream.
This patch ensures that we always check the netlink payload length
in audit_receive_msg() before we take any action on the payload
itself.
Cc: stable@vger.kernel.org
Reported-by: syzbot+399c44bf1f43b8747403@syzkaller.appspotmail.com
Reported-by: syzbot+e4b12d8d202701f08b6d@syzkaller.appspotmail.com
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2ad3e17ebf94b7b7f3f64c050ff168f9915345eb upstream.
Commit 219ca39427bf ("audit: use union for audit_field values since
they are mutually exclusive") combined a number of separate fields in
the audit_field struct into a single union. Generally this worked
just fine because they are generally mutually exclusive.
Unfortunately in audit_data_to_entry() the overlap can be a problem
when a specific error case is triggered that causes the error path
code to attempt to cleanup an audit_field struct and the cleanup
involves attempting to free a stored LSM string (the lsm_str field).
Currently the code always has a non-NULL value in the
audit_field.lsm_str field as the top of the for-loop transfers a
value into audit_field.val (both .lsm_str and .val are part of the
same union); if audit_data_to_entry() fails and the audit_field
struct is specified to contain a LSM string, but the
audit_field.lsm_str has not yet been properly set, the error handling
code will attempt to free the bogus audit_field.lsm_str value that
was set with audit_field.val at the top of the for-loop.
This patch corrects this by ensuring that the audit_field.val is only
set when needed (it is cleared when the audit_field struct is
allocated with kcalloc()). It also corrects a few other issues to
ensure that in case of error the proper error code is returned.
Cc: stable@vger.kernel.org
Fixes: 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive")
Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 2a4b03ffc69f2dedc6388e9a6438b5f4c133a40d ]
When a running task is moved on a throttled task group and there is no
other task enqueued on the CPU, the task can keep running using 100% CPU
whatever the allocated bandwidth for the group and although its cfs rq is
throttled. Furthermore, the group entity of the cfs_rq and its parents are
not enqueued but only set as curr on their respective cfs_rqs.
We have the following sequence:
sched_move_task
-dequeue_task: dequeue task and group_entities.
-put_prev_task: put task and group entities.
-sched_change_group: move task to new group.
-enqueue_task: enqueue only task but not group entities because cfs_rq is
throttled.
-set_next_task : set task and group_entities as current sched_entity of
their cfs_rq.
Another impact is that the root cfs_rq runnable_load_avg at root rq stays
null because the group_entities are not enqueued. This situation will stay
the same until an "external" event triggers a reschedule. Let trigger it
immediately instead.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ebc0f83c78a2d26384401ecf2d2fa48063c0ee27 ]
The way loadavg is tracked during nohz only pays attention to the load
upon entering nohz. This can be particularly noticeable if full nohz is
entered while non-idle, and then the cpu goes idle and stays that way for
a long time.
Use the remote tick to ensure that full nohz cpus report their deltas
within a reasonable time.
[ swood: Added changelog and removed recheck of stopped tick. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 488603b815a7514c7009e6fc339d74ed4a30f343 ]
This will be used in the next patch to get a loadavg update from
nohz cpus. The delta check is skipped because idle_sched_class
doesn't update se.exec_start.
Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1578736419-14628-2-git-send-email-swood@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit e20d3a055a457a10a4c748ce5b7c2ed3173a1324 upstream.
This if guards whether user-space wants a copy of the offload-jited
bytecode and whether this bytecode exists. By erroneously doing a bitwise
AND instead of a logical AND on user- and kernel-space buffer-size can lead
to no data being copied to user-space especially when user-space size is a
power of two and bigger then the kernel-space buffer.
Fixes: fcfb126defda ("bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info")
Signed-off-by: Johannes Krude <johannes@krude.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20200212193227.GA3769@phlox.h.transitiv.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cba6437a1854fde5934098ec3bd0ee83af3129f5 upstream.
Qian Cai reported that the WARN_ON() in the x86/msi affinity setting code,
which catches cases where the affinity setting is not done on the CPU which
is the current target of the interrupt, triggers during CPU hotplug stress
testing.
It turns out that the warning which was added with the commit addressing
the MSI affinity race unearthed yet another long standing bug.
If user space writes a bogus affinity mask, i.e. it contains no online CPUs,
then it calls irq_select_affinity_usr(). This was introduced for ALPHA in
eee45269b0f5 ("[PATCH] Alpha: convert to generic irq framework (generic part)")
and subsequently made available for all architectures in
18404756765c ("genirq: Expose default irq affinity mask (take 3)")
which introduced the circumvention of the affinity setting restrictions for
interrupt which cannot be moved in process context.
The whole exercise is bogus in various aspects:
1) If the interrupt is already started up then there is absolutely
no point to honour a bogus interrupt affinity setting from user
space. The interrupt is already assigned to an online CPU and it
does not make any sense to reassign it to some other randomly
chosen online CPU.
2) If the interupt is not yet started up then there is no point
either. A subsequent startup of the interrupt will invoke
irq_setup_affinity() anyway which will chose a valid target CPU.
So the only correct solution is to just return -EINVAL in case user space
wrote an affinity mask which does not contain any online CPUs, except for
ALPHA which has it's own magic sauce for this.
Fixes: 18404756765c ("genirq: Expose default irq affinity mask (take 3)")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qian Cai <cai@lca.pw>
Link: https://lkml.kernel.org/r/878sl8xdbm.fsf@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6fcca0fa48118e6d63733eb4644c6cd880c15b8f upstream.
Issuing write() with count parameter set to 0 on any file under
/proc/pressure/ will cause an OOB write because of the access to
buf[buf_size-1] when NUL-termination is performed. Fix this by checking
for buf_size to be non-zero.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 6722b23e7a2ace078344064a9735fb73e554e9ef ]
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Without patch:
# dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
# Available triggers:
# traceon traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
6+1 records in
6+1 records out
206 bytes copied, 0.00027916 s, 738 kB/s
Notice the printing of "# Available triggers:..." after the line.
With the patch:
# dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
2+1 records in
2+1 records out
88 bytes copied, 0.000526867 s, 167 kB/s
It only prints the end of the file, and does not restart.
Link: http://lkml.kernel.org/r/3c35ee24-dd3a-8119-9c19-552ed253388a@virtuozzo.com
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e4075e8bdffd93a9b6d6e1d52fabedceeca5a91b ]
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Without patch:
# dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
id
no pid
2+1 records in
2+1 records out
10 bytes copied, 0.000213285 s, 46.9 kB/s
Notice the "id" followed by "no pid".
With the patch:
# dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
id
0+1 records in
0+1 records out
3 bytes copied, 0.000202112 s, 14.8 kB/s
Notice that it only prints "id" and not the "no pid" afterward.
Link: http://lkml.kernel.org/r/4f87c6ad-f114-30bb-8506-c32274ce2992@virtuozzo.com
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 90435a7891a2259b0f74c5a1bc5600d0d64cba8f ]
If seq_file .next fuction does not change position index,
read after some lseek can generate an unexpected output.
See also: https://bugzilla.kernel.org/show_bug.cgi?id=206283
v1 -> v2: removed missed increment in end of function
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/eca84fdd-c374-a154-d874-6c7b55fc3bc4@virtuozzo.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c79108bd19a8490315847e0c95ac6526fcd8e770 ]
The alarmtimer_suspend() function will fail if an RTC device is on a bus
such as SPI or i2c and that RTC device registers and probes after
alarmtimer_init() registers and probes the 'alarmtimer' platform device.
This is because system wide suspend suspends devices in the reverse order
of their probe. When alarmtimer_suspend() attempts to program the RTC for a
wakeup it will try to program an RTC device on a bus that has already been
suspended.
Move the alarmtimer device registration to happen when the RTC which is
used for wakeup is registered. Register the 'alarmtimer' platform device as
a child of the RTC device too, so that it can be guaranteed that the RTC
device won't be suspended when alarmtimer_suspend() is called.
Reported-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200124055849.154411-2-swboyd@chromium.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
info->mod->name
[ Upstream commit 708e0ada1916be765b7faa58854062f2bc620bbf ]
In setup_load_info(), info->name (which contains the name of the module,
mostly used for early logging purposes before the module gets set up)
gets unconditionally assigned if .modinfo is missing despite the fact
that there is an if (!info->name) check near the end of the function.
Avoid assigning a placeholder string to info->name if .modinfo doesn't
exist, so that we can fall back to info->mod->name later on.
Fixes: 5fdc7db6448a ("module: setup load info before module_sig_check()")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 11e31f608b499f044f24b20be73f1dcab3e43f8a ]
Robert reported that during boot the watchdog timestamp is set to 0 for one
second which is the indicator for a watchdog reset.
The reason for this is that the timestamp is in seconds and the time is
taken from sched clock and divided by ~1e9. sched clock starts at 0 which
means that for the first second during boot the watchdog timestamp is 0,
i.e. reset.
Use ULONG_MAX as the reset indicator value so the watchdog works correctly
right from the start. ULONG_MAX would only conflict with a real timestamp
if the system reaches an uptime of 136 years on 32bit and almost eternity
on 64bit.
Reported-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87o8v3uuzl.fsf@nanos.tec.linutronix.de
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ccf74128d66ce937876184ad55db2e0276af08d3 ]
topology.c::get_group() relies on the assumption that non-NUMA domains do
not partially overlap. Zeng Tao pointed out in [1] that such topology
descriptions, while completely bogus, can end up being exposed to the
scheduler.
In his example (8 CPUs, 2-node system), we end up with:
MC span for CPU3 == 3-7
MC span for CPU4 == 4-7
The first pass through get_group(3, sdd@MC) will result in the following
sched_group list:
3 -> 4 -> 5 -> 6 -> 7
^ /
`----------------'
And a later pass through get_group(4, sdd@MC) will "corrupt" that to:
3 -> 4 -> 5 -> 6 -> 7
^ /
`-----------'
which will completely break things like 'while (sg != sd->groups)' when
using CPU3's base sched_domain.
There already are some architecture-specific checks in place such as
x86/kernel/smpboot.c::topology.sane(), but this is something we can detect
in the core scheduler, so it seems worthwhile to do so.
Warn and abort the construction of the sched domains if such a broken
topology description is detected. Note that this is somewhat
expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
gated under SCHED_DEBUG if deemed necessary.
Testing
=======
Dietmar managed to reproduce this using the following qemu incantation:
$ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
-append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1
alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
needed if '-smp cores=X, sockets=Y' would work with qemu):
8<---
@@ -465,6 +465,9 @@ void update_siblings_masks(unsigned int cpuid)
if (cpuid_topo->package_id != cpu_topo->package_id)
continue;
+ if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
+ continue;
+
cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
8<---
[1]: https://lkml.kernel.org/r/1577088979-8545-1-git-send-email-prime.zeng@hisilicon.com
Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dcd6dffb0a75741471297724640733fa4e958d72 ]
rq::uclamp is an array of struct uclamp_rq, make sure we clear the
whole thing.
Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcountinga")
Signed-off-by: Li Guanglei <guanglei.li@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 894c9ef9780c5cf2f143415e867ee39a33ecb75d ]
Configuring an instance's parallel mask without any online CPUs...
echo 2 > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
echo 0 > /sys/devices/system/cpu/cpu1/online
...makes tcrypt mode=215 crash like this:
divide error: 0000 [#1] SMP PTI
CPU: 4 PID: 283 Comm: modprobe Not tainted 5.4.0-rc8-padata-doc-v2+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191013_105130-anatol 04/01/2014
RIP: 0010:padata_do_parallel+0x114/0x300
Call Trace:
pcrypt_aead_encrypt+0xc0/0xd0 [pcrypt]
crypto_aead_encrypt+0x1f/0x30
do_mult_aead_op+0x4e/0xdf [tcrypt]
test_mb_aead_speed.constprop.0.cold+0x226/0x564 [tcrypt]
do_test+0x28c2/0x4d49 [tcrypt]
tcrypt_mod_init+0x55/0x1000 [tcrypt]
...
cpumask_weight() in padata_cpu_hash() returns 0 because the mask has no
CPUs. The problem is __padata_remove_cpu() checks for valid masks too
early and so doesn't mark the instance PADATA_INVALID as expected, which
would have made padata_do_parallel() return error before doing the
division.
Fix by introducing a second padata CPU hotplug state before
CPUHP_BRINGUP_CPU so that __padata_remove_cpu() sees the online mask
without @cpu. No need for the second argument to padata_replace() since
@cpu is now already missing from the online mask.
Fixes: 33e54450683c ("padata: Handle empty padata cpumasks")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bf08949cc8b98b7d1e20cfbba169a5938d42dae8 ]
While running kprobe module test, find_module_all() caused
a suspicious RCU usage warning.
-----
=============================
WARNING: suspicious RCU usage
5.4.0-next-20191202+ #63 Not tainted
-----------------------------
kernel/module.c:619 RCU-list traversed in non-reader section!!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by rmmod/642:
#0: ffffffff8227da80 (module_mutex){+.+.}, at: __x64_sys_delete_module+0x9a/0x230
stack backtrace:
CPU: 0 PID: 642 Comm: rmmod Not tainted 5.4.0-next-20191202+ #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x71/0xa0
find_module_all+0xc1/0xd0
__x64_sys_delete_module+0xac/0x230
? do_syscall_64+0x12/0x1f0
do_syscall_64+0x50/0x1f0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4b6d49
-----
This is because list_for_each_entry_rcu(modules) is called
without rcu_read_lock(). This is safe because the module_mutex
is locked.
Pass lockdep_is_held(&module_mutex) to the list_for_each_entry_rcu()
to suppress this warning, This also fixes similar issue in
mod_find() and each_symbol_section().
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b527b638fd63ba791dc90a0a6e9a3035b10df52b ]
In the process of adding better error messages for sorting, I realized
that strsep was being used incorrectly and some of the error paths I
was expecting to be hit weren't and just fell through to the common
invalid key error case.
It also became obvious that for keyword assignments, it wasn't
necessary to save the full assignment and reparse it later, and having
a common empty-assignment check would also make more sense in terms of
error processing.
Change the code to fix these problems and simplify it for new error
message changes in a subsequent patch.
Link: http://lkml.kernel.org/r/1c3ef0b6655deaf345f6faee2584a0298ac2d743.1561743018.git.zanussi@kernel.org
Fixes: e62347d24534 ("tracing: Add hist trigger support for user-defined sorting ('sort=' param)")
Fixes: 7ef224d1d0e3 ("tracing: Add 'hist' event trigger command")
Fixes: a4072fe85ba3 ("tracing: Add a clock attribute for hist triggers")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit dfb6cd1e654315168e36d947471bd2a0ccd834ae ]
Looking through old emails in my INBOX, I came across a patch from Luis
Henriques that attempted to fix a race of two stat tracers registering the
same stat trace (extremely unlikely, as this is done in the kernel, and
probably doesn't even exist). The submitted patch wasn't quite right as it
needed to deal with clean up a bit better (if two stat tracers were the
same, it would have the same files).
But to make the code cleaner, all we needed to do is to keep the
all_stat_sessions_mutex held for most of the registering function.
Link: http://lkml.kernel.org/r/1410299375-20068-1-git-send-email-luis.henriques@canonical.com
Fixes: 002bb86d8d42f ("tracing/ftrace: separate events tracing and stats tracing engine")
Reported-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit afccc00f75bbbee4e4ae833a96c2d29a7259c693 ]
tracing_stat_init() was always returning '0', even on the error paths. It
now returns -ENODEV if tracing_init_dentry() fails or -ENOMEM if it fails
to created the 'trace_stat' debugfs directory.
Link: http://lkml.kernel.org/r/1410299381-20108-1-git-send-email-luis.henriques@canonical.com
Fixes: ed6f1c996bfe4 ("tracing: Check return value of tracing_init_dentry()")
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f6d061d617124abbd55396a3bc37b9bf7d33233c ]
In module_add_modinfo_attrs() if sysfs_create_file() fails
on the first iteration of the loop (so i = 0), we forget to
free the modinfo_attrs.
Fixes: bc6f2a757d52 ("kernel/module: Fix mem leak in module_add_modinfo_attrs")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit def97da136515cb289a14729292c193e0a93bc64 ]
Commit f92b070f2dc8 ("printk: Do not miss new messages when replaying
the log") introduced a new variable @exclusive_console_stop_seq to
store when an exclusive console should stop printing. It should be
set to the @console_seq value at registration. However, @console_seq
is previously set to @syslog_seq so that the exclusive console knows
where to begin. This results in the exclusive console immediately
reactivating all the other consoles and thus repeating the messages
for those consoles.
Set @console_seq after @exclusive_console_stop_seq has stored the
current @console_seq value.
Fixes: f92b070f2dc8 ("printk: Do not miss new messages when replaying the log")
Link: http://lkml.kernel.org/r/20191219115322.31160-1-john.ogness@linutronix.de
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 45178ac0cea853fe0e405bf11e101bdebea57b15 ]
Paul reported a very sporadic, rcutorture induced, workqueue failure.
When the planets align, the workqueue rescuer's self-migrate fails and
then triggers a WARN for running a work on the wrong CPU.
Tejun then figured that set_cpus_allowed_ptr()'s stop_one_cpu() call
could be ignored! When stopper->enabled is false, stop_machine will
insta complete the work, without actually doing the work. Worse, it
will not WARN about this (we really should fix this).
It turns out there is a small window where a freshly online'ed CPU is
marked 'online' but doesn't yet have the stopper task running:
BP AP
bringup_cpu()
__cpu_up(cpu, idle) --> start_secondary()
...
cpu_startup_entry()
bringup_wait_for_ap()
wait_for_ap_thread() <-- cpuhp_online_idle()
while (1)
do_idle()
... available to run kthreads ...
stop_machine_unpark()
stopper->enable = true;
Close this by moving the stop_machine_unpark() into
cpuhp_online_idle(), such that the stopper thread is ready before we
start the idle loop and schedule.
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Debugged-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6cf539a87a61a4fbc43f625267dbcbcf283872ed ]
This fixes a data-race where `atomic_t dynticks` is copied by value. The
copy is performed non-atomically, resulting in a data-race if `dynticks`
is updated concurrently.
This data-race was found with KCSAN:
==================================================================
BUG: KCSAN: data-race in dyntick_save_progress_counter / rcu_irq_enter
write to 0xffff989dbdbe98e0 of 4 bytes by task 10 on cpu 3:
atomic_add_return include/asm-generic/atomic-instrumented.h:78 [inline]
rcu_dynticks_snap kernel/rcu/tree.c:310 [inline]
dyntick_save_progress_counter+0x43/0x1b0 kernel/rcu/tree.c:984
force_qs_rnp+0x183/0x200 kernel/rcu/tree.c:2286
rcu_gp_fqs kernel/rcu/tree.c:1601 [inline]
rcu_gp_fqs_loop+0x71/0x880 kernel/rcu/tree.c:1653
rcu_gp_kthread+0x22c/0x3b0 kernel/rcu/tree.c:1799
kthread+0x1b5/0x200 kernel/kthread.c:255
<snip>
read to 0xffff989dbdbe98e0 of 4 bytes by task 154 on cpu 7:
rcu_nmi_enter_common kernel/rcu/tree.c:828 [inline]
rcu_irq_enter+0xda/0x240 kernel/rcu/tree.c:870
irq_enter+0x5/0x50 kernel/softirq.c:347
<snip>
Reported by Kernel Concurrency Sanitizer on:
CPU: 7 PID: 154 Comm: kworker/7:1H Not tainted 5.3.0+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
==================================================================
Signed-off-by: Marco Elver <elver@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: rcu@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fd6bc19d7676a060a171d1cf3dcbf6fd797eb05f ]
Tasks waiting within exp_funnel_lock() for an expedited grace period to
elapse can be starved due to the following sequence of events:
1. Tasks A and B both attempt to start an expedited grace
period at about the same time. This grace period will have
completed when the lower four bits of the rcu_state structure's
->expedited_sequence field are 0b'0100', for example, when the
initial value of this counter is zero. Task A wins, and thus
does the actual work of starting the grace period, including
acquiring the rcu_state structure's .exp_mutex and sets the
counter to 0b'0001'.
2. Because task B lost the race to start the grace period, it
waits on ->expedited_sequence to reach 0b'0100' inside of
exp_funnel_lock(). This task therefore blocks on the rcu_node
structure's ->exp_wq[1] field, keeping in mind that the
end-of-grace-period value of ->expedited_sequence (0b'0100')
is shifted down two bits before indexing the ->exp_wq[] field.
3. Task C attempts to start another expedited grace period,
but blocks on ->exp_mutex, which is still held by Task A.
4. The aforementioned expedited grace period completes, so that
->expedited_sequence now has the value 0b'0100'. A kworker task
therefore acquires the rcu_state structure's ->exp_wake_mutex
and starts awakening any tasks waiting for this grace period.
5. One of the first tasks awakened happens to be Task A. Task A
therefore releases the rcu_state structure's ->exp_mutex,
which allows Task C to start the next expedited grace period,
which causes the lower four bits of the rcu_state structure's
->expedited_sequence field to become 0b'0101'.
6. Task C's expedited grace period completes, so that the lower four
bits of the rcu_state structure's ->expedited_sequence field now
become 0b'1000'.
7. The kworker task from step 4 above continues its wakeups.
Unfortunately, the wake_up_all() refetches the rcu_state
structure's .expedited_sequence field:
wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
This results in the wakeup being applied to the rcu_node
structure's ->exp_wq[2] field, which is unfortunate given that
Task B is instead waiting on ->exp_wq[1].
On a busy system, no harm is done (or at least no permanent harm is done).
Some later expedited grace period will redo the wakeup. But on a quiet
system, such as many embedded systems, it might be a good long time before
there was another expedited grace period. On such embedded systems,
this situation could therefore result in a system hang.
This issue manifested as DPM device timeout during suspend (which
usually qualifies as a quiet time) due to a SCSI device being stuck in
_synchronize_rcu_expedited(), with the following stack trace:
schedule()
synchronize_rcu_expedited()
synchronize_rcu()
scsi_device_quiesce()
scsi_bus_suspend()
dpm_run_callback()
__device_suspend()
This commit therefore prevents such delays, timeouts, and hangs by
making rcu_exp_wait_wake() use its "s" argument consistently instead of
refetching from rcu_state.expedited_sequence.
Fixes: 3b5f668e715b ("rcu: Overlap wakeups with next expedited grace period")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 610dea36d3083a977e4f156206cbe1eaa2a532f0 ]
Commit 18cd8c93e69e ("rcu/nocb: Print gp/cb kthread hierarchy if
dump_tree") added print statements to rcu_organize_nocb_kthreads for
debugging, but incorrectly guarded them, causing the function to always
spew out its message.
This patch fixes it by guarding both pr_alert statements with dump_tree,
while also changing the second pr_alert to a pr_cont, to print the
hierarchy in a single line (assuming that's how it was supposed to
work).
Fixes: 18cd8c93e69e ("rcu/nocb: Print gp/cb kthread hierarchy if dump_tree")
Signed-off-by: Stefan Reiter <stefan@pimaker.at>
[ paulmck: Make single-nocbs-CPU GP kthreads look less erroneous. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit b562d140649966d4daedd0483a8fe59ad3bb465a upstream.
The check to ensure that the new written value into cpu.uclamp.{min,max}
is within range, [0:100], wasn't working because of the signed
comparison
7301 if (req.percent > UCLAMP_PERCENT_SCALE) {
7302 req.ret = -ERANGE;
7303 return req;
7304 }
# echo -1 > cpu.uclamp.min
# cat cpu.uclamp.min
42949671.96
Cast req.percent into u64 to force the comparison to be unsigned and
work as intended in capacity_from_percent().
# echo -1 > cpu.uclamp.min
sh: write error: Numerical result out of range
Fixes: 2480c093130f ("sched/uclamp: Extend CPU's cgroup controller")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e3728b50cd9be7d4b1469447cdf1feb93e3b7adb upstream.
It is theoretically possible for the ACPI EC GPE to be set after the
s2idle_ops->wake() called from s2idle_loop() has returned and before
the subsequent pm_wakeup_pending() check is carried out. If that
happens, the resulting wakeup event will cause the system to resume
even though it may be a spurious one.
To avoid that race, first make the ->wake() callback in struct
platform_s2idle_ops return a bool value indicating whether or not
to let the system resume and rearrange s2idle_loop() to use that
value instad of the direct pm_wakeup_pending() call if ->wake() is
present.
Next, rework acpi_s2idle_wake() to process EC events and check
pm_wakeup_pending() before re-arming the SCI for system wakeup
to prevent it from triggering prematurely and add comments to
that function to explain the rationale for the new code flow.
Fixes: 56b991849009 ("PM: sleep: Simplify suspend-to-idle control flow")
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7226017ad37a888915628e59a84a2d1e57b40707 upstream.
When a new cgroup is created, the effective uclamp value wasn't updated
with a call to cpu_util_update_eff() that looks at the hierarchy and
update to the most restrictive values.
Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
becomes online.
Without this change, the newly created cgroup uses the default
root_task_group uclamp values, which is 1024 for both uclamp_{min, max},
which will cause the rq to to be clamped to max, hence cause the
system to run at max frequency.
The problem was observed on Ubuntu server and was reproduced on Debian
and Buildroot rootfs.
By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
and add all tasks to it - which creates enough noise to keep the rq
uclamp value at max most of the time. Imitating this behavior makes the
problem visible in Buildroot too which otherwise looks fine since it's a
minimal userspace.
Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps")
Reported-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Doug Smythies <dsmythies@telus.net>
Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 003461559ef7a9bd0239bae35a22ad8924d6e9ad upstream.
Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
a perf ring buffer may lead to an integer underflow in locked memory
accounting. This may lead to the undesired behaviors, such as failures in
BPF map creation.
Address this by adjusting the accounting logic to take into account the
possibility that the amount of already locked memory may exceed the
current limit.
Fixes: c4b75479741c ("perf/core: Make the mlock accounting simple again")
Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit febac332a819f0e764aa4da62757ba21d18c182b upstream.
Kernel crashes inside QEMU/KVM are observed:
kernel BUG at kernel/time/timer.c:1154!
BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().
At the same time another cpu got:
general protection fault: 0000 [#1] SMP PTI of poinson pointer 0xdead000000000200 in:
__hlist_del at include/linux/list.h:681
(inlined by) detach_timer at kernel/time/timer.c:818
(inlined by) expire_timers at kernel/time/timer.c:1355
(inlined by) __run_timers at kernel/time/timer.c:1686
(inlined by) run_timer_softirq at kernel/time/timer.c:1699
Unfortunately kernel logs are badly scrambled, stacktraces are lost.
Printing the timer->function before the BUG_ON() pointed to
clocksource_watchdog().
The execution of clocksource_watchdog() can race with a sequence of
clocksource_stop_watchdog() .. clocksource_start_watchdog():
expire_timers()
detach_timer(timer, true);
timer->entry.pprev = NULL;
raw_spin_unlock_irq(&base->lock);
call_timer_fn
clocksource_watchdog()
clocksource_watchdog_kthread() or
clocksource_unbind()
spin_lock_irqsave(&watchdog_lock, flags);
clocksource_stop_watchdog();
del_timer(&watchdog_timer);
watchdog_running = 0;
spin_unlock_irqrestore(&watchdog_lock, flags);
spin_lock_irqsave(&watchdog_lock, flags);
clocksource_start_watchdog();
add_timer_on(&watchdog_timer, ...);
watchdog_running = 1;
spin_unlock_irqrestore(&watchdog_lock, flags);
spin_lock(&watchdog_lock);
add_timer_on(&watchdog_timer, ...);
BUG_ON(timer_pending(timer) || !timer->function);
timer_pending() -> true
BUG()
I.e. inside clocksource_watchdog() watchdog_timer could be already armed.
Check timer_pending() before calling add_timer_on(). This is sufficient as
all operations are synchronized by watchdog_lock.
Fixes: 75c5158f70c0 ("timekeeping: Update clocksource with stop_machine")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6f1a4891a5928a5969c87fa5a584844c983ec823 upstream.
Evan tracked down a subtle race between the update of the MSI message and
the device raising an interrupt internally on PCI devices which do not
support MSI masking. The update of the MSI message is non-atomic and
consists of either 2 or 3 sequential 32bit wide writes to the PCI config
space.
- Write address low 32bits
- Write address high 32bits (If supported by device)
- Write data
When an interrupt is migrated then both address and data might change, so
the kernel attempts to mask the MSI interrupt first. But for MSI masking is
optional, so there exist devices which do not provide it. That means that
if the device raises an interrupt internally between the writes then a MSI
message is sent built from half updated state.
On x86 this can lead to spurious interrupts on the wrong interrupt
vector when the affinity setting changes both address and data. As a
consequence the device interrupt can be lost causing the device to
become stuck or malfunctioning.
Evan tried to handle that by disabling MSI accross an MSI message
update. That's not feasible because disabling MSI has issues on its own:
If MSI is disabled the PCI device is routing an interrupt to the legacy
INTx mechanism. The INTx delivery can be disabled, but the disablement is
not working on all devices.
Some devices lose interrupts when both MSI and INTx delivery are disabled.
Another way to solve this would be to enforce the allocation of the same
vector on all CPUs in the system for this kind of screwed devices. That
could be done, but it would bring back the vector space exhaustion problems
which got solved a few years ago.
Fortunately the high address (if supported by the device) is only relevant
when X2APIC is enabled which implies interrupt remapping. In the interrupt
remapping case the affinity setting is happening at the interrupt remapping
unit and the PCI MSI message is programmed only once when the PCI device is
initialized.
That makes it possible to solve it with a two step update:
1) Target the MSI msg to the new vector on the current target CPU
2) Target the MSI msg to the new vector on the new target CPU
In both cases writing the MSI message is only changing a single 32bit word
which prevents the issue of inconsistency.
After writing the final destination it is necessary to check whether the
device issued an interrupt while the intermediate state #1 (new vector,
current CPU) was in effect.
This is possible because the affinity change is always happening on the
current target CPU. The code runs with interrupts disabled, so the
interrupt can be detected by checking the IRR of the local APIC. If the
vector is pending in the IRR then the interrupt is retriggered on the new
target CPU by sending an IPI for the associated vector on the target CPU.
This can cause spurious interrupts on both the local and the new target
CPU.
1) If the new vector is not in use on the local CPU and the device
affected by the affinity change raised an interrupt during the
transitional state (step #1 above) then interrupt entry code will
ignore that spurious interrupt. The vector is marked so that the
'No irq handler for vector' warning is supressed once.
2) If the new vector is in use already on the local CPU then the IRR check
might see an pending interrupt from the device which is using this
vector. The IPI to the new target CPU will then invoke the handler of
the device, which got the affinity change, even if that device did not
issue an interrupt
3) If the new vector is in use already on the local CPU and the device
affected by the affinity change raised an interrupt during the
transitional state (step #1 above) then the handler of the device which
uses that vector on the local CPU will be invoked.
expose issues in device driver interrupt handlers which are not prepared to
handle a spurious interrupt correctly. This not a regression, it's just
exposing something which was already broken as spurious interrupts can
happen for a lot of reasons and all driver handlers need to be able to deal
with them.
Reported-by: Evan Green <evgreen@chromium.org>
Debugged-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 485ec2ea9cf556e9c120e07961b7b459d776a115 upstream.
head is traversed using hlist_for_each_entry_rcu outside an RCU
read-side critical section but under the protection of dtab->index_lock.
Hence, add corresponding lockdep expression to silence false-positive
lockdep warnings, and harden RCU lists.
Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200123120437.26506-1-frextrite@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bbefa1dd6a6d53537c11624752219e39959d04fb upstream.
If the pcrypt template is used multiple times in an algorithm, then a
deadlock occurs because all pcrypt instances share the same
padata_instance, which completes requests in the order submitted. That
is, the inner pcrypt request waits for the outer pcrypt request while
the outer request is already waiting for the inner.
This patch fixes this by allocating a set of queues for each pcrypt
instance instead of using two global queues. In order to maintain
the existing user-space interface, the pinst structure remains global
so any sysfs modifications will apply to every pcrypt instance.
Note that when an update occurs we have to allocate memory for
every pcrypt instance. Should one of the allocations fail we
will abort the update without rolling back changes already made.
The new per-instance data structure is called padata_shell and is
essentially a wrapper around parallel_data.
Reproducer:
#include <linux/if_alg.h>
#include <sys/socket.h>
#include <unistd.h>
int main()
{
struct sockaddr_alg addr = {
.salg_type = "aead",
.salg_name = "pcrypt(pcrypt(rfc4106-gcm-aesni))"
};
int algfd, reqfd;
char buf[32] = { 0 };
algfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
bind(algfd, (void *)&addr, sizeof(addr));
setsockopt(algfd, SOL_ALG, ALG_SET_KEY, buf, 20);
reqfd = accept(algfd, 0, 0);
write(reqfd, buf, 32);
read(reqfd, buf, 16);
}
Reported-by: syzbot+56c7151cad94eec37c521f0e47d2eee53f9361c4@syzkaller.appspotmail.com
Fixes: 5068c7a883d1 ("crypto: pcrypt - Add pcrypt crypto parallelization wrapper")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 54a16ff6f2e50775145b210bcd94d62c3c2af117 ]
As function_graph tracer can run when RCU is not "watching", it can not be
protected by synchronize_rcu() it requires running a task on each CPU before
it can be freed. Calling schedule_on_each_cpu(ftrace_sync) needs to be used.
Link: https://lore.kernel.org/r/20200205131110.GT2935@paulmck-ThinkPad-P72
Cc: stable@vger.kernel.org
Fixes: b9b0c831bed26 ("ftrace: Convert graph filter to use hash tables")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 16052dd5bdfa16dbe18d8c1d4cde2ddab9d23177 ]
Because the function graph tracer can execute in sections where RCU is not
"watching", the rcu_dereference_sched() for the has needs to be open coded.
This is fine because the RCU "flavor" of the ftrace hash is protected by
its own RCU handling (it does its own little synchronization on every CPU
and does not rely on RCU sched).
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fd0e6852c407dd9aefc594f54ddcc21d84803d3b ]
Fix following instances of sparse error
kernel/trace/ftrace.c:5667:29: error: incompatible types in comparison
kernel/trace/ftrace.c:5813:21: error: incompatible types in comparison
kernel/trace/ftrace.c:5868:36: error: incompatible types in comparison
kernel/trace/ftrace.c:5870:25: error: incompatible types in comparison
Use rcu_dereference_protected to dereference the newly annotated pointer.
Link: http://lkml.kernel.org/r/20200205055701.30195-1-frextrite@gmail.com
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 24a9729f831462b1d9d61dc85ecc91c59037243f ]
Fix following instances of sparse error
kernel/trace/ftrace.c:5664:29: error: incompatible types in comparison
kernel/trace/ftrace.c:5785:21: error: incompatible types in comparison
kernel/trace/ftrace.c:5864:36: error: incompatible types in comparison
kernel/trace/ftrace.c:5866:25: error: incompatible types in comparison
Use rcu_dereference_protected to access the __rcu annotated pointer.
Link: http://lkml.kernel.org/r/20200201072703.17330-1-frextrite@gmail.com
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 07928d9bfc81640bab36f5190e8725894d93b659 upstream.
The function padata_flush_queues is fundamentally broken because
it cannot force padata users to complete the request that is
underway. IOW padata has to passively wait for the completion
of any outstanding work.
As it stands flushing is used in two places. Its use in padata_stop
is simply unnecessary because nothing depends on the queues to
be flushed afterwards.
The other use in padata_replace is more substantial as we depend
on it to free the old pd structure. This patch instead uses the
pd->refcnt to dynamically free the pd structure once all requests
are complete.
Fixes: 2b73b07ab8a4 ("padata: Flush the padata queues actively")
Cc: <stable@vger.kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6b6d188aae79a630957aefd88ff5c42af6553ee3 upstream.
The alarmtimer_rtc_add_device() function creates a wakeup source and then
tries to grab a module reference. If that fails the function returns early
with an error code, but fails to remove the wakeup source.
Cleanup this exit path so there is no dangling wakeup source, which is
named 'alarmtime' left allocated which will conflict with another RTC
device that may be registered later.
Fixes: 51218298a25e ("alarmtimer: Ensure RTC module is not unloaded")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200109155910.907-2-swboyd@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d380dcde9a07ca5de4805dee11f58a98ec0ad6ff ]
The patch 'tracing: Fix histogram code when expression has same var as
value' added code to return an existing variable reference when
creating a new variable reference, which resulted in var_ref_vals
slots being reused instead of being duplicated.
The implementation of the trace action assumes that the end of the
var_ref_vals array starting at action_data.var_ref_idx corresponds to
the values that will be assigned to the trace params. The patch
mentioned above invalidates that assumption, which means that each
param needs to explicitly specify its index into var_ref_vals.
This fix changes action_data.var_ref_idx to an array of var ref
indexes to account for that.
Link: https://lore.kernel.org/r/1580335695.6220.8.camel@kernel.org
Fixes: 8bcebc77e85f ("tracing: Fix histogram code when expression has same var as value")
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 0f394daef89b38d58c91118a2b08b8a1b316703b upstream.
Fix a memory leak reported by kmemleak:
unreferenced object 0xffff000bc6f50e80 (size 128):
comm "kworker/23:2", pid 201, jiffies 4294894947 (age 942.132s)
hex dump (first 32 bytes):
00 00 00 00 41 00 00 00 86 c0 03 00 00 00 00 00 ....A...........
00 a0 b2 c6 0b 00 ff ff 40 51 fd 10 00 80 ff ff ........@Q......
backtrace:
[<00000000e62d2240>] kmem_cache_alloc_trace+0x1a4/0x320
[<00000000279143c9>] irq_domain_push_irq+0x7c/0x188
[<00000000d9f4c154>] thunderx_gpio_probe+0x3ac/0x438
[<00000000fd09ec22>] pci_device_probe+0xe4/0x198
[<00000000d43eca75>] really_probe+0xdc/0x320
[<00000000d3ebab09>] driver_probe_device+0x5c/0xf0
[<000000005b3ecaa0>] __device_attach_driver+0x88/0xc0
[<000000004e5915f5>] bus_for_each_drv+0x7c/0xc8
[<0000000079d4db41>] __device_attach+0xe4/0x140
[<00000000883bbda9>] device_initial_probe+0x18/0x20
[<000000003be59ef6>] bus_probe_device+0x98/0xa0
[<0000000039b03d3f>] deferred_probe_work_func+0x74/0xa8
[<00000000870934ce>] process_one_work+0x1c8/0x470
[<00000000e3cce570>] worker_thread+0x1f8/0x428
[<000000005d64975e>] kthread+0xfc/0x128
[<00000000f0eaa764>] ret_from_fork+0x10/0x18
Fixes: 495c38d3001f ("irqdomain: Add irq_domain_{push,pop}_irq() functions")
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200120043547.22271-1-haokexin@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c51f83c315c392d9776c33eb16a2fe1349d65c7f upstream.
The rcu_node structure's ->expmask field is updated only when holding the
->lock, but is also accessed locklessly. This means that all ->expmask
updates must use WRITE_ONCE() and all reads carried out without holding
->lock must use READ_ONCE(). This commit therefore changes the lockless
->expmask read in rcu_read_unlock_special() to use READ_ONCE().
Reported-by: syzbot+99f4ddade3c22ab0cf23@syzkaller.appspotmail.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 844a378de3372c923909681706d62336d702531e upstream.
The ->srcu_last_gp_end field is accessed from any CPU at any time
by synchronize_srcu(), so non-initialization references need to use
READ_ONCE() and WRITE_ONCE(). This commit therefore makes that change.
Reported-by: syzbot+08f3e9d26e5541e1ecf2@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6935c3983b246d5fbfebd3b891c825e65c118f2d upstream.
The rcu_gp_fqs_check_wake() function uses rcu_preempt_blocked_readers_cgp()
to read ->gp_tasks while other cpus might overwrite this field.
We need READ_ONCE()/WRITE_ONCE() pairs to avoid compiler
tricks and KCSAN splats like the following :
BUG: KCSAN: data-race in rcu_gp_fqs_check_wake / rcu_preempt_deferred_qs_irqrestore
write to 0xffffffff85a7f190 of 8 bytes by task 7317 on cpu 0:
rcu_preempt_deferred_qs_irqrestore+0x43d/0x580 kernel/rcu/tree_plugin.h:507
rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
__rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
rcu_read_unlock include/linux/rcupdate.h:645 [inline]
__ip_queue_xmit+0x3b0/0xa40 net/ipv4/ip_output.c:533
ip_queue_xmit+0x45/0x60 include/net/ip.h:236
__tcp_transmit_skb+0xdeb/0x1cd0 net/ipv4/tcp_output.c:1158
__tcp_send_ack+0x246/0x300 net/ipv4/tcp_output.c:3685
tcp_send_ack+0x34/0x40 net/ipv4/tcp_output.c:3691
tcp_cleanup_rbuf+0x130/0x360 net/ipv4/tcp.c:1575
tcp_recvmsg+0x633/0x1a30 net/ipv4/tcp.c:2179
inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
sock_recvmsg_nosec net/socket.c:871 [inline]
sock_recvmsg net/socket.c:889 [inline]
sock_recvmsg+0x92/0xb0 net/socket.c:885
sock_read_iter+0x15f/0x1e0 net/socket.c:967
call_read_iter include/linux/fs.h:1864 [inline]
new_sync_read+0x389/0x4f0 fs/read_write.c:414
read to 0xffffffff85a7f190 of 8 bytes by task 10 on cpu 1:
rcu_gp_fqs_check_wake kernel/rcu/tree.c:1556 [inline]
rcu_gp_fqs_check_wake+0x93/0xd0 kernel/rcu/tree.c:1546
rcu_gp_fqs_loop+0x36c/0x580 kernel/rcu/tree.c:1611
rcu_gp_kthread+0x143/0x220 kernel/rcu/tree.c:1768
kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 10 Comm: rcu_preempt Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
[ paulmck: Added another READ_ONCE() for RCU CPU stall warnings. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 15c7c972cd26d89a26788e609c53b5a465324a6c upstream.
The rcu_node structure's ->expmask field is accessed locklessly when
starting a new expedited grace period and when reporting an expedited
RCU CPU stall warning. This commit therefore handles the former by
taking a snapshot of ->expmask while the lock is held and the latter
by applying READ_ONCE() to lockless reads and WRITE_ONCE() to the
corresponding updates.
Link: https://lore.kernel.org/lkml/CANpmjNNmSOagbTpffHr4=Yedckx9Rm2NuGqC9UqE+AOz5f1-ZQ@mail.gmail.com
Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|