Age | Commit message (Collapse) | Author |
|
Generally the hypervisor decides to allocate a window on different
VAS instances. But if user space wishes to allocate on the current VAS
instance where the process is executing, the kernel has to pass
associativity domain IDs to allocate VAS window HCALL.
To determine the associativity domain IDs for the current CPU,
smp_processor_id() is passed to node associativity HCALL which may
return H_P2 (-55) error during DLPAR CPU event. This is because Linux
CPU numbers (smp_processor_id()) are not the same as the hypervisor's
view of CPU numbers.
Fix the issue by passing hard_smp_processor_id() with
VPHN_FLAG_VCPU flag (PAPR 14.11.6.1 H_HOME_NODE_ASSOCIATIVITY).
Fixes: b22f2d88e435 ("powerpc/pseries/vas: Integrate API with open/close windows")
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
[mpe: Update change log to mention Linux vs HV CPU numbers]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/55380253ea0c11341824cd4c0fc6bbcfc5752689.camel@linux.ibm.com
|
|
PAPR specifies accumulated virtual processor wait intervals that relate
to partition scheduling interval times. Implement these counters in the
same way as they are repoted by dtl.
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220908132545.4085849-5-npiggin@gmail.com
|
|
Merge some KVM commits we are keeping in our topic branch.
|
|
The data storage interrupt (DSI) error will be generated when the
paste operation is issued on the suspended Nest Accelerator (NX)
window due to NX state changes. The hypervisor expects the
partition to ignore this error during page fault handling.
To differentiate DSI caused by an actual HW configuration or by
the NX window, a new “ibm,pi-features” type value is defined.
Byte 0, bit 3 of pi-attribute-specifier-type is now defined to
indicate this DSI error. If this error is not ignored, the user
space can get SIGBUS when the NX request is issued.
This patch adds changes to read ibm,pi-features property and ignore
DSI error during page fault handling if MMU_FTR_NX_DSI is defined.
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
[mpe: Mention PAPR version in comment]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b9cd844b85eb8f70459109ce1b14e44c4cc85fa7.camel@linux.ibm.com
|
|
On little endian the stack frame marker appears reversed when dumping
memory sequentially, as is typical in xmon or gdb, eg:
c000000004733e40 0000000000000000 0000000000000000 |................|
c000000004733e50 0000000000000000 0000000000000000 |................|
c000000004733e60 0000000000000000 0000000000000000 |................|
c000000004733e70 5347455200000000 0000000000000000 |SGER............|
c000000004733e80 a700000000000000 708897f7ff7f0000 |........p.......|
c000000004733e90 0073428fff7f0000 208997f7ff7f0000 |.sB..... .......|
c000000004733ea0 0100000000000000 ffffffffffffffff |................|
c000000004733eb0 0000000000000000 0000000000000000 |................|
To make it easier to recognise, reverse the value on little endian, so
it always appears as "REGS", eg:
c000000004733e70 5245475300000000 0000000000000000 |REGS............|
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220927150419.1503001-2-mpe@ellerman.id.au
|
|
Now that the stack frame regs marker is only 32-bits it is not as
obvious in memory dumps and easier to miss, eg:
c000000004733e40 0000000000000000 0000000000000000 |................|
c000000004733e50 0000000000000000 0000000000000000 |................|
c000000004733e60 0000000000000000 0000000000000000 |................|
c000000004733e70 7367657200000000 0000000000000000 |sger............|
c000000004733e80 a700000000000000 708897f7ff7f0000 |........p.......|
c000000004733e90 0073428fff7f0000 208997f7ff7f0000 |.sB..... .......|
c000000004733ea0 0100000000000000 ffffffffffffffff |................|
c000000004733eb0 0000000000000000 0000000000000000 |................|
So make it upper case to make it stand out a bit more:
c000000004733e70 5347455200000000 0000000000000000 |SGER............|
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220927150419.1503001-1-mpe@ellerman.id.au
|
|
I found a null pointer reference in arch_prepare_kprobe():
# echo 'p cmdline_proc_show' > kprobe_events
# echo 'p cmdline_proc_show+16' >> kprobe_events
Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
BUG: Kernel NULL pointer dereference on read at 0x00000000
Faulting instruction address: 0xc000000000050bfc
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 122 Comm: sh Not tainted 6.0.0-rc3-00007-gdcf8e5633e2e #10
NIP: c000000000050bfc LR: c000000000050bec CTR: 0000000000005bdc
REGS: c0000000348475b0 TRAP: 0300 Not tainted (6.0.0-rc3-00007-gdcf8e5633e2e)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 88002444 XER: 20040006
CFAR: c00000000022d100 DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0
...
NIP arch_prepare_kprobe+0x10c/0x2d0
LR arch_prepare_kprobe+0xfc/0x2d0
Call Trace:
0xc0000000012f77a0 (unreliable)
register_kprobe+0x3c0/0x7a0
__register_trace_kprobe+0x140/0x1a0
__trace_kprobe_create+0x794/0x1040
trace_probe_create+0xc4/0xe0
create_or_delete_trace_kprobe+0x2c/0x80
trace_parse_run_command+0xf0/0x210
probes_write+0x20/0x40
vfs_write+0xfc/0x450
ksys_write+0x84/0x140
system_call_exception+0x17c/0x3a0
system_call_vectored_common+0xe8/0x278
--- interrupt: 3000 at 0x7fffa5682de0
NIP: 00007fffa5682de0 LR: 0000000000000000 CTR: 0000000000000000
REGS: c000000034847e80 TRAP: 3000 Not tainted (6.0.0-rc3-00007-gdcf8e5633e2e)
MSR: 900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 44002408 XER: 00000000
The address being probed has some special:
cmdline_proc_show: Probe based on ftrace
cmdline_proc_show+16: Probe for the next instruction at the ftrace location
The ftrace-based kprobe does not generate kprobe::ainsn::insn, it gets
set to NULL. In arch_prepare_kprobe() it will check for:
...
prev = get_kprobe(p->addr - 1);
preempt_enable_no_resched();
if (prev && ppc_inst_prefixed(ppc_inst_read(prev->ainsn.insn))) {
...
If prev is based on ftrace, 'ppc_inst_read(prev->ainsn.insn)' will occur
with a null pointer reference. At this point prev->addr will not be a
prefixed instruction, so the check can be skipped.
Check if prev is ftrace-based kprobe before reading 'prev->ainsn.insn'
to fix this problem.
Fixes: b4657f7650ba ("powerpc/kprobes: Don't allow breakpoints on suffixes")
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
[mpe: Trim oops]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220923093253.177298-1-lihuafei1@huawei.com
|
|
Return the value opal_npu_spa_clear_cache() directly instead of storing
it in another redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220906072006.337099-1-ye.xingchen@zte.com.cn
|
|
Return the value vas_register_coproc_api() directly instead of storing it
in another redundant variable.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220825072657.229168-1-ye.xingchen@zte.com.cn
|
|
At boot time, it is not necessary to delay between polls of
cpu_callin_map when waiting for a kicked CPU to come up. Remove the
delay intervals, but preserve the overall deadline (five seconds).
At run time, the first poll result is usually negative and we incur a
sleeping wait. If we spin on the callin word for a short time first,
we can reduce __cpu_up() from dozens of milliseconds to under 1ms in
the common case on a P9 LPAR:
$ ppc64_cpu --smt=off
$ bpftrace -e 'kprobe:__cpu_up {
@start[tid] = nsecs;
}
kretprobe:__cpu_up /@start[tid]/ {
@us = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}' -c 'ppc64_cpu --smt=on'
Before:
@us:
[16K, 32K) 85 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K) 13 |@@@@@@@ |
After:
@us:
[128, 256) 95 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512) 3 |@ |
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926220250.157022-1-nathanl@linux.ibm.com
|
|
The error injection facility on pseries VMs allows corruption of
arbitrary guest memory, potentially enabling a sufficiently privileged
user to disable lockdown or perform other modifications of the running
kernel via the rtas syscall.
Block the PAPR error injection facility from being opened or called
when locked down.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Acked-by: Paul Moore <paul@paul-moore.com> (LSM)
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926131643.146502-3-nathanl@linux.ibm.com
|
|
The /proc/powerpc/ofdt interface allows the root user to freely alter
the in-kernel device tree, enabling arbitrary physical address writes
via drivers that could bind to malicious device nodes, thus making it
possible to disable lockdown.
Historically this interface has been used on the pseries platform to
facilitate the runtime addition and removal of processor, memory, and
device resources (aka Dynamic Logical Partitioning or DLPAR). Years
ago, the processor and memory use cases were migrated to designs that
happen to be lockdown-friendly: device tree updates are communicated
directly to the kernel from firmware without passing through untrusted
user space. I/O device DLPAR via the "drmgr" command in powerpc-utils
remains the sole legitimate user of /proc/powerpc/ofdt, but it is
already broken in lockdown since it uses /dev/mem to allocate argument
buffers for the rtas syscall. So only illegitimate uses of the
interface should see a behavior change when running on a locked down
kernel.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Acked-by: Paul Moore <paul@paul-moore.com> (LSM)
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926131643.146502-2-nathanl@linux.ibm.com
|
|
'extern' keyword is pointless and deprecated for function prototypes.
Signed-off-by: Pali Rohár <pali@kernel.org>
Suggested-by: Gabriel Paubert <paubert@iram.es>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220822231751.16973-1-pali@kernel.org
|
|
uImage boot wrapper should not use SPE instructions, like kernel itself.
Boot wrapper has already disabled Altivec and VSX instructions but not SPE.
Options -mno-spe and -mspe=no already set when compilation of kernel, but
not when compiling uImage wrapper yet. Fix it.
Cc: stable@vger.kernel.org
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220827134454.17365-1-pali@kernel.org
|
|
There are still some board device tree files without Power ISA properties
which have Freescale e500v1 cores, namely those which are based on
Freescale mpc8540, mpc8541, mpc8555 and mpc8560 processors.
So include newly introduced e500v1_power_isa.dtsi file in devices tree
files with those processors.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220902212103.22534-2-pali@kernel.org
|
|
Commit 2eb28006431c ("powerpc/e500v2: Add Power ISA properties to comply
with ePAPR 1.1") introduced new include file e500v2_power_isa.dtsi and
should have used it for all e500v2 platforms. But apparently it was used
also for e500v1 platforms mpc8540, mpc8541, mpc8555 and mpc8560.
e500v1 cores compared to e500v2 do not support double precision floating
point SPE instructions. Hence power-isa-sp.fd should not be set on e500v1
platforms, which is in e500v2_power_isa.dtsi include file.
Fix this issue by introducing a new e500v1_power_isa.dtsi include file and
use it in all e500v1 device tree files.
Fixes: 2eb28006431c ("powerpc/e500v2: Add Power ISA properties to comply with ePAPR 1.1")
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220902212103.22534-1-pali@kernel.org
|
|
For PERF_SAMPLE_BRANCH_STACK sample type, different branch_sample_type,
ie branch filters are supported. The testcase "bhrb_filter_map_test"
tests the valid and invalid filter maps in different powerpc platforms.
Update this testcase to include scenario to cover multiple branch
filters at sametime. Since powerpc doesn't support multiple filters at
sametime, expect failure during perf_event_open.
Reported-by: Disha Goel <disgoel@linux.vnet.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220921145255.20972-3-atrajeev@linux.vnet.ibm.com
|
|
For PERF_SAMPLE_BRANCH_STACK sample type, different branch_sample_type
ie branch filters are supported. The branch filters are requested via
event attribute "branch_sample_type". Multiple branch filters can be
passed in event attribute. eg:
$ perf record -b -o- -B --branch-filter any,ind_call true
None of the Power PMUs support having multiple branch filters at
the same time. Branch filters for branch stack sampling is set via MMCRA
IFM bits [32:33]. But currently when requesting for multiple filter
types, the "perf record" command does not report any error.
eg:
$ perf record -b -o- -B --branch-filter any,save_type true
$ perf record -b -o- -B --branch-filter any,ind_call true
The "bhrb_filter_map" function in PMU driver code does the validity
check for supported branch filters. But this check is done for single
filter. Hence "perf record" will proceed here without reporting any
error.
Fix power_pmu_event_init() to return EOPNOTSUPP when multiple branch
filters are requested in the event attr.
After the fix:
$ perf record --branch-filter any,ind_call -- ls
Error:
cycles: PMU Hardware doesn't support sampling/overflow-interrupts.
Try 'perf stat'
Reported-by: Disha Goel <disgoel@linux.vnet.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Tested-by: Disha Goel<disgoel@linux.vnet.ibm.com>
Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
[mpe: Tweak comment and change log wording]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220921145255.20972-1-atrajeev@linux.vnet.ibm.com
|
|
Ensure r13 is zero from very early in boot until it gets set to the
boot paca pointer. This allows early program and mce handlers to halt
if there is no valid paca, rather than potentially run off into the
weeds. This preserves register and memory contents for low level
debugging tools.
Nothing could be printed to console at this point in any case because
even udbg is only set up after the boot paca is set, so this shouldn't
be missed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926055620.2676869-6-npiggin@gmail.com
|
|
The idea is to get to the point where if r13 is non-zero, then it should
contain a reasonable paca. This can be used in early boot program check
and machine check handlers to avoid running off into the weeds if they
hit before r13 has a paca.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926055620.2676869-5-npiggin@gmail.com
|
|
relocate() uses r13 in early boot before it is used for the paca. Use
a different register for this so r13 is kept unchanged until it is
set to the paca pointer.
Avoid r14 as well while we're here, there's no reason not to use the
volatile registers which is a bit less surprising, and r14 could be used
as another fixed reg one day.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926055620.2676869-4-npiggin@gmail.com
|
|
Use the early boot interrupt fixup in the machine check handler to allow
the machine check handler to run before interrupt endian is set up.
Branch to an early boot handler that just does a basic crash, which
allows it to run before ppc_md is set up. MSR[ME] is enabled on the boot
CPU earlier, and the machine check stack is temporarily set to the
middle of the init task stack.
This allows machine checks (e.g., due to invalid data access in real
mode) to print something useful earlier in boot (as soon as udbg is set
up, if CONFIG_PPC_EARLY_DEBUG=y).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926055620.2676869-3-npiggin@gmail.com
|
|
In preparation for using this sequence in machine check interrupt, move
it into a macro, with a small change to make it position independent.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926055620.2676869-2-npiggin@gmail.com
|
|
The interrupt entry code carefully saves a minimal number of registers,
so in some places the TOC is required, it is loaded into a different
register, so provide a macro that can supply an alternate TOC register.
This continues to use got addressing because TOC-relative results in
"got/toc optimization is not supported" messages by the linker. Having
r2 be one of the saved registers and using that for TOC addressing may
be the best way to avoid that and switch this to TOC addressing.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926034057.2360083-6-npiggin@gmail.com
|
|
A later change stops the kernel using r2 and loads it with a poison
value. Provide a PACATOC loading abstraction which can hide this
detail.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926034057.2360083-5-npiggin@gmail.com
|
|
There is no need to use GOT addressing within the kernel.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926034057.2360083-4-npiggin@gmail.com
|
|
Use helper macros to access global variables, and place them in .data
sections rather than in .toc. Putting addresses in TOC is not required
because the kernel is linked with a single TOC.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926034057.2360083-3-npiggin@gmail.com
|
|
Using a 32-bit constant for this marker allows it to be loaded with
two ALU instructions, like 32-bit. This avoids a TOC entry and a
TOC load that depends on the r2 value that has just been loaded from
the PACA.
This changes the value for 32-bit as well, so both have the same
value in the low 4 bytes and 64-bit has 0 in the top bytes.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926034057.2360083-2-npiggin@gmail.com
|
|
This adds basic POWER10_CPU option, which builds with -mcpu=power10.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220923033004.536127-1-npiggin@gmail.com
|
|
When the migration is initiated, the hypervisor changes VAS
mappings as part of pre-migration event. Then the OS gets the
migration event which closes all VAS windows before the migration
starts. NX generates continuous faults until windows are closed
and the user space can not differentiate these NX faults coming
from the actual migration. So to reduce this time window, close
VAS windows first in pseries_migrate_partition().
Signed-off-by: Haren Myneni <haren@linux.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d8efade91dda831c9ed4abb226dab627da594c5f.camel@linux.ibm.com
|
|
irq replay is quite complicated because of softirq processing which
itself enables and disables irqs. Several considerations need to be
accounted for due to this, and they are not clearly documented.
Refactor the irq replay code a bit to tidy and deduplicate some common
functions. Add comments, debug checks.
This has a minor functional change that irq tracing enable/disable is
done after each interrupt replayed, rather than after a batch. It also
re-sets state to IRQS_ALL_DISABLED after an interrupt, which doesn't
matter much because interrupts are hard disabled at this point, but it
is more consistent with how interrupt handlers are called.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926054305.2671436-8-npiggin@gmail.com
|
|
BUG/WARN are handled with a program interrupt which can turn into an
infinite recursion when there are bugs in interrupt handler entry
(which can be irritated by bugs in other parts of the code).
There is one feeble attempt to avoid this recursion, but it misses
several cases. Make a tidier macro for this and switch most bugs in
the interrupt entry wrapper over to use it.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926054305.2671436-7-npiggin@gmail.com
|
|
Prior changes eliminated cases of masked PACA_IRQ_MUST_HARD_MASK
interrupts that re-fire due to MSR[EE] being enabled while they are
pending. Add a debug check in the masked interrupt handler to catch
if this occurs.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926054305.2671436-6-npiggin@gmail.com
|
|
When irqs are soft-disabled, MSR[EE] is volatile and can change from
1 to 0 asynchronously (if a PACA_IRQ_MUST_HARD_MASK interrupt hits).
So it can not be used to check hard IRQ enabled status, except to
confirm it is disabled.
ppc64_runlatch_on/off functions use MSR this way to decide whether to
re-enable MSR[EE] after disabling it, which leads to MSR[EE] being
enabled when it shouldn't be (when a PACA_IRQ_MUST_HARD_MASK had
disabled it between reading the MSR and clearing EE).
This has been tolerated in the kernel previously, and it doesn't seem
to cause a problem, but it is unexpected and may trip warnings or cause
other problems as we tighten up this state management. Fix this by only
re-enabling if PACA_IRQ_HARD_DIS is clear.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926054305.2671436-5-npiggin@gmail.com
|
|
becomes pending
If a synchronous interrupt (e.g., hash fault) is taken inside an
irqs-disabled region which has MSR[EE]=1, then an asynchronous interrupt
that is PACA_IRQ_MUST_HARD_MASK (e.g., PMI) is taken inside the
synchronous interrupt handler, then the synchronous interrupt will
return with MSR[EE]=1 and the asynchronous interrupt fires again.
If the asynchronous interrupt is a PMI and the original context does not
have PMIs disabled (only Linux IRQs), the asynchronous interrupt will
fire despite having the PMI marked soft pending. This can confuse the
perf code and cause warnings.
This patch changes the interrupt return so that irqs-disabled MSR[EE]=1
contexts will be returned to with MSR[EE]=0 if a PACA_IRQ_MUST_HARD_MASK
interrupt has become pending in the meantime.
The longer explanation for what happens:
1. local_irq_disable()
2. Hash fault interrupt fires, do_hash_fault handler runs
3. interrupt_enter_prepare() sets IRQS_ALL_DISABLED
4. interrupt_enter_prepare() sets MSR[EE]=1
5. PMU interrupt fires, masked handler runs
6. Masked handler marks PMI pending
7. Masked handler returns with PACA_IRQ_HARD_DIS set, MSR[EE]=0
8. do_hash_fault interrupt return handler runs
9. interrupt_exit_kernel_prepare() clears PACA_IRQ_HARD_DIS
10. interrupt returns with MSR[EE]=1
11. PMU interrupt fires, perf handler runs
Fixes: 4423eb5ae32e ("powerpc/64/interrupt: make normal synchronous interrupts enable MSR[EE] if possible")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926054305.2671436-4-npiggin@gmail.com
|
|
This prevents interrupts in early boot (e.g., program check) from
enabling MSR[EE], potentially causing endian mismatch or other
crashes when reporting early boot traps.
Fixes: 4423eb5ae32ec ("powerpc/64/interrupt: make normal synchronous interrupts enable MSR[EE] if possible")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926054305.2671436-3-npiggin@gmail.com
|
|
Commit 171476775d32 ("context_tracking: Convert state to atomic_t")
added a CONTEXT_IDLE state which can be encountered by interrupts from
kernel mode in the idle thread, causing a false positive warning.
Fixes: 171476775d32 ("context_tracking: Convert state to atomic_t")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926054305.2671436-2-npiggin@gmail.com
|
|
KFENCE support was added for ppc32 in commit 90cbac0e995d
("powerpc: Enable KFENCE for PPC32").
Enable KFENCE on ppc64 architecture with hash and radix MMUs.
It uses the same mechanism as debug pagealloc to
protect/unprotect pages. All KFENCE kunit tests pass on both
MMUs.
KFENCE memory is initially allocated using memblock but is
later marked as SLAB allocated. This necessitates the change
to __pud_free to ensure that the KFENCE pages are freed
appropriately.
Based on previous work by Christophe Leroy and Jordan Niethe.
Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Reviewed-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926075726.2846-4-nicholas@linux.ibm.com
|
|
If the page is already mapped resp. already unmapped, bail out.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926075726.2846-3-nicholas@linux.ibm.com
|
|
debug_pagealloc_enabled() is always defined and constant folds to
'false' when CONFIG_DEBUG_PAGEALLOC is not enabled.
Remove the #ifdefs, the code and associated static variables will
be optimised out by the compiler when CONFIG_DEBUG_PAGEALLOC is
not defined.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926075726.2846-2-nicholas@linux.ibm.com
|
|
There is support for DEBUG_PAGEALLOC on hash but not on radix.
Add support on radix.
Signed-off-by: Nicholas Miehlbradt <nicholas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220926075726.2846-1-nicholas@linux.ibm.com
|
|
Update the 64s GENERIC_CPU option. POWER4 support has been dropped, so
make that clear in the option name. The POWER5_CPU option is dropped
because it's uncommon, and GENERIC_CPU covers it.
-mtune= before power8 is dropped because the minimum gcc version
supports power8, and tuning is made consistent between big and little
endian.
A 970 option is added for PowerPC 970 / G5 because they still have a
user base, and -mtune=power8 does not generate good code for the 970.
This also updates the ISA versions document to add Power4/Power4+
because I didn't realise Power4+ used 2.01.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220921014103.587954-2-npiggin@gmail.com
|
|
Big-endian GENERIC_CPU supports 970, but builds with -mcpu=power5.
POWER5 is ISA v2.02 whereas 970 is v2.01 plus Altivec. 2.02 added
the popcntb instruction which a compiler might use.
Use -mcpu=power4.
Fixes: 471d7ff8b51b ("powerpc/64s: Remove POWER4 support")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220921014103.587954-1-npiggin@gmail.com
|
|
We want to move away from using SMT priority updates for cpu_relax, and
use a 'wait' instruction which is similar to x86. As well as being a
much better fit for what everybody else uses and tests with, priority
nops are stateful which is nasty (interrupts have to consider they might
be taken at a different priority), and they're expensive to execute,
similar to a mtSPR which can effect other threads in the pipe.
This has shown to give results that are less affected by code alignment
on benchmarks that cause a lot of spin waiting (e.g., rwsem contention
on unixbench filesystem benchmarks) on POWER10.
QEMU TCG only supports this instruction correctly since v7.1, versions
without the fix may cause hangs whne running POWER10 CPUs.
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix checkpatch warnings RE the macros]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220920122259.363092-2-npiggin@gmail.com
|
|
The wait instruction encoding changed between ISA v2.07 and ISA v3.0.
In v3.1 the instruction gained a new field.
Update the PPC_WAIT macro to the current encoding. Rename the older
incompatible one with a _v203 suffix as it was introduced in v2.03
(the WC field was introduced in v2.07 but the kernel only uses WC=0).
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220920122259.363092-1-npiggin@gmail.com
|
|
Setting DEC to maximum at the start of the timer interrupt is not
necessary and can be avoided for performance when MSR[EE] is not
enabled during the handler as explained in commit 0faf20a1ad16
("powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless
perf is in use"), where this change was first attempted.
The idea is that the timer interrupt runs with MSR[EE]=0, and at the end
of the interrupt DEC is programmed to the next timer interval, so there
is no need to clear the decrementer exception before then.
When the above commit was merged, that was not quite true. The low res
timer subsystem had some cases in the oneshot timer code where if the
tick was to be stopped and no timers active, the clock device would not
get the ->set_state_oneshot_stopped() call, so DEC would not get
reprogrammed, and this would hang taking continual timer interrupts.
So this was reverted in commit d2b9be1f4af5 ("powerpc/time: Always set
decrementer in timer_interrupt()"), which was a partial revert of the
above commit.
Commit 62c1256d5447 ("timers/nohz: Switch to ONESHOT_STOPPED in the
low-res handler when the tick is stopped") was later merged to fix this
missing case in the timer subsystem, so now the behaviour can be
restored.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220909142457.278032-1-npiggin@gmail.com
|
|
Currently powerpc early debugging contains lot of platform specific
options, but does not support standard UART / serial 16550 console.
Later legacy_serial.c code supports registering UART as early debug console
from device tree but it is not early during booting, but rather later after
machine description code finishes.
So for real early debugging via UART is current code unsuitable.
Add support for new early debugging option CONFIG_PPC_EARLY_DEBUG_16550
which enable Serial 16550 console on address defined by new option
CONFIG_PPC_EARLY_DEBUG_16550_PHYSADDR and by stride by option
CONFIG_PPC_EARLY_DEBUG_16550_STRIDE.
With this change it is possible to debug powerpc machine descriptor code.
For example this early debugging code can print on serial console also
"No suitable machine description found" error which is done before
legacy_serial.c code.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220822231501.16827-1-pali@kernel.org
|
|
Since commit e641eb03ab2b0 ("powerpc: Fix up the kdump base cap to
128M") memory for kdump kernel has been reserved at an offset of 128MB.
This held up well for a long time before running into boot failure on
LPARs having a lot of cores. Commit 7c5ed82b800d8 ("powerpc: Set
crashkernel offset to mid of RMA region") fixed this boot failure by
moving the offset to mid of RMA region. This change meant the offset is
either 256MB or 512MB on LPARs as ppc64_rma_size was 512MB or 1024MB
owing to commit 103a8542cb35b ("powerpc/book3s64/ radix: Fix boot
failure with large amount of guest memory").
But ppc64_rma_size can be larger as well with newer f/w. So, limit
crashkernel reservation offset to 512MB to avoid running into boot
failures during kdump kernel boot, due to RTAS or other allocation
restrictions.
Also, while here, use SZ_128M instead of opening coding it.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220912065031.57416-1-hbathini@linux.ibm.com
|
|
Implement syscall wrapper as per s390, x86, arm64. When enabled
cause handlers to accept parameters from a stack frame rather than
from user scratch register state. This allows for user registers to be
safely cleared in order to reduce caller influence on speculation
within syscall routine. The wrapper is a macro that emits syscall
handler symbols that call into the target handler, obtaining its
parameters from a struct pt_regs on the stack.
As registers are already saved to the stack prior to calling
system_call_exception, it appears that this function is executed more
efficiently with the new stack-pointer convention than with parameters
passed by registers, avoiding the allocation of a stack frame for this
method. On a 32-bit system, we see >20% performance increases on the
null_syscall microbenchmark, and on a Power 8 the performance gains
amortise the cost of clearing and restoring registers which is
implemented at the end of this series, seeing final result of ~5.6%
performance improvement on null_syscall.
Syscalls are wrapped in this fashion on all platforms except for the
Cell processor as this commit does not provide SPU support. This can be
quickly fixed in a successive patch, but requires spu_sys_callback to
allocate a pt_regs structure to satisfy the wrapped calling convention.
Co-developed-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmai.com>
[mpe: Make incompatible with COMPAT to retain clearing of high bits of args]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220921065605.1051927-22-rmclure@linux.ibm.com
|
|
Change system_call_exception arguments to pass a pointer to a stack
frame container caller state, as well as the original r0, which
determines the number of the syscall. This has been observed to yield
improved performance to passing them by registers, circumventing the
need to allocate a stack frame.
Signed-off-by: Rohan McLure <rmclure@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Retain clearing of high bits of args for compat tasks]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220921065605.1051927-21-rmclure@linux.ibm.com
|