Age | Commit message (Collapse) | Author |
|
When a large VM, specifically one that holds a significant number of PTEs,
gets abruptly destroyed, the following warning is seen during the
page-table walk:
sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule
CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE
Tainted: [O]=OOT_MODULE
Call trace:
show_stack+0x20/0x38 (C)
dump_stack_lvl+0x3c/0xb8
dump_stack+0x18/0x30
resched_latency_warn+0x7c/0x88
sched_tick+0x1c4/0x268
update_process_times+0xa8/0xd8
tick_nohz_handler+0xc8/0x168
__hrtimer_run_queues+0x11c/0x338
hrtimer_interrupt+0x104/0x308
arch_timer_handler_phys+0x40/0x58
handle_percpu_devid_irq+0x8c/0x1b0
generic_handle_domain_irq+0x48/0x78
gic_handle_irq+0x1b8/0x408
call_on_irq_stack+0x24/0x30
do_interrupt_handler+0x54/0x78
el1_interrupt+0x44/0x88
el1h_64_irq_handler+0x18/0x28
el1h_64_irq+0x84/0x88
stage2_free_walker+0x30/0xa0 (P)
__kvm_pgtable_walk+0x11c/0x258
__kvm_pgtable_walk+0x180/0x258
__kvm_pgtable_walk+0x180/0x258
__kvm_pgtable_walk+0x180/0x258
kvm_pgtable_walk+0xc4/0x140
kvm_pgtable_stage2_destroy+0x5c/0xf0
kvm_free_stage2_pgd+0x6c/0xe8
kvm_uninit_stage2_mmu+0x24/0x48
kvm_arch_flush_shadow_all+0x80/0xa0
kvm_mmu_notifier_release+0x38/0x78
__mmu_notifier_release+0x15c/0x250
exit_mmap+0x68/0x400
__mmput+0x38/0x1c8
mmput+0x30/0x68
exit_mm+0xd4/0x198
do_exit+0x1a4/0xb00
do_group_exit+0x8c/0x120
get_signal+0x6d4/0x778
do_signal+0x90/0x718
do_notify_resume+0x70/0x170
el0_svc+0x74/0xd8
el0t_64_sync_handler+0x60/0xc8
el0t_64_sync+0x1b0/0x1b8
The warning is seen majorly on the host kernels that are configured
not to force-preempt, such as CONFIG_PREEMPT_NONE=y. To avoid this,
instead of walking the entire page-table in one go, split it into
smaller ranges, by checking for cond_resched() between each range.
Since the path is executed during VM destruction, after the
page-table structure is unlinked from the KVM MMU, relying on
cond_resched_rwlock_write() isn't necessary.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250820162242.2624752-3-rananta@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Split kvm_pgtable_stage2_destroy() into two:
- kvm_pgtable_stage2_destroy_range(), that performs the
page-table walk and free the entries over a range of addresses.
- kvm_pgtable_stage2_destroy_pgd(), that frees the PGD.
This refactoring enables subsequent patches to free large page-tables
in chunks, calling cond_resched() between each chunk, to yield the
CPU as necessary.
Existing callers of kvm_pgtable_stage2_destroy(), that probably cannot
take advantage of this (such as nVMHE), will continue to function as is.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250820162242.2624752-2-rananta@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
When kernel lockdown is active, debugfs_locked_down() blocks access to
hypfs files that register ioctl callbacks, even if the ioctl interface
is not required for a function. This unnecessarily breaks userspace
tools that only rely on read operations.
Resolve this by registering a minimal set of file operations during
lockdown, avoiding ioctl registration and preserving access for affected
tooling.
Note that this change restores hypfs functionality when lockdown is
active from early boot (e.g. via lockdown=integrity kernel parameter),
but does not apply to scenarios where lockdown is enabled dynamically
while Linux is running.
Tested-by: Mete Durlu <meted@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Fixes: 5496197f9b08 ("debugfs: Restrict debugfs when the kernel is locked down")
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Currently, hypfs registers ioctl callbacks for all debugfs files,
despite only one file requiring them. This leads to unintended exposure
of unused interfaces to user space and can trigger side effects such as
restricted access when kernel lockdown is enabled.
Restrict ioctl registration to only those files that implement ioctl
functionality to avoid interface clutter and unnecessary access
restrictions.
Tested-by: Mete Durlu <meted@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Fixes: 5496197f9b08 ("debugfs: Restrict debugfs when the kernel is locked down")
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
The DTS code coding style expects exactly one space around '=' or '{'
characters.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250819131752.86948-2-krzysztof.kozlowski@linaro.org
Signed-off-by: Chen-Yu Tsai <wens@csie.org>
|
|
With pch_pic device, its register is based on MMIO address space,
different access size 1/2/4/8 is supported. And base address should
be naturally aligned with its access size, here add alignment check
in its register access emulation function.
Cc: stable@vger.kernel.org
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Since using kvm_get_vcpu() may fail to retrieve the vCPU context,
kvm_get_vcpu_by_id() should be used instead.
Fixes: 8e3054261bc3 ("LoongArch: KVM: Add IPI user mode read and write function")
Fixes: 3956a52bc05b ("LoongArch: KVM: Add EIOINTC read and write functions")
Reviewed-by: Yanteng Si <siyanteng@cqsoftware.com.cm>
Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Function kvm_io_bus_read() is called in function send_ipi_data(), buffer
size of parameter *val should be at least 8 bytes. Since some emulation
functions like loongarch_ipi_readl() and kvm_eiointc_read() will write
the buffer *val with 8 bytes signed extension regardless parameter len.
Otherwise there will be buffer overflow issue when CONFIG_STACKPROTECTOR
is enabled. The bug report is shown as follows:
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: send_ipi_data+0x194/0x1a0 [kvm]
CPU: 11 UID: 107 PID: 2692 Comm: CPU 0/KVM Not tainted 6.17.0-rc1+ #102 PREEMPT(full)
Stack : 9000000005901568 0000000000000000 9000000003af371c 900000013c68c000
900000013c68f850 900000013c68f858 0000000000000000 900000013c68f998
900000013c68f990 900000013c68f990 900000013c68f6c0 fffffffffffdb058
fffffffffffdb0e0 900000013c68f858 911e1d4d39cf0ec2 9000000105657a00
0000000000000001 fffffffffffffffe 0000000000000578 282049464555206e
6f73676e6f6f4c20 0000000000000001 00000000086b4000 0000000000000000
0000000000000000 0000000000000000 9000000005709968 90000000058f9000
900000013c68fa68 900000013c68fab4 90000000029279f0 900000010153f940
900000010001f360 0000000000000000 9000000003af3734 000000004390000c
00000000000000b0 0000000000000004 0000000000000000 0000000000071c1d
...
Call Trace:
[<9000000003af3734>] show_stack+0x5c/0x180
[<9000000003aed168>] dump_stack_lvl+0x6c/0x9c
[<9000000003ad0ab0>] vpanic+0x108/0x2c4
[<9000000003ad0ca8>] panic+0x3c/0x40
[<9000000004eb0a1c>] __stack_chk_fail+0x14/0x18
[<ffff8000023473f8>] send_ipi_data+0x190/0x1a0 [kvm]
[<ffff8000023313e4>] __kvm_io_bus_write+0xa4/0xe8 [kvm]
[<ffff80000233147c>] kvm_io_bus_write+0x54/0x90 [kvm]
[<ffff80000233f9f8>] kvm_emu_iocsr+0x180/0x310 [kvm]
[<ffff80000233fe08>] kvm_handle_gspr+0x280/0x478 [kvm]
[<ffff8000023443e8>] kvm_handle_exit+0xc0/0x130 [kvm]
Cc: stable@vger.kernel.org
Fixes: daee2f9cae551 ("LoongArch: KVM: Add IPI read and write function")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Add the flag KVM_LARCH_LBT checking in function kvm_own_lbt(), so that
it can be called safely rather than duplicated enabling again.
Cc: stable@vger.kernel.org
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Since the identity mapping is pinned to address zero the lowcore is always
also mapped to address zero, this happens regardless of the relocate_lowcore
command line option. If the option is specified the lowcore is mapped
twice, instead of only once.
This means that NULL pointer accesses will succeed instead of causing an
exception (low address protection still applies, but covers only parts).
To fix this never map the first two pages of physical memory with the
identity mapping.
Fixes: 32db401965f1 ("s390/mm: Pin identity mapping base to zero")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Similar to powerpc set HZ to 1000. See also commit a206d2334012
("powerpc/defconfigs: Set HZ=1000 on ppc64 and powernv defconfigs").
Besides other this will reduce the latency seen with synchronize_rcu().
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
The usual defconfig updates. Notable changes:
- Enable ZONE_DEVICE, and with that DEV_DAX, FS_DAX, and FUSE_DAX
- Enable CRYPTO_SELFTESTS_FULL for debug_defconfig
- Enable CRASH_DM_CRYPT
- Disable legacy IP_NF_FILTER
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
|
|
Commit 57fbad15c2eee772 ("stackleak: Rename STACKLEAK to KSTACK_ERASE")
misses the stackframe.h part for LoongArch, so fix it.
Fixes: 57fbad15c2eee772 ("stackleak: Rename STACKLEAK to KSTACK_ERASE")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Now if preemption happens between protected_save_fpu_context() and
protected_save_lbt_context(), FTOP context is lost. Because FTOP is
saved by protected_save_lbt_context() but protected_save_fpu_context()
disables TM before that. So save LBT before FPU in setup_sigcontext()
to avoid this potential risk.
Signed-off-by: Hanlu Li <lihanlu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
When enabling CONFIG_KASAN, CONFIG_PREEMPT_VOLUNTARY_BUILD and
CONFIG_PREEMPT_VOLUNTARY at the same time, there will be soft deadlock,
the relevant logs are as follows:
rcu: INFO: rcu_sched self-detected stall on CPU
...
Call Trace:
[<900000000024f9e4>] show_stack+0x5c/0x180
[<90000000002482f4>] dump_stack_lvl+0x94/0xbc
[<9000000000224544>] rcu_dump_cpu_stacks+0x1fc/0x280
[<900000000037ac80>] rcu_sched_clock_irq+0x720/0xf88
[<9000000000396c34>] update_process_times+0xb4/0x150
[<90000000003b2474>] tick_nohz_handler+0xf4/0x250
[<9000000000397e28>] __hrtimer_run_queues+0x1d0/0x428
[<9000000000399b2c>] hrtimer_interrupt+0x214/0x538
[<9000000000253634>] constant_timer_interrupt+0x64/0x80
[<9000000000349938>] __handle_irq_event_percpu+0x78/0x1a0
[<9000000000349a78>] handle_irq_event_percpu+0x18/0x88
[<9000000000354c00>] handle_percpu_irq+0x90/0xf0
[<9000000000348c74>] handle_irq_desc+0x94/0xb8
[<9000000001012b28>] handle_cpu_irq+0x68/0xa0
[<9000000001def8c0>] handle_loongarch_irq+0x30/0x48
[<9000000001def958>] do_vint+0x80/0xd0
[<9000000000268a0c>] kasan_mem_to_shadow.part.0+0x2c/0x2a0
[<90000000006344f4>] __asan_load8+0x4c/0x120
[<900000000025c0d0>] module_frob_arch_sections+0x5c8/0x6b8
[<90000000003895f0>] load_module+0x9e0/0x2958
[<900000000038b770>] __do_sys_init_module+0x208/0x2d0
[<9000000001df0c34>] do_syscall+0x94/0x190
[<900000000024d6fc>] handle_syscall+0xbc/0x158
After analysis, this is because the slow speed of loading the amdgpu
module leads to the long time occupation of the cpu and then the soft
deadlock.
When loading a module, module_frob_arch_sections() tries to figure out
the number of PLTs/GOTs that will be needed to handle all the RELAs. It
will call the count_max_entries() to find in an out-of-order date which
counting algorithm has O(n^2) complexity.
To make it faster, we sort the relocation list by info and addend. That
way, to check for a duplicate relocation, it just needs to compare with
the previous entry. This reduces the complexity of the algorithm to O(n
log n), as done in commit d4e0340919fb ("arm64/module: Optimize module
load time by optimizing PLT counting"). This gives sinificant reduction
in module load time for modules with large number of relocations.
After applying this patch, the soft deadlock problem has been solved,
and the kernel starts normally without "Call Trace".
Using the default configuration to test some modules, the results are as
follows:
Module Size
ip_tables 36K
fat 143K
radeon 2.5MB
amdgpu 16MB
Without this patch:
Module Module load time (ms) Count(PLTs/GOTs)
ip_tables 18 59/6
fat 0 162/14
radeon 54 1221/84
amdgpu 1411 4525/1098
With this patch:
Module Module load time (ms) Count(PLTs/GOTs)
ip_tables 18 59/6
fat 0 162/14
radeon 22 1221/84
amdgpu 45 4525/1098
Fixes: fcdfe9d22bed ("LoongArch: Add ELF and module support")
Signed-off-by: Kanglong Wang <wangkanglong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
When the CPU is offline, the timer of LoongArch is not correctly closed.
This is harmless for real machines, but resulting in an excessively high
cpu usage rate of the offline vCPU thread in the virtual machines.
To correctly close the timer, we have made the following modifications:
Register the cpu hotplug event (CPUHP_AP_LOONGARCH_ARCH_TIMER_STARTING)
for LoongArch. This event's hooks will be called to close the timer when
the CPU is offline.
Clear the timer interrupt when the timer is turned off. Since before the
timer is turned off, there may be a timer interrupt that has already been
in the pending state due to the interruption of the disabled, which also
affects the halt state of the offline vCPU.
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
The default COMMAND_LINE_SIZE of 512, inherited from asm-generic, is
too small for modern use cases. For example, kdump configurations or
extensive debugging parameters can easily exceed this limit.
Therefore, increase the command line size to 4096 bytes, aligning
LoongArch with the MIPS architecture. This change follows a broader
trend among architectures to raise this limit to support modern needs;
for instance, PowerPC increased its value for similar reasons in the
commit a5980d064fe2 ("powerpc: Bump COMMAND_LINE_SIZE to 2048").
Similar to the change made for RISC-V in the commit 61fc1ee8be26
("riscv: Bump COMMAND_LINE_SIZE value to 1024"), this is considered
a safe change. The broader kernel community has reached a consensus
that modifying COMMAND_LINE_SIZE from UAPI headers does not constitute
a uABI breakage, as well-behaved userspace applications should not
rely on this macro.
Suggested-by: Huang Cun <cunhuang@tencent.com>
Signed-off-by: Ming Wang <wangming01@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
When compiling with LLVM and CONFIG_LTO_CLANG is set, there exist many
objtool warnings "sibling call from callable instruction with modified
stack frame".
For this special case, the related object file shows that there is no
generated relocation section '.rela.discard.tablejump_annotate' for the
table jump instruction jirl, thus objtool can not know that what is the
actual destination address.
It needs to do something on the LLVM side to make sure that there is the
relocation section '.rela.discard.tablejump_annotate' if LTO is enabled,
but in order to maintain compatibility for the current LLVM compiler,
this can be done in the kernel Makefile for now. Ensure it is aware of
linker with LTO, '--loongarch-annotate-tablejump' needs to be passed via
'-mllvm' to ld.lld.
Note that it should also pass the compiler option -mannotate-tablejump
rather than only pass '-mllvm --loongarch-annotate-tablejump' to ld.lld
if LTO is enabled, otherwise there are no jump info for some table jump
instructions.
Fixes: e20ab7d454ee ("LoongArch: Enable jump table for objtool")
Closes: https://lore.kernel.org/loongarch/20250731175655.GA1455142@ax162/
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Co-developed-by: WANG Rui <wangrui@loongson.cn>
Signed-off-by: WANG Rui <wangrui@loongson.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
Fix shellcheck warning such as
"Double quote to prevent globbing and word splitting." and
Use $(...) notation instead of legacy backticks `...`.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250518044107.39928-4-maddy@linux.ibm.com
|
|
Fix "Double quote to prevent globbing and word splitting."
warning from shellcheck
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Stephen Rothwell <sfr@cab.auug.org.au>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250518044107.39928-3-maddy@linux.ibm.com
|
|
When compiling for pseries or powernv defconfig with "make C=1",
these warning were reported bu sparse tool in powerpc/kernel/kvm.c
arch/powerpc/kernel/kvm.c:635:9: warning: switch with no cases
arch/powerpc/kernel/kvm.c:646:9: warning: switch with no cases
Currently #ifdef were added after the switch case which are specific
for BOOKE and PPC_BOOK3S_32. These are not enabled in pseries/powernv
defconfig. Fix it by moving the #ifdef before switch(){}
Fixes: cbe487fac7fc0 ("KVM: PPC: Add mtsrin PV code")
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250518044107.39928-1-maddy@linux.ibm.com
|
|
There are two CONFIG_POWERPC64_CPU entries in the "CPU selection"
choice block.
I guess the intent is to display a different prompt depending on
CPU_LITTLE_ENDIAN: "Generic (POWER5 and PowerPC 970 and above)" for big
endian, and "Generic (POWER8 and above)" for little endian.
I stumbled on this tricky use case, and worked around it on Kconfig with
commit 4d46b5b623e0 ("kconfig: fix infinite loop in sym_calc_choice()").
However, I doubt that supporting multiple entries with the same symbol
in a choice block is worth the complexity - this is the only such case
in the kernel tree.
This commit merges the two entries. Once this cleanup is accepted in
the powerpc subsystem, I will proceed to refactor the Kconfig parser.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250607102005.1965409-1-masahiroy@kernel.org
|
|
The extra-y syntax is planned for deprecation because it is similar
to always-y.
When building the boot wrapper, always-y and extra-y are equivalent.
Use always-y instead.
In arch/powerpc/kernel/Makefile, I added ifdef KBUILD_BUILTIN to
keep the current behavior: prom_init_check is skipped when building
only modular objects.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250602163302.478765-1-masahiroy@kernel.org
|
|
Simplify the code to enhance readability and maintain a consistent
coding style.
Signed-off-by: Xichao Zhao <zhao.xichao@vivo.com>
Acked-by: Gautam Menghani <gautam@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250801035908.370463-1-zhao.xichao@vivo.com
|
|
irq_domain_create_simple() takes fwnode as the first argument. It can be
extracted from the struct device using dev_fwnode() helper instead of
using of_node with of_fwnode_handle().
So use the dev_fwnode() helper.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Acked-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/all/4bc0e1ca-a523-424a-8759-59e353317fba@kernel.org/
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250723062842.1831271-1-jirislaby@kernel.org
|
|
Until commit 6c85f52b10fd ("kvm/ppc: IRQ disabling cleanup"),
kvmppc_prepare_to_enter() was called with interrupts already disabled by
the caller, which was documented in the comment above the function.
Post-cleanup, the function is now called with interrupts enabled, and
disables interrupts itself.
Fix the comment to reflect the current behaviour.
Fixes: 6c85f52b10fd ("kvm/ppc: IRQ disabling cleanup")
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Amit Machhiwal <amachhiw@linux.ibm.com>
Reviewed-by: Gautam Menghani <gautam@linux.ibm.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
[Fixed the double colon in Reviewed-by line]
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250806055607.17081-1-ajd@linux.ibm.com
|
|
The current GDS mitigation logic incorrectly returns early when the
attack vector mitigation is turned off, which leads to two problems:
1. CPUs without ARCH_CAP_GDS_CTRL support are incorrectly marked with
GDS_MITIGATION_OFF when they should be marked as
GDS_MITIGATION_UCODE_NEEDED.
2. The mitigation state checks and locking verification that follow are
skipped, which means:
- fail to detect if the mitigation was locked
- miss the warning when trying to disable a locked mitigation
Remove the early return to ensure proper mitigation state handling. This
allows:
- Proper mitigation classification for non-ARCH_CAP_GDS_CTRL CPUs
- Complete mitigation state verification
This also addresses the failed MSR 0x123 write attempt at boot on
non-ARCH_CAP_GDS_CTRL CPUs:
unchecked MSR access error: WRMSR to 0x123 (tried to write 0x0000000000000010) at rIP: ... (update_gds_msr)
Call Trace:
identify_secondary_cpu
start_secondary
common_startup_64
WARNING: CPU: 1 PID: 0 at arch/x86/kernel/cpu/bugs.c:1053 update_gds_msr
[ bp: Massage, zap superfluous braces. ]
Fixes: 8c7261abcb7ad ("x86/bugs: Add attack vector controls for GDS")
Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/20250819023356.2012-1-lirongqing@baidu.com
|
|
Drop KVM's check on an empty cpumask when flushing caches when memory is
being reclaimed from an SEV VM, as smp_call_function_many_cond() naturally
(and correctly) handles an empty cpumask. This avoids an extra O(n)
lookup in the common case where at least one pCPU has enterred the guest,
which could be noticeable in some setups, e.g. if a small VM is pinned to
the last few pCPUs in the system.
Fixes: 6f38f8c57464 ("KVM: SVM: Flush cache only on CPUs running SEV guest")
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
[sean: rewrite changelog to capture performance angle]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The reset reason value may be "all bits set", e.g. 0xFFFFFFFF. This is a
commonly used error response from hardware. This may occur due to a real
hardware issue or when running in a VM.
The user will see all reset reasons reported in this case.
Check for an error response value and return early to avoid decoding
invalid data.
Also, adjust the data variable type to match the hardware register size.
Fixes: ab8131028710 ("x86/CPU/AMD: Print the reason for the last reset")
Reported-by: Libing He <libhe@redhat.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250721181155.3536023-1-yazen.ghannam@amd.com
|
|
__ADDRESSABLE_ASM_STR() is where the necessary stringification happens.
As long as "sym" doesn't contain any odd characters, no quoting is
required for its use with .quad / .long. In fact the quotation gets in
the way with gas 2.25; it's only from 2.26 onwards that quoted symbols
are half-way properly supported.
However, assembly being different from C anyway, drop
__ADDRESSABLE_ASM_STR() and its helper macro altogether. A simple
.global directive will suffice to get the symbol "declared", i.e. into
the symbol table. While there also stop open-coding STATIC_CALL_TRAMP()
and STATIC_CALL_KEY().
Fixes: 0ef8047b737d ("x86/static-call: provide a way to do very early static-call updates")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <609d2c74-de13-4fae-ab1a-1ec44afb948d@suse.com>
|
|
Since
923f3a2b48bd ("x86/resctrl: Query LLC monitoring properties once during boot")
resctrl_cpu_detect() has been moved from common CPU initialization code to
the vendor-specific BSP init helper, while Hygon didn't put that call in their
code.
This triggers a division by zero fault during early booting stage on our
machines with X86_FEATURE_CQM* supported, where get_rdt_mon_resources() tries
to calculate mon_l3_config with uninitialized boot_cpu_data.x86_cache_occ_scale.
Add the missing resctrl_cpu_detect() in the Hygon BSP init helper.
[ bp: Massage commit message. ]
Fixes: 923f3a2b48bd ("x86/resctrl: Query LLC monitoring properties once during boot")
Signed-off-by: Tianxiang Peng <txpeng@tencent.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Hui Li <caelli@tencent.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/20250623093153.3016937-1-txpeng@tencent.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Remove a transitional asm/cpuid.h header which was added only as a
fallback during cpuid helpers reorg
- Initialize reserved fields in the SVSM page validation calls
structure to zero in order to allow for future structure extensions
- Have the sev-guest driver's buffers used in encryption operations be
in linear mapping space as the encryption operation can be offloaded
to an accelerator
- Have a read-only MSR write when in an AMD SNP guest trap to the
hypervisor as it is usually done. This makes the guest user
experience better by simply raising a #GP instead of terminating said
guest
- Do not output AVX512 elapsed time for kernel threads because the data
is wrong and fix a NULL pointer dereferencing in the process
- Adjust the SRSO mitigation selection to the new attack vectors
* tag 'x86_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpuid: Remove transitional <asm/cpuid.h> header
x86/sev: Ensure SVSM reserved fields in a page validation entry are initialized to zero
virt: sev-guest: Satisfy linear mapping requirement in get_derived_key()
x86/sev: Improve handling of writes to intercepted TSC MSRs
x86/fpu: Fix NULL dereference in avx512_status()
x86/bugs: Select best SRSO mitigation
|
|
vcpu_write_sys_reg()'s signature is not totally obvious, and it
is rather easy to write something that looks correct, except that...
Oh wait...
Swap addr and FAR_EL2 to restore some sanity in the nested SEA
department.
Fixes: 9aba641b9ec2a ("KVM: arm64: nv: Respect exception routing rules for SEAs")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250813163747.2591317-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
min and dest_id are guest-controlled indices. Using array_index_nospec()
after the bounds checks clamps these values to mitigate speculative execution
side-channels.
Signed-off-by: Thijs Raymakers <thijs@raymakers.nl>
Cc: stable@vger.kernel.org
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 715062970f37 ("KVM: X86: Implement PV sched yield hypercall")
Fixes: bdf7ffc89922 ("KVM: LAPIC: Fix pv ipis out-of-bounds access")
Fixes: 4180bf1b655a ("KVM: X86: Implement "send IPI" hypercall")
Link: https://lore.kernel.org/r/20250804064405.4802-1-thijs@raymakers.nl
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
All CPUID call sites were updated at commit:
968e30006807 ("x86/cpuid: Set <asm/cpuid/api.h> as the main CPUID header")
to include <asm/cpuid/api.h> instead of <asm/cpuid.h>.
The <asm/cpuid.h> header was still retained as a wrapper, just in case
some new code in -next started using it. Now that everything is merged
to Linus' tree, remove the header.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250815070227.19981-2-darwi@linutronix.de
|
|
initialized to zero
In order to support future versions of the SVSM_CORE_PVALIDATE call, all
reserved fields within a PVALIDATE entry must be set to zero as an SVSM should
be ensuring all reserved fields are zero in order to support future usage of
reserved areas based on the protocol version.
Fixes: fcd042e86422 ("x86/sev: Perform PVALIDATE using the SVSM when not at VMPL0")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/7cde412f8b057ea13a646fb166b1ca023f6a5031.1755098819.git.thomas.lendacky@amd.com
|
|
Cross-thread attacks are generally harder as they require the victim to be
co-located on a core. However, with VMSCAPE the adversary targets belong to
the same guest execution, that are more likely to get co-located. In
particular, a thread that is currently executing userspace hypervisor
(after the IBPB) may still be targeted by a guest execution from a sibling
thread.
Issue a warning about the potential risk, except when:
- SMT is disabled
- STIBP is enabled system-wide
- Intel eIBRS is enabled (which implies STIBP protection)
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
|
|
cpu_bugs_smt_update() uses global variables from different mitigations. For
SMT updates it can't currently use vmscape_mitigation that is defined after
it.
Since cpu_bugs_smt_update() depends on many other mitigations, move it
after all mitigations are defined. With that, it can use vmscape_mitigation
in a moment.
No functional change.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
|
|
Enable the previously added mitigation for VMscape. Add the cmdline
vmscape={off|ibpb|force} and sysfs reporting.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
|
|
VMSCAPE is a vulnerability that exploits insufficient branch predictor
isolation between a guest and a userspace hypervisor (like QEMU). Existing
mitigations already protect kernel/KVM from a malicious guest. Userspace
can additionally be protected by flushing the branch predictors after a
VMexit.
Since it is the userspace that consumes the poisoned branch predictors,
conditionally issue an IBPB after a VMexit and before returning to
userspace. Workloads that frequently switch between hypervisor and
userspace will incur the most overhead from the new IBPB.
This new IBPB is not integrated with the existing IBPB sites. For
instance, a task can use the existing speculation control prctl() to
get an IBPB at context switch time. With this implementation, the
IBPB is doubled up: one at context switch and another before running
userspace.
The intent is to integrate and optimize these cases post-embargo.
[ dhansen: elaborate on suboptimal IBPB solution ]
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
|
|
The VMSCAPE vulnerability may allow a guest to cause Branch Target
Injection (BTI) in userspace hypervisors.
Kernels (both host and guest) have existing defenses against direct BTI
attacks from guests. There are also inter-process BTI mitigations which
prevent processes from attacking each other. However, the threat in this
case is to a userspace hypervisor within the same process as the attacker.
Userspace hypervisors have access to their own sensitive data like disk
encryption keys and also typically have access to all guest data. This
means guest userspace may use the hypervisor as a confused deputy to attack
sensitive guest kernel data. There are no existing mitigations for these
attacks.
Introduce X86_BUG_VMSCAPE for this vulnerability and set it on affected
Intel and AMD CPUs.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from Netfilter and IPsec.
Current release - regressions:
- netfilter: nft_set_pipapo:
- don't return bogus extension pointer
- fix null deref for empty set
Current release - new code bugs:
- core: prevent deadlocks when enabling NAPIs with mixed kthread
config
- eth: netdevsim: Fix wild pointer access in nsim_queue_free().
Previous releases - regressions:
- page_pool: allow enabling recycling late, fix false positive
warning
- sched: ets: use old 'nbands' while purging unused classes
- xfrm:
- restore GSO for SW crypto
- bring back device check in validate_xmit_xfrm
- tls: handle data disappearing from under the TLS ULP
- ptp: prevent possible ABBA deadlock in ptp_clock_freerun()
- eth:
- bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE
- hv_netvsc: fix panic during namespace deletion with VF
Previous releases - always broken:
- netfilter: fix refcount leak on table dump
- vsock: do not allow binding to VMADDR_PORT_ANY
- sctp: linearize cloned gso packets in sctp_rcv
- eth:
- hibmcge: fix the division by zero issue
- microchip: fix KSZ8863 reset problem"
* tag 'net-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
net: usb: asix_devices: add phy_mask for ax88772 mdio bus
net: kcm: Fix race condition in kcm_unattach()
selftests: net/forwarding: test purge of active DWRR classes
net/sched: ets: use old 'nbands' while purging unused classes
bnxt: fill data page pool with frags if PAGE_SIZE > BNXT_RX_PAGE_SIZE
netdevsim: Fix wild pointer access in nsim_queue_free().
net: mctp: Fix bad kfree_skb in bind lookup test
netfilter: nf_tables: reject duplicate device on updates
ipvs: Fix estimator kthreads preferred affinity
netfilter: nft_set_pipapo: fix null deref for empty set
selftests: tls: test TCP stealing data from under the TLS socket
tls: handle data disappearing from under the TLS ULP
ptp: prevent possible ABBA deadlock in ptp_clock_freerun()
ixgbe: prevent from unwanted interface name changes
devlink: let driver opt out of automatic phys_port_name generation
net: prevent deadlocks when enabling NAPIs with mixed kthread config
net: update NAPI threaded config even for disabled NAPIs
selftests: drv-net: don't assume device has only 2 queues
docs: Fix name for net.ipv4.udp_child_hash_entries
riscv: dts: thead: Add APB clocks for TH1520 GMACs
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
- Add a mitigation for a cache coherency vulnerability when running an
SNP guest which makes sure all cache lines belonging to a 4K page are
evicted after latter has been converted to a guest-private page
[ SNP: Secure Nested Paging - not to be confused with Single Nucleotide
Polymorphism, which is the more common use of that TLA. I am on a
mission to write out the more obscure TLAs in order to keep track of
them.
Because while math tells us that there are only about 17k different
combinations of three-letter acronyms using English letters (26^3), I
am convinced that somehow Intel, AMD and ARM have together figured out
new mathematics, and have at least a million different TLAs that they
use. - Linus ]
* tag 'snp_cache_coherency' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Evict cache lines during SNP memory validation
|
|
Describe perisys-apb4-hclk as the APB clock for TH1520 SoC, which is
essential for accessing GMAC glue registers.
Fixes: 7e756671a664 ("riscv: dts: thead: Add TH1520 ethernet nodes")
Signed-off-by: Yao Zi <ziyao@disroot.org>
Reviewed-by: Drew Fustini <fustini@kernel.org>
Tested-by: Drew Fustini <fustini@kernel.org>
Link: https://patch.msgid.link/20250808093655.48074-5-ziyao@disroot.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently, when a Secure TSC enabled SNP guest attempts to write to the
intercepted GUEST_TSC_FREQ MSR (a read-only MSR), the guest kernel response
incorrectly implies a VMM configuration error, when in fact it is the usual
VMM configuration to intercept writes to read-only MSRs, unless explicitly
documented.
Modify the intercepted TSC MSR #VC handling:
* Write to GUEST_TSC_FREQ will generate a #GP instead of terminating the
guest
* Write to MSR_IA32_TSC will generate a #GP instead of silently ignoring it
However, continue to terminate the guest when reading from intercepted
GUEST_TSC_FREQ MSR with Secure TSC enabled, as intercepted reads indicate an
improper VMM configuration for Secure TSC enabled SNP guests.
[ bp: simplify comment. ]
Fixes: 38cc6495cdec ("x86/sev: Prevent GUEST_TSC_FREQ MSR interception for Secure TSC enabled guests")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/20250722074853.22253-1-nikunj@amd.com
|
|
Problem
-------
With CONFIG_X86_DEBUG_FPU enabled, reading /proc/[kthread]/arch_status
causes a warning and a NULL pointer dereference.
This is because the AVX-512 timestamp code uses x86_task_fpu() but
doesn't check it for NULL. CONFIG_X86_DEBUG_FPU addles that function
for kernel threads (PF_KTHREAD specifically), making it return NULL.
The point of the warning was to ensure that kernel threads only access
task->fpu after going through kernel_fpu_begin()/_end(). Note: all
kernel tasks exposed in /proc have a valid task->fpu.
Solution
--------
One option is to silence the warning and check for NULL from
x86_task_fpu(). However, that warning is fairly fresh and seems like a
defense against misuse of the FPU state in kernel threads.
Instead, stop outputting AVX-512_elapsed_ms for kernel threads
altogether. The data was garbage anyway because avx512_timestamp is
only updated for user threads, not kernel threads.
If anyone ever wants to track kernel thread AVX-512 use, they can come
back later and do it properly, separate from this bug fix.
[ dhansen: mostly rewrite changelog ]
Fixes: 22aafe3bcb67 ("x86/fpu: Remove init_task FPU state dependencies, add debugging warning for PF_KTHREAD tasks")
Co-developed-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250811185044.2227268-1-sohil.mehta%40intel.com
|
|
The SRSO bug can theoretically be used to conduct user->user or guest->guest
attacks and requires a mitigation (namely IBPB instead of SBPB on context
switch) for these. So mark SRSO as being applicable to the user->user and
guest->guest attack vectors.
Additionally, SRSO supports multiple mitigations which mitigate different
potential attack vectors. Some CPUs are also immune to SRSO from
certain attack vectors (like user->kernel).
Use the specific attack vectors requiring mitigation to select the best
SRSO mitigation to avoid unnecessary performance hits.
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250721160310.1804203-1-david.kaplan@amd.com
|
|
As described in the pinebookpro_v2.1_mainboard_schematic.pdf page 10,
he SPI Flash's VCC connector is connected to VCC_3V0 power source.
This fixes the following warning:
spi-nor spi1.0: supply vcc not found, using dummy regulator
Fixes: 5a65505a69884 ("arm64: dts: rockchip: Add initial support for Pinebook Pro")
Signed-off-by: Peter Robinson <pbrobinson@gmail.com>
Reviewed-by: Dragan Simic <dsimic@manjaro.org>
Link: https://lore.kernel.org/r/20250730102129.224468-1-pbrobinson@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
|
|
The eeprom on the Radxa E52C SBC contains manufacturer data
such as the mac address, so it should be marked as read-only.
Fixes: 9be4171219b6 ("arm64: dts: rockchip: Add Radxa E52C")
Signed-off-by: Chukun Pan <amadeus@jmu.edu.cn>
Link: https://lore.kernel.org/r/20250810100020.445053-2-amadeus@jmu.edu.cn
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Fix an interrupt vector setup race which leads to a non-functioning
device
- Add new Intel CPU models *and* a family: 0x12. Finally. Yippie! :-)
* tag 'x86_urgent_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Plug vector setup race
x86/cpu: Add new Intel CPU model numbers for Wildcatlake and Novalake
|