summaryrefslogtreecommitdiff
path: root/arch/s390/kernel
AgeCommit message (Collapse)Author
2025-05-18s390/entry: Fix last breaking event handling in case of stack corruptionHeiko Carstens
[ Upstream commit ae952eea6f4a7e2193f8721a5366049946e012e7 ] In case of stack corruption stack_invalid() is called and the expectation is that register r10 contains the last breaking event address. This dependency is quite subtle and broke a couple of years ago without that anybody noticed. Fix this by getting rid of the dependency and read the last breaking event address from lowcore. Fixes: 56e62a737028 ("s390: convert to generic entry") Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-20s390/cpumf: Fix double free on error in cpumf_pmu_event_init()Thomas Richter
commit aa1ac98268cd1f380c713f07e39b1fa1d5c7650c upstream. In PMU event initialization functions - cpumsf_pmu_event_init() - cpumf_pmu_event_init() - cfdiag_event_init() the partially created event had to be removed when an error was detected. The event::event_init() member function had to release all resources it allocated in case of error. event::destroy() had to be called on freeing an event after it was successfully created and event::event_init() returned success. With commit c70ca298036c ("perf/core: Simplify the perf_event_alloc() error path") this is not necessary anymore. The performance subsystem common code now always calls event::destroy() to clean up the allocated resources created during event initialization. Remove the event::destroy() invocation in PMU event initialization or that function is called twice for each event that runs into an error condition in event creation. This is the kernel log entry which shows up without the fix: ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 43388 at lib/refcount.c:87 refcount_dec_not_one+0x74/0x90 CPU: 0 UID: 0 PID: 43388 Comm: perf Not tainted 6.15.0-20250407.rc1.git0.300.fc41.s390x+git #1 NONE Hardware name: IBM 3931 A01 704 (LPAR) Krnl PSW : 0704c00180000000 00000209cb2c1b88 (refcount_dec_not_one+0x78/0x90) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 Krnl GPRS: 0000020900000027 0000020900000023 0000000000000026 0000018900000000 00000004a2200a00 0000000000000000 0000000000000057 ffffffffffffffea 00000002b386c600 00000002b3f5b3e0 00000209cc51f140 00000209cc7fc550 0000000001449d38 ffffffffffffffff 00000209cb2c1b84 00000189d67dfb80 Krnl Code: 00000209cb2c1b78: c02000506727 larl %r2,00000209cbcce9c6 00000209cb2c1b7e: c0e5ffbd4431 brasl %r14,00000209caa6a3e0 #00000209cb2c1b84: af000000 mc 0,0 >00000209cb2c1b88: a7480001 lhi %r4,1 00000209cb2c1b8c: ebeff0a00004 lmg %r14,%r15,160(%r15) 00000209cb2c1b92: ec243fbf0055 risbg %r2,%r4,63,191,0 00000209cb2c1b98: 07fe bcr 15,%r14 00000209cb2c1b9a: 47000700 bc 0,1792 Call Trace: [<00000209cb2c1b88>] refcount_dec_not_one+0x78/0x90 [<00000209cb2c1dc4>] refcount_dec_and_mutex_lock+0x24/0x90 [<00000209caa3c29e>] hw_perf_event_destroy+0x2e/0x80 [<00000209cacaf8b4>] __free_event+0x74/0x270 [<00000209cacb47c4>] perf_event_alloc.part.0+0x4a4/0x730 [<00000209cacbf3e8>] __do_sys_perf_event_open+0x248/0xc20 [<00000209cacc14a4>] __s390x_sys_perf_event_open+0x44/0x50 [<00000209cb8114de>] __do_syscall+0x12e/0x260 [<00000209cb81ce34>] system_call+0x74/0x98 Last Breaking-Event-Address: [<00000209caa6a4d2>] __warn_printk+0xf2/0x100 ---[ end trace 0000000000000000 ]--- Fixes: c70ca298036c ("perf/core: Simplify the perf_event_alloc() error path") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10s390/entry: Fix setting _CIF_MCCK_GUEST with lowcore relocationSven Schnelle
[ Upstream commit 121df45b37a1016ee6828c2ca3ba825f3e18a8c1 ] When lowcore relocation is enabled, the machine check handler doesn't use the lowcore address when setting _CIF_MCCK_GUEST. Fix this by adding the missing base register. Fixes: 0001b7bbc53a ("s390/entry: Make mchk_int_handler() ready for lowcore relocation") Reported-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13s390/traps: Fix test_monitor_call() inline assemblyHeiko Carstens
commit 5623bc23a1cb9f9a9470fa73b3a20321dc4c4870 upstream. The test_monitor_call() inline assembly uses the xgr instruction, which also modifies the condition code, to clear a register. However the clobber list of the inline assembly does not specify that the condition code is modified, which may lead to incorrect code generation. Use the lhi instruction instead to clear the register without that the condition code is modified. Furthermore this limits clearing to the lower 32 bits of val, since its type is int. Fixes: 17248ea03674 ("s390: fix __EMIT_BUG() macro") Cc: stable@vger.kernel.org Reviewed-by: Juergen Christ <jchrist@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-17s390/fpu: Add fpc exception handler / remove fixup section againHeiko Carstens
commit ae02615b7fcea9ce9a4ec40b3c5b5dafd322b179 upstream. The fixup section was added again by mistake when test_fp_ctl() was removed. The reason for the removal of the fixup section is described in commit 484a8ed8b7d1 ("s390/extable: add dedicated uaccess handler"). Remove it again for the same reason. Add an exception handler which handles exceptions when the floating point control register is attempted to be set to invalid values. The exception handler sets the floating point control register to zero and continues execution at the specified address. The new sfpc inline assembly is open-coded to make back porting a bit easier. Fixes: 702644249d3e ("s390/fpu: get rid of test_fp_ctl()") Cc: stable@vger.kernel.org Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-08s390/sclp: Initialize sclp subsystem via arch_cpu_finalize_init()Heiko Carstens
[ Upstream commit 3bcc8a1af581af152d43e42b53db3534018301b5 ] With the switch to GENERIC_CPU_DEVICES an early call to the sclp subsystem was added to smp_prepare_cpus(). This will usually succeed since the sclp subsystem is implicitly initialized early enough if an sclp based console is present. If no such console is present the initialization happens with an arch_initcall(); in such cases calls to the sclp subsystem will fail. For CPU detection this means that the fallback sigp loop will be used permanently to detect CPUs instead of the preferred READ_CPU_INFO sclp request. Fix this by adding an explicit early sclp_init() call via arch_cpu_finalize_init(). Reported-by: Sheshu Ramanandan <sheshu.ramanandan@ibm.com> Fixes: 4a39f12e753d ("s390/smp: Switch to GENERIC_CPU_DEVICES") Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-02-08perf/core: Save raw sample data conditionally based on sample typeYabin Cui
[ Upstream commit b9c44b91476b67327a521568a854babecc4070ab ] Currently, space for raw sample data is always allocated within sample records for both BPF output and tracepoint events. This leads to unused space in sample records when raw sample data is not requested. This patch enforces checking sample type of an event in perf_sample_save_raw_data(). So raw sample data will only be saved if explicitly requested, reducing overhead when it is not needed. Fixes: 0a9081cf0a11 ("perf/core: Add perf_sample_save_raw_data() helper") Signed-off-by: Yabin Cui <yabinc@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240515193610.2350456-2-yabinc@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-27s390/ipl: Fix never less than zero warningAlexander Gordeev
[ Upstream commit 5fa49dd8e521a42379e5e41fcf2c92edaaec0a8b ] DEFINE_IPL_ATTR_STR_RW() macro produces "unsigned 'len' is never less than zero." warning when sys_vmcmd_on_*_store() callbacks are defined. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202412081614.5uel8F6W-lkp@intel.com/ Fixes: 247576bf624a ("s390/ipl: Do not accept z/VM CP diag X'008' cmds longer than max length") Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-14s390/cpum_sf: Handle CPU hotplug remove during samplingThomas Richter
[ Upstream commit a0bd7dacbd51c632b8e2c0500b479af564afadf3 ] CPU hotplug remove handling triggers the following function call sequence: CPUHP_AP_PERF_S390_SF_ONLINE --> s390_pmu_sf_offline_cpu() ... CPUHP_AP_PERF_ONLINE --> perf_event_exit_cpu() The s390 CPUMF sampling CPU hotplug handler invokes: s390_pmu_sf_offline_cpu() +--> cpusf_pmu_setup() +--> setup_pmc_cpu() +--> deallocate_buffers() This function de-allocates all sampling data buffers (SDBs) allocated for that CPU at event initialization. It also clears the PMU_F_RESERVED bit. The CPU is gone and can not be sampled. With the event still being active on the removed CPU, the CPU event hotplug support in kernel performance subsystem triggers the following function calls on the removed CPU: perf_event_exit_cpu() +--> perf_event_exit_cpu_context() +--> __perf_event_exit_context() +--> __perf_remove_from_context() +--> event_sched_out() +--> cpumsf_pmu_del() +--> cpumsf_pmu_stop() +--> hw_perf_event_update() to stop and remove the event. During removal of the event, the sampling device driver tries to read out the remaining samples from the sample data buffers (SDBs). But they have already been freed (and may have been re-assigned). This may lead to a use after free situation in which case the samples are most likely invalid. In the best case the memory has not been reassigned and still contains valid data. Remedy this situation and check if the CPU is still in reserved state (bit PMU_F_RESERVED set). In this case the SDBs have not been released an contain valid data. This is always the case when the event is removed (and no CPU hotplug off occured). If the PMU_F_RESERVED bit is not set, the SDB buffers are gone. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-09s390/stacktrace: Use break instead of return statementHeiko Carstens
commit 588a9836a4ef7ec3bfcffda526dfa399637e6cfc upstream. arch_stack_walk_user_common() contains a return statement instead of a break statement in case store_ip() fails while trying to store a callchain entry of a user space process. This may lead to a missing pagefault_enable() call. If this happens any subsequent page fault of the process won't be resolved by the page fault handler and this in turn will lead to the process being killed. Use a break instead of a return statement to fix this. Fixes: ebd912ff9919 ("s390/stacktrace: Merge perf_callchain_user() and arch_stack_walk_user()") Cc: stable@vger.kernel.org Reviewed-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-09s390/entry: Mark IRQ entries to fix stack depot warningsVasily Gorbik
commit 45c9f2b856a075a34873d00788d2e8a250c1effd upstream. The stack depot filters out everything outside of the top interrupt context as an uninteresting or irrelevant part of the stack traces. This helps with stack trace de-duplication, avoiding an explosion of saved stack traces that share the same IRQ context code path but originate from different randomly interrupted points, eventually exhausting the stack depot. Filtering uses in_irqentry_text() to identify functions within the .irqentry.text and .softirqentry.text sections, which then become the last stack trace entries being saved. While __do_softirq() is placed into the .softirqentry.text section by common code, populating .irqentry.text is architecture-specific. Currently, the .irqentry.text section on s390 is empty, which prevents stack depot filtering and de-duplication and could result in warnings like: Stack depot reached limit capacity WARNING: CPU: 0 PID: 286113 at lib/stackdepot.c:252 depot_alloc_stack+0x39a/0x3c8 with PREEMPT and KASAN enabled. Fix this by moving the IO/EXT interrupt handlers from .kprobes.text into the .irqentry.text section and updating the kprobes blacklist to include the .irqentry.text section. This is done only for asynchronous interrupts and explicitly not for program checks, which are synchronous and where the context beyond the program check is important to preserve. Despite machine checks being somewhat in between, they are extremely rare, and preserving context when possible is also of value. SVCs and Restart Interrupts are not relevant, one being always at the boundary to user space and the other being a one-time thing. IRQ entries filtering is also optionally used in ftrace function graph, where the same logic applies. Cc: stable@vger.kernel.org # 5.15+ Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-12-05s390/syscalls: Avoid creation of arch/arch/ directoryMasahiro Yamada
[ Upstream commit 0708967e2d56e370231fd07defa0d69f9ad125e8 ] Building the kernel with ARCH=s390 creates a weird arch/arch/ directory. $ find arch/arch arch/arch arch/arch/s390 arch/arch/s390/include arch/arch/s390/include/generated arch/arch/s390/include/generated/asm arch/arch/s390/include/generated/uapi arch/arch/s390/include/generated/uapi/asm The root cause is 'targets' in arch/s390/kernel/syscalls/Makefile, where the relative path is incorrect. Strictly speaking, 'targets' was not necessary in the first place because this Makefile uses 'filechk' instead of 'if_changed'. However, this commit keeps it, as it will be useful when converting 'filechk' to 'if_changed' later. Fixes: 5c75824d915e ("s390/syscalls: add Makefile to generate system call header files") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Link: https://lore.kernel.org/r/20241111134603.2063226-1-masahiroy@kernel.org Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-12-05s390/cpum_sf: Fix and protect memory allocation of SDBs with mutexThomas Richter
[ Upstream commit f55bd479d8663a4a4e403b3d308d3d1aa33d92df ] Reservation of the PMU hardware is done at first event creation and is protected by a pair of mutex_lock() and mutex_unlock(). After reservation of the PMU hardware the memory required for the PMUs the event is to be installed on is allocated by allocate_buffers() and alloc_sampling_buffer(). This done outside of the mutex protection. Without mutex protection two or more concurrent invocations of perf_event_init() may run in parallel. This can lead to allocation of Sample Data Blocks (SDBs) multiple times for the same PMU. Prevent this and protect memory allocation of SDBs by mutex. Fixes: 8a6fe8f21ec4 ("s390/cpum_sf: Use refcount_t instead of atomic_t") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-09-28Merge tag 's390-6.12-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Vasily Gorbik: - Clean up and improve vdso code: use SYM_* macros for function and data annotations, add CFI annotations to fix GDB unwinding, optimize the chacha20 implementation - Add vfio-ap driver feature advertisement for use by libvirt and mdevctl * tag 's390-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/vfio-ap: Driver feature advertisement s390/vdso: Use one large alternative instead of an alternative branch s390/vdso: Use SYM_DATA_START_LOCAL()/SYM_DATA_END() for data objects tools: Add additional SYM_*() stubs to linkage.h s390/vdso: Use macros for annotation of asm functions s390/vdso: Add CFI annotations to __arch_chacha20_blocks_nostack() s390/vdso: Fix comment within __arch_chacha20_blocks_nostack() s390/vdso: Get rid of permutation constants
2024-09-27[tree-wide] finally take no_llseek outAl Viro
no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-26Merge tag 'asm-generic-6.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic updates from Arnd Bergmann: "These are only two small patches, one cleanup for arch/alpha and a preparation patch cleaning up the handling of runtime constants in the linker scripts" * tag 'asm-generic-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: runtime constants: move list of constants to vmlinux.lds.h alpha: no need to include asm/xchg.h twice
2024-09-23s390/vdso: Use one large alternative instead of an alternative branchHeiko Carstens
Replace the alternative branch with a larger alternative that contains both paths. That way the two paths are closer together and it is easier to change both paths if the need should arise. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-23s390/vdso: Use SYM_DATA_START_LOCAL()/SYM_DATA_END() for data objectsHeiko Carstens
Use SYM_DATA_START_LOCAL()/SYM_DATA_END() in vgetrandom-chacha.S so that the constants end up in an object with correct size: readelf -Ws vgetrandom-chacha.o Num: Value Size Type Bind Vis Ndx Name ... 5: 0000000000000000 32 OBJECT LOCAL DEFAULT 5 chacha20_constants Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-23s390/vdso: Use macros for annotation of asm functionsJens Remus
Use the macros SYM_FUNC_START() and SYM_FUNC_END() to annotate the functions in assembly. Signed-off-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-23s390/vdso: Add CFI annotations to __arch_chacha20_blocks_nostack()Jens Remus
This allows proper unwinding, for instance when using a debugger such as GDB. Signed-off-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-23s390/vdso: Fix comment within __arch_chacha20_blocks_nostack()Heiko Carstens
Fix comment within __arch_chacha20_blocks_nostack() so the comment reflects what the code is doing. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-23s390/vdso: Get rid of permutation constantsHeiko Carstens
The three byte masks for VECTOR PERMUTE are not needed, since the instruction VECTOR SHIFT LEFT DOUBLE BY BYTE can be used to implement the required "rotate left" easily. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-21Merge tag 's390-6.12-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Vasily Gorbik: - Optimize ftrace and kprobes code patching and avoid stop machine for kprobes if sequential instruction fetching facility is available - Add hiperdispatch feature to dynamically adjust CPU capacity in vertical polarization to improve scheduling efficiency and overall performance. Also add infrastructure for handling warning track interrupts (WTI), allowing for graceful CPU preemption - Rework crypto code pkey module and split it into separate, independent modules for sysfs, PCKMO, CCA, and EP11, allowing modules to load only when the relevant hardware is available - Add hardware acceleration for HMAC modes and the full AES-XTS cipher, utilizing message-security assist extensions (MSA) 10 and 11. It introduces new shash implementations for HMAC-SHA224/256/384/512 and registers the hardware-accelerated AES-XTS cipher as the preferred option. Also add clear key token support - Add MSA 10 and 11 processor activity instrumentation counters to perf and update PAI Extension 1 NNPA counters - Cleanup cpu sampling facility code and rework debug/WARN_ON_ONCE statements - Add support for SHA3 performance enhancements introduced with MSA 12 - Add support for the query authentication information feature of MSA 13 and introduce the KDSA CPACF instruction. Provide query and query authentication information in sysfs, enabling tools like cpacfinfo to present this data in a human-readable form - Update kernel disassembler instructions - Always enable EXPOLINE_EXTERN if supported by the compiler to ensure kpatch compatibility - Add missing warning handling and relocated lowcore support to the early program check handler - Optimize ftrace_return_address() and avoid calling unwinder - Make modules use kernel ftrace trampolines - Strip relocs from the final vmlinux ELF file to make it roughly 2 times smaller - Dump register contents and call trace for early crashes to the console - Generate ptdump address marker array dynamically - Fix rcu_sched stalls that might occur when adding or removing large amounts of pages at once to or from the CMM balloon - Fix deadlock caused by recursive lock of the AP bus scan mutex - Unify sync and async register save areas in entry code - Cleanup debug prints in crypto code - Various cleanup and sanitizing patches for the decompressor - Various small ftrace cleanups * tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (84 commits) s390/crypto: Display Query and Query Authentication Information in sysfs s390/crypto: Add Support for Query Authentication Information s390/crypto: Rework RRE and RRF CPACF inline functions s390/crypto: Add KDSA CPACF Instruction s390/disassembler: Remove duplicate instruction format RSY_RDRU s390/boot: Move boot_printk() code to own file s390/boot: Use boot_printk() instead of sclp_early_printk() s390/boot: Rename decompressor_printk() to boot_printk() s390/boot: Compile all files with the same march flag s390: Use MARCH_HAS_*_FEATURES defines s390: Provide MARCH_HAS_*_FEATURES defines s390/facility: Disable compile time optimization for decompressor code s390/boot: Increase minimum architecture to z10 s390/als: Remove obsolete comment s390/sha3: Fix SHA3 selftests failures s390/pkey: Add AES xts and HMAC clear key token support s390/cpacf: Add MSA 10 and 11 new PCKMO functions s390/mm: Add cond_resched() to cmm_alloc/free_pages() s390/pai_ext: Update PAI extension 1 counters s390/pai_crypto: Add support for MSA 10 and 11 pai counters ...
2024-09-21Merge tag 'mm-stable-2024-09-20-02-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ...
2024-09-13s390/vdso: Wire up getrandom() vdso implementationHeiko Carstens
Provide the s390 specific vdso getrandom() architecture backend. _vdso_rng_data required data is placed within the _vdso_data vvar page, by using a hardcoded offset larger than vdso_data. As required the chacha20 implementation does not write to the stack. The implementation follows more or less the arm64 implementations and makes use of vector instructions. It has a fallback to the getrandom() system call for machines where the vector facility is not installed. The check if the vector facility is installed, as well as an optimization for machines with the vector-enhancements facility 2, is implemented with alternatives, avoiding runtime checks. Note that __kernel_getrandom() is implemented without the vdso user wrapper which would setup a stack frame for odd cases (aka very old glibc variants) where the caller has not done that. All callers of __kernel_getrandom() are required to setup a stack frame, like the C ABI requires it. The vdso testcases vdso_test_getrandom and vdso_test_chacha pass. Benchmark on a z16: $ ./vdso_test_getrandom bench-single vdso: 25000000 times in 0.493703559 seconds syscall: 25000000 times in 6.584025337 seconds Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-13s390/vdso: Move vdso symbol handling to separate header fileHeiko Carstens
The vdso.h header file, which is included at many places, includes generated header files. This can easily lead to recursive header file inclusions if the vdso code is changed. Therefore move the vdso symbol code, which requires the generated header files, to a separate header file, and include it at the two locations which require it. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-13s390/vdso: Allow alternatives in vdso codeHeiko Carstens
Implement the infrastructure required to allow alternatives in vdso code. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-13s390/alternatives: Remove ALT_FACILITY_EARLYHeiko Carstens
Patch all alternatives which depend on facilities from the decompressor. There is no technical reason which enforces to split patching of such alternatives to the decompressor and the kernel. This simplifies alternative handling a bit, since one alternative type is removed. Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-12s390/crypto: Display Query and Query Authentication Information in sysfsFinn Callies
Displays the query (fc=0) and query authentication information (fc=127) as binary in sysfs per CPACF instruction. Files are located in /sys/devices/system/cpu/cpacf/. These information can be fetched via asm already except for PCKMO because this instruction is privileged. To offer a unified interface all CPACF instructions will have this information displayed in sysfs in files <instruction>_query_raw and <instruction>_query_auth_info_raw. A new tool introduced into s390-tools called cpacfinfo will use this information to convert and display in human readable form. Suggested-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Finn Callies <fcallies@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-12s390/disassembler: Remove duplicate instruction format RSY_RDRUJens Remus
Instruction format RSY_RDRU is a duplicate of RSY_RURD2. Use the latter, as it follows the s390-specific conventions for instruction format naming used in binutils. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-09-07s390: Use MARCH_HAS_*_FEATURES definesHeiko Carstens
Replace CONFIG_HAVE_MARCH_*_FEATURES with MARCH_HAS_*_FEATURES everywhere so code gets compiled correctly depending on if the target is the kernel or the decompressor. Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-09-05s390/pai_ext: Update PAI extension 1 countersThomas Richter
Update the internal array of PAI extension 1 NNPA counter string table to support specialized processor instrumentation assist instructions. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-09-05s390/pai_crypto: Add support for MSA 10 and 11 pai countersThomas Richter
Update the internal array of PAI crypto counter string table with new counters supported with Message Security Assist extension (MSA) 10 and MSA 11. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Finn Callies <fcallies@linux.ibm.com> Tested-by: Finn Callies <fcallies@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-09-03arch, mm: move definition of node_data to generic codeMike Rapoport (Microsoft)
Every architecture that supports NUMA defines node_data in the same way: struct pglist_data *node_data[MAX_NUMNODES]; No reason to keep multiple copies of this definition and its forward declarations, especially when such forward declaration is the only thing in include/asm/mmzone.h for many architectures. Add definition and declaration of node_data to generic code and drop architecture-specific versions. Link: https://lkml.kernel.org/r/20240807064110.1003856-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01s390/uv: convert gmap_destroy_page() from follow_page() to folio_walkDavid Hildenbrand
Let's get rid of another follow_page() user and perform the UV calls under PTL -- which likely should be fine. No need for an additional reference while holding the PTL: uv_destroy_folio() and uv_convert_from_secure_folio() raise the refcount, so any concurrent make_folio_secure() would see an unexpted reference and cannot set PG_arch_1 concurrently. Do we really need a writable PTE? Likely yes, because the "destroy" part is, in comparison to the export, a destructive operation. So we'll keep the writability check for now. We'll lose the secretmem check from follow_page(). Likely we don't care about that here. Link: https://lkml.kernel.org/r/20240802155524.517137-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01s390/uv: drop arch_make_page_accessible()David Hildenbrand
All code was converted to using arch_make_folio_accessible(), let's drop arch_make_page_accessible(). Link: https://lkml.kernel.org/r/20240729183844.388481-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-29s390/hiperdispatch: Add hiperdispatch debug countersMete Durlu
Add three counters to follow and understand hiperdispatch behavior; * adjustment_count (amount of capacity adjustments triggered) * greedy_time_ms (time spent while all cpus are on high capacity) * conservative_time_ms (time spent while only entitled cpus are on high capacity) These counters can be found under /sys/kernel/debug/s390/hiperdispatch/ Time counters are in <msec> format and only cover the time spent when hiperdispatch is active. Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/hiperdispatch: Add hiperdispatch debug attributesMete Durlu
Add two attributes for debug purposes. They can be found under; /sys/devices/system/cpu/hiperdispatch/ * hd_stime_threshold : allows user to adjust steal time threshold * hd_delay_factor : allows user to adjust delay factor of hiperdispatch work (after topology updates, delayed work is always delayed extra by this factor) hd_stime_threshold can have values between 0-100 as it represents a percentage value. hd_delay_factor can have values greater than 1. It is multiplied with the default delay to achieve a longer interval, pushing back the next hiperdispatch adjustment after a topology update. Ex: if delay interval is 250ms and the delay factor is 4; delayed interval is now 1000ms(1sec). After each capacity adjustment or topology change, work has a delayed interval of 1 sec for one interval. Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/hiperdispatch: Add hiperdispatch sysctl interfaceMete Durlu
Expose hiperdispatch controls via sysctl. The user can now toggle hiperdispatch via assigning 0 or 1 to s390.hiperdispatch attribute. When hiperdipatch is toggled on, it tries to adjust CPU capacities, while system is in vertical polarization to gain performance benefits from different CPU polarizations. Disabling hiperdispatch reverts the CPU capacities to their default (HIGH_CAPACITY) and stops the dynamic adjustments. Introduce a kconfig option HIPERDISPATCH_ON which allows users to use hiperdispatch by default on vertical polarization. Using the sysctl attribute s390.hiperdispatch would overwrite this behavior. Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/hiperdispatch: Add trace eventsMete Durlu
Add trace events to debug hiperdispatch behavior and track domain rebuilding. Two events provide information about the decision making of hiperdispatch and the adjustments made. Acked-by: Vasily Gorbik <gor@linux.ibm.com> Co-developed-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/hiperdispatch: Add steal time averagingMete Durlu
The measurements done by hiperdispatch can have sudden spikes and dips during run time. To prevent these outliers effecting the decision making process and causing adjustment overhead, use weighted average of the steal time. Acked-by: Vasily Gorbik <gor@linux.ibm.com> Co-developed-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/hiperdispatch: Introduce hiperdispatchMete Durlu
When LPAR is in vertical polarization, CPUs get different polarization values, namely vertical high, vertical medium and vertical low. These values represent the likelyhood of the CPU getting physical runtime. Vertical high CPUs will always get runtime and others get varying runtime depending on the load the CEC is under. Vertical high and vertical medium CPUs are considered the CPUs which the current LPAR has the entitlement to run on. The vertical lows are on the other hand are borrowed CPUs which would only be given to the LPAR by hipervisor when the other LPARs are not utilizing them. Using the CPU capacities, hint linux scheduler when it should prioritise vertical high and vertical medium CPUs over vertical low CPUs. By tracking various system statistics hiperdispatch determines when to adjust cpu capacities. After each adjustment, rebuilding of scheduler domains is necessary to notify the scheduler about capacity changes but since this operation is costly it should be done as sparsely as possible. Acked-by: Vasily Gorbik <gor@linux.ibm.com> Co-developed-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/smp: Add cpu capacitiesMete Durlu
Linux scheduler allows architectures to assign capacity values to individual CPUs. This hints scheduler the performance difference between CPUs and allows more efficient task distribution them. Implement helper methods to set and get CPU capacities for s390. This is particularly helpful in vertical polarization configurations of LPARs. On vertical polarization an LPARs CPUs can get different polarization values depending on the CEC configuration. CPUs with different polarization values can perform different from each other, using CPU capacities this can be reflected to linux scheduler. Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/topology: Add config option to switch to vertical during bootTobias Huschle
By default, all systems on s390 start in horizontal cpu polarization. Selecting the new config option SCHED_TOPOLOGY_VERTICAL allows to build a kernel that switches to vertical polarization during boot. Acked-by: Heiko Carstens <hca@linux.ibm.com> Tested-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/topology: Add sysctl handler for polarizationTobias Huschle
Provide an additional path to set the polarization of the system, such that a user no longer relies on the sysfs interface only and is able configure the polarization for every reboot via sysctl control files. The new sysctl can be set as follows: - s390.polarization=0 for horizontal polarization - s390.polarization=1 for vertical polarization Acked-by: Heiko Carstens <hca@linux.ibm.com> Tested-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/wti: Add debugfs file to display missed grace periods per cpuTobias Huschle
Introduce a new debug file which allows to determine how many warning track grace periods were missed on each CPU. The new file can be found as /sys/kernel/debug/s390/wti It is formatted as: CPU0 CPU1 [...] CPUx xyz xyz [...] xyz Acked-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/wti: Add wti accounting for missed grace periodsTobias Huschle
A virtual CPU that has received a warning-track interrupt may fail to acknowledge the interrupt within the warning-track grace period. While this is usually not a problem, it will become necessary to investigate if there is a large number of such missed warning-track interrupts. Therefore, it is necessary to track these events. The information is tracked through the s390 debug facility and can be found under /sys/kernel/debug/s390dbf/wti/. The hex_ascii output is formatted as: <pid> <symbol> The values pid and current psw are collected when a warning track interrupt is received. Symbol is either the kernel symbol matching the collected psw or redacted to <user> when running in user space. Each line represents the currently executing process when a warning track interrupt was received which was then not acknowledged within its grace period. Acked-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/wti: Prepare graceful CPU pre-emption on wti receptionTobias Huschle
When a warning track interrupt is received, the kernel has only a very limited amount of time to make sure, that the CPU can be yielded as gracefully as possible before being pre-empted by the hypervisor. The interrupt handler for the wti therefore unparks a kernel thread which has being created on boot re-using the CPU hotplug kernel thread infrastructure. These threads exist per CPU and are assigned the highest possible real-time priority. This makes sure, that said threads will execute as soon as possible as the scheduler should pre-empt any other running user tasks to run the real-time thread. Furthermore, the interrupt handler disables all I/O interrupts to prevent additional interrupt processing on the soon-preempted CPU. Interrupt handlers are likely to take kernel locks, which in the worst case, will be kept while the interrupt handler is pre-empted from itself underlying physical CPU. In that case, all tasks or interrupt handlers on other CPUs would have to wait for the pre-empted CPU being dispatched again. By preventing further interrupt processing, this risk is minimized. Once the CPU gets dispatched again, the real-time kernel thread regains control, reenables interrupts and parks itself again. Acked-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/wti: Introduce infrastructure for warning track interruptTobias Huschle
The warning-track interrupt (wti) provides a notification that the receiving CPU will be pre-empted from its physical CPU within a short time frame. This time frame is called grace period and depends on the machine type. Giving up the CPU on time may prevent a task to get stuck while holding a resource. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-08-29s390/ftrace: Avoid extra serialization for graph caller patchingVasily Gorbik
The only context where ftrace_enable_ftrace_graph_caller() or ftrace_disable_ftrace_graph_caller() is called also calls ftrace_arch_code_modify_post_process(), which already performs text_poke_sync_lock(). ftrace_run_update_code() arch_ftrace_update_code() ftrace_modify_all_code() ftrace_enable_ftrace_graph_caller()/ftrace_disable_ftrace_graph_caller() ftrace_arch_code_modify_post_process() text_poke_sync_lock() Remove the redundant serialization. Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>