2024-09-12s390/vmlinux.lds.S: Move ro_after_init section behind rodata sectionHeiko Carstens
[ Upstream commit 75c10d5377d8821efafed32e4d72068d9c1f8ec0 ]

The .data.rel.ro and .got sections were added between the rodata and ro_after_init data sections, which adds an RW mapping in between all RO mappings of the kernel image:

  ---[ Kernel Image Start ]---
  0x000003ffe0000000-0x000003ffe0e00000   14M PMD RO X
  0x000003ffe0e00000-0x000003ffe0ec7000  796K PTE RO X
  0x000003ffe0ec7000-0x000003ffe0f00000  228K PTE RO NX
  0x000003ffe0f00000-0x000003ffe1300000    4M PMD RO NX
  0x000003ffe1300000-0x000003ffe1331000  196K PTE RO NX
  0x000003ffe1331000-0x000003ffe13b3000  520K PTE RW NX <---
  0x000003ffe13b3000-0x000003ffe13d5000  136K PTE RO NX
  0x000003ffe13d5000-0x000003ffe1400000  172K PTE RW NX
  0x000003ffe1400000-0x000003ffe1500000    1M PMD RW NX
  0x000003ffe1500000-0x000003ffe1700000    2M PTE RW NX
  0x000003ffe1700000-0x000003ffe1800000    1M PMD RW NX
  0x000003ffe1800000-0x000003ffe187e000  504K PTE RW NX
  ---[ Kernel Image End ]---

Move the ro_after_init data section again right behind the rodata section to prevent interleaving RO and RW mappings:

  ---[ Kernel Image Start ]---
  0x000003ffe0000000-0x000003ffe0e00000   14M PMD RO X
  0x000003ffe0e00000-0x000003ffe0ec7000  796K PTE RO X
  0x000003ffe0ec7000-0x000003ffe0f00000  228K PTE RO NX
  0x000003ffe0f00000-0x000003ffe1300000    4M PMD RO NX
  0x000003ffe1300000-0x000003ffe1353000  332K PTE RO NX
  0x000003ffe1353000-0x000003ffe1400000  692K PTE RW NX
  0x000003ffe1400000-0x000003ffe1500000    1M PMD RW NX
  0x000003ffe1500000-0x000003ffe1700000    2M PTE RW NX
  0x000003ffe1700000-0x000003ffe1800000    1M PMD RW NX
  0x000003ffe1800000-0x000003ffe187e000  504K PTE RW NX
  ---[ Kernel Image End ]---

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-29s390/smp,mcck: fix early IPI handlingHeiko Carstens
[ Upstream commit 4a1725281fc5b0009944b1c0e1d2c1dc311a09ec ] Both the external call as well as the emergency signal submask bits in control register 0 are set before any interrupt handler is registered. Change the order: first register the interrupt handler and only then enable the interrupts by setting the corresponding bits in control register 0. This prevents the situation where the second part of the machine check handler for early machine check handling is not executed: the machine check handler sends an IPI to the CPU it runs on. If the corresponding interrupts are enabled, but no interrupt handler is present, the interrupt is ignored. Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
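
A sketch of the resulting order; the helper names and control register bits follow the s390 code, but treat the details as illustrative rather than the literal patch:

  /* First register the handlers ... */
  if (register_external_irq(EXT_IRQ_EMERGENCY_SIG, do_ext_call_interrupt))
          panic("Couldn't request external interrupt 0x1201");
  if (register_external_irq(EXT_IRQ_EXTERNAL_CALL, do_ext_call_interrupt))
          panic("Couldn't request external interrupt 0x1202");
  /* ... and only then unmask them in control register 0 */
  system_ctl_set_bit(0, 14);      /* emergency signal submask */
  system_ctl_set_bit(0, 13);      /* external call submask */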
2024-08-29s390/uv: Panic for set and remove shared access UVC errorsClaudio Imbrenda
[ Upstream commit cff59d8631e1409ffdd22d9d717e15810181b32c ] The return value of uv_set_shared() and uv_remove_shared() (which are wrappers around the share() function) is not always checked. The system integrity of a protected guest depends on the Share and Unshare UVCs being successful. This means that any caller that fails to check the return value will compromise the security of the protected guest. No code path that would lead to such a violation of the security guarantees is currently exercised, since all the areas that are shared never get unshared during the lifetime of the system. This might change and become an issue in the future. The Share and Unshare UVCs can only fail in case of hypervisor misbehaviour (either a bug or malicious behaviour). In such cases there is no reasonable way forward, and the system needs to panic. This patch replaces the return at the end of the share() function with a panic, to guarantee system integrity. Fixes: 5abb9351dfd9 ("s390/uv: introduce guest side ultravisor code") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20240801112548.85303-1-imbrenda@linux.ibm.com Message-ID: <20240801112548.85303-1-imbrenda@linux.ibm.com> [frankja@linux.ibm.com: Fixed up patch subject] Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
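
In sketch form, the tail of share() changes from returning an error to a panic (the uv_call() wrapper is the existing one; the exact message is illustrative):

  if (!uv_call(0, (u64)&uvcb))
          return 0;
  /* a failed Share/Unshare UVC means the hypervisor misbehaved */
  panic("System security cannot be guaranteed unless the system panics now.\n");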
2024-08-29KVM: s390: fix validity interception issue when gisa is switched offMichael Mueller
commit 5a44bb061d04b0306f2aa8add761d86d152b9377 upstream.

We might run into a SIE validity if gisa has been disabled either via using kernel parameter "kvm.use_gisa=0" or by setting the related sysfs attribute to N (echo N >/sys/module/kvm/parameters/use_gisa).

The validity is caused by an invalid value in the SIE control block's gisa designation. That happens because we pass the uninitialized gisa origin to virt_to_phys() before writing it to the gisa designation.

To fix this we return 0 in kvm_s390_get_gisa_desc() if the origin is 0. kvm_s390_get_gisa_desc() is used to determine which gisa designation to set in the SIE control block. A value of 0 in the gisa designation disables gisa usage.

The issue surfaces in the host kernel with the following kernel message as soon as a new kvm guest start is attempted:

  kvm: unhandled validity intercept 0x1011
  WARNING: CPU: 0 PID: 781237 at arch/s390/kvm/intercept.c:101 kvm_handle_sie_intercept+0x42e/0x4d0 [kvm]
  Modules linked in: vhost_net tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat x_tables nf_nat_tftp nf_conntrack_tftp vfio_pci_core irqbypass vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables sunrpc mlx5_ib ib_uverbs ib_core mlx5_core uvdevice s390_trng eadm_sch vfio_ccw zcrypt_cex4 mdev vfio_iommu_type1 vfio sch_fq_codel drm i2c_core loop drm_panel_orientation_quirks configfs nfnetlink lcs ctcm fsm dm_service_time ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common dm_mirror dm_region_hash dm_log zfcp scsi_transport_fc scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkey zcrypt dm_multipath rng_core autofs4 [last unloaded: vfio_pci]
  CPU: 0 PID: 781237 Comm: CPU 0/KVM Not tainted 6.10.0-08682-gcad9f11498ea #6
  Hardware name: IBM 3931 A01 701 (LPAR)
  Krnl PSW : 0704c00180000000 000003d93deb0122 (kvm_handle_sie_intercept+0x432/0x4d0 [kvm])
             R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
  Krnl GPRS: 000003d900000027 000003d900000023 0000000000000028 000002cd00000000
             000002d063a00900 00000359c6daf708 00000000000bebb5 0000000000001eff
             000002cfd82e9000 000002cfd80bc000 0000000000001011 000003d93deda412
             000003ff8962df98 000003d93de77ce0 000003d93deb011e 00000359c6daf960
  Krnl Code: 000003d93deb0112: c020fffe7259  larl  %r2,000003d93de7e5c4
             000003d93deb0118: c0e53fa8beac  brasl %r14,000003d9bd3c7e70
            #000003d93deb011e: af000000      mc    0,0
            >000003d93deb0122: a728ffea      lhi   %r2,-22
             000003d93deb0126: a7f4fe24      brc   15,000003d93deafd6e
             000003d93deb012a: 9101f0b0      tm    176(%r15),1
             000003d93deb012e: a774fe48      brc   7,000003d93deafdbe
             000003d93deb0132: 40a0f0ae      sth   %r10,174(%r15)
  Call Trace:
   [<000003d93deb0122>] kvm_handle_sie_intercept+0x432/0x4d0 [kvm]
  ([<000003d93deb011e>] kvm_handle_sie_intercept+0x42e/0x4d0 [kvm])
   [<000003d93deacc10>] vcpu_post_run+0x1d0/0x3b0 [kvm]
   [<000003d93deaceda>] __vcpu_run+0xea/0x2d0 [kvm]
   [<000003d93dead9da>] kvm_arch_vcpu_ioctl_run+0x16a/0x430 [kvm]
   [<000003d93de93ee0>] kvm_vcpu_ioctl+0x190/0x7c0 [kvm]
   [<000003d9bd728b4e>] vfs_ioctl+0x2e/0x70
   [<000003d9bd72a092>] __s390x_sys_ioctl+0xc2/0xd0
   [<000003d9be0e9222>] __do_syscall+0x1f2/0x2e0
   [<000003d9be0f9a90>] system_call+0x70/0x98
  Last Breaking-Event-Address:
   [<000003d9bd3c7f58>] __warn_printk+0xe8/0xf0

Cc: stable@vger.kernel.org
Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Fixes: fe0ef0030463 ("KVM: s390: sort out physical vs virtual pointers usage")
Signed-off-by: Michael Mueller <mimu@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20240801123109.2782155-1-mimu@linux.ibm.com
Message-ID: <20240801123109.2782155-1-mimu@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
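
A sketch of the fixed helper; the field names follow the s390 KVM code, but treat the details as an approximation of the actual patch:

  static inline u32 kvm_s390_get_gisa_desc(struct kvm *kvm)
  {
          u32 gd;

          if (!kvm->arch.gisa_int.origin)
                  return 0;       /* gisa disabled: designation 0 */
          gd = virt_to_phys(kvm->arch.gisa_int.origin);
          if (gd && sclp.has_gisaf)
                  gd |= GISA_FORMAT1;
          return gd;
  }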
2024-08-03s390/cpum_cf: Fix endless loop in CF_DIAG event stopThomas Richter
[ Upstream commit e6ce1f12d777f6ee22b20e10ae6a771e7e6f44f5 ] Event CF_DIAG reads out complete counter sets using the stcctm instruction. This is done at event start time when the process starts execution and at event stop time when the process is removed from the CPU. During removal the difference of each counter in the counter sets is calculated and saved as raw data in the ring buffer. This works fine unless the number of counters in a counter set is zero. This may happen for the extended counter set. This set is machine specific and the size of the counter set can be zero even when the extended counter set is authorized for read access. This case is not handled. cfdiag_diffctr() checks authorization of the extended counter set. If true, the function assumes the extended counter set has been saved in a data buffer. However this is not the case, cfdiag_getctrset() does not save a counter set with a counter set size of zero. This mismatch causes an endless loop in the counter set readout during event stop handling. The calculation of the difference of the counters in each counter set now verifies the size of the counter set is non-zero. A counter set with size zero is skipped. Fixes: a029a4eab39e ("s390/cpumf: Allow concurrent access for CPU Measurement Counter Facility") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
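
The added guard, in sketch form (helper names are approximations of the cpum_cf code, not the literal patch):

  for (i = CPUMF_CTR_SET_BASIC; i < CPUMF_CTR_SET_MAX; ++i) {
          if (!(auth & cpumf_ctr_ctl[i]))
                  continue;       /* set not authorized */
          if (!cpum_cf_read_setsize(i))
                  continue;       /* authorized but empty: nothing was saved */
          /* compute the start/stop delta for this counter set */
  }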
2024-08-03s390/pci: Allow allocation of more than 1 MSI interruptGerd Bayer
[ Upstream commit ab42fcb511fd9d241bbab7cc3ca04e34e9fc0666 ] On a PCI adapter that provides up to 8 MSI interrupt sources the s390 implementation of PCI interrupts refused to accommodate them, although the underlying hardware is able to support that. For MSI-X it is sufficient to allocate a single irq_desc per msi_desc, but for MSI multiple irq descriptors are attached to and controlled by a single msi descriptor. Add the appropriate loops to maintain multiple irq descriptors and tie/untie them to/from the appropriate AIBV bit, if a device driver allocates more than 1 MSI interrupt. Common PCI code passes on requests to allocate a number of interrupt vectors based on the device drivers' demand and the PCI functions' capabilities. However, the root complex of s390 systems supports just a limited number of interrupt vectors per PCI function. Produce a kernel log message to inform about any architecture-specific capping that might be done. With this change, we had a PCI adapter successfully raising interrupts to its device driver via all 8 sources. Fixes: a384c8924a8b ("s390/PCI: Fix single MSI only check") Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-03s390/pci: Refactor arch_setup_msi_irqs()Gerd Bayer
[ Upstream commit 5fd11b96b43708f2f6e3964412c301c1bd20ec0f ] Factor out adapter interrupt allocation from arch_setup_msi_irqs() in preparation for enabling registration of multiple MSIs. Code movement only, no change of functionality intended. Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Stable-dep-of: ab42fcb511fd ("s390/pci: Allow allocation of more than 1 MSI interrupt") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-03s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()Gerald Schaefer
commit df39038cd89525d465c2c8827eb64116873f141a upstream. There is no support for HWPOISON, MEMORY_FAILURE, or ARCH_HAS_COPY_MC on s390. Therefore we do not expect to see VM_FAULT_HWPOISON in do_exception(). However, since commit af19487f00f3 ("mm: make PTE_MARKER_SWAPIN_ERROR more general"), it is possible to see VM_FAULT_HWPOISON in combination with PTE_MARKER_POISONED, even on architectures that do not support HWPOISON otherwise. In this case, we will end up on the BUG() in do_exception(). Fix this by treating VM_FAULT_HWPOISON the same as VM_FAULT_SIGBUS, similar to x86 when MEMORY_FAILURE is not configured. Also print unexpected fault flags, for easier debugging. Note that VM_FAULT_HWPOISON_LARGE is not expected, because s390 cannot support swap entries on other levels than PTE level. Cc: stable@vger.kernel.org # 6.6+ Fixes: af19487f00f3 ("mm: make PTE_MARKER_SWAPIN_ERROR more general") Reported-by: Yunseong Kim <yskelg@gmail.com> Tested-by: Yunseong Kim <yskelg@gmail.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Message-ID: <20240715180416.3632453-1-gerald.schaefer@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Yunseong Kim <yskelg@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-03s390/uv: Don't call folio_wait_writeback() without a folio referenceDavid Hildenbrand
[ Upstream commit 3f29f6537f54d74e64bac0a390fb2e26da25800d ] folio_wait_writeback() requires that no spinlocks are held and that a folio reference is held, as documented. After we dropped the PTL, the folio could get freed concurrently. So grab a temporary reference. Fixes: 214d9bbcd3a6 ("s390/mm: provide memory management functions for protected KVM guests") Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20240508182955.358628-2-david@redhat.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
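
The resulting pattern, in sketch form:

  folio_get(folio);               /* keep the folio alive while sleeping */
  pte_unmap_unlock(ptep, ptl);    /* drop the PTL first */
  folio_wait_writeback(folio);
  folio_put(folio);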
2024-08-03s390/mm: Convert gmap_make_secure to use a folioMatthew Wilcox (Oracle)
[ Upstream commit d35c34bb32f2cc4ec0b52e91ad7a8fcab55d7856 ] Remove uses of deprecated page APIs, and move the check for large folios to here to avoid taking the folio lock if the folio is too large. We could do better here by attempting to split the large folio, but I'll leave that improvement for someone who can test it. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240322161149.2327518-3-willy@infradead.org Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Stable-dep-of: 3f29f6537f54 ("s390/uv: Don't call folio_wait_writeback() without a folio reference") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-08-03s390/mm: Convert make_page_secure to use a folioMatthew Wilcox (Oracle)
[ Upstream commit 259e660d91d0e7261ae0ee37bb37266d6006a546 ] These page APIs are deprecated, so convert the incoming page to a folio and use the folio APIs instead. The ultravisor API cannot handle large folios, so return -EINVAL if one has slipped through. Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240322161149.2327518-2-willy@infradead.org Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Stable-dep-of: 3f29f6537f54 ("s390/uv: Don't call folio_wait_writeback() without a folio reference") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-18s390/mm: Add NULL pointer check to crst_table_free() and base_crst_free()Heiko Carstens
commit b5efb63acf7bddaf20eacfcac654c25c446eabe8 upstream. crst_table_free() used to work with NULL pointers before the conversion to ptdescs. Since crst_table_free() can be called with a NULL pointer (error handling in crst_table_upgrade()), add an explicit check. Also add the same check to base_crst_free() for consistency reasons. In real life this should not happen, since order two GFP_KERNEL allocations will not fail, unless FAIL_PAGE_ALLOC is enabled and used. Reported-by: Yunseong Kim <yskelg@gmail.com> Fixes: 6326c26c1514 ("s390: convert various pgalloc functions to use ptdescs") Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
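
A sketch of the guarded free, assuming the post-ptdesc-conversion body:

  void crst_table_free(struct mm_struct *mm, unsigned long *table)
  {
          if (!table)
                  return;
          pagetable_free(virt_to_ptdesc(table));
  }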
2024-07-18s390: Mark psw in __load_psw_mask() as __uninitializedSven Schnelle
[ Upstream commit 7278a8fb8d032dfdc03d9b5d17e0bc451cdc1492 ]

Without __uninitialized, the following code is generated when INIT_STACK_ALL_ZERO is enabled:

  86: d7 0f f0 a0 f0 a0  xc    160(16,%r15), 160(%r15)
  8c: e3 40 f0 a0 00 24  stg   %r4, 160(%r15)
  92: c0 10 00 00 00 08  larl  %r1, 0xa2
  98: e3 10 f0 a8 00 24  stg   %r1, 168(%r15)
  9e: b2 b2 f0 a0        lpswe 160(%r15)

The xc is not adding any security because psw is fully initialized with the following instructions. Add __uninitialized to the psw definition to avoid the superfluous clearing of psw.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-11KVM: s390: fix LPSWEY handlingChristian Borntraeger
[ Upstream commit 4c6abb7f7b349f00c0f7ed5045bf67759c012892 ] In rare cases, e.g. for injecting a machine check, we do intercept all load PSW instructions via ICTL_LPSW. With facility 193 a new variant LPSWEY was added. KVM needs to handle that as well. Fixes: a3efa8429266 ("KVM: s390: gen_facilities: allow facilities 165, 193, 194 and 196") Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com> Message-ID: <20240628163547.2314-1-borntraeger@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-09Revert "bpf: Take return from set_memory_rox() into account with ↵Greg Kroah-Hartman
bpf_jit_binary_lock_ro()" This reverts commit 08f6c05feb1db21653e98ca84ea04ca032d014c7 which is commit e60adf513275c3a38e5cb67f7fd12387e43a3ff5 upstream. It is part of a series that is reported to both break the arm64 builds and instantly crashes the powerpc systems at the first load of a bpf program. So revert it for now until it can come back in a safe way. Reported-by: matoro <matoro_mailinglist_kernel@matoro.tk> Reported-by: Vitaly Chikunov <vt@altlinux.org> Reported-by: WangYuli <wangyuli@uniontech.com> Link: https://lore.kernel.org/r/5A29E00D83AB84E3+20240706031101.637601-1-wangyuli@uniontech.com Link: https://lore.kernel.org/r/cf736c5e37489e7dc7ffd67b9de2ab47@matoro.tk Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Song Liu <song@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Kees Cook <keescook@chromium.org> Cc: Puranjay Mohan <puranjay12@gmail.com> Cc: Ilya Leoshkevich <iii@linux.ibm.com> # s390x Cc: Tiezhu Yang <yangtiezhu@loongson.cn> # LoongArch Cc: Johan Almbladh <johan.almbladh@anyfinetworks.com> # MIPS Part Cc: Alexei Starovoitov <ast@kernel.org> Cc: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-05syscalls: fix compat_sys_io_pgetevents_time64 usageArnd Bergmann
commit d3882564a77c21eb746ba5364f3fa89b88de3d61 upstream. Using sys_io_pgetevents() as the entry point for compat mode tasks works almost correctly, but misses the sign extension for the min_nr and nr arguments. This was addressed on parisc by switching to compat_sys_io_pgetevents_time64() in commit 6431e92fc827 ("parisc: io_pgetevents_time64() needs compat syscall in 32-bit compat mode"), as well as by using more sophisticated system call wrappers on x86 and s390. However, arm64, mips, powerpc, sparc and riscv still have the same bug. Change all of them over to use compat_sys_io_pgetevents_time64() like parisc already does. This was clearly the intention when the function was originally added, but it got hooked up incorrectly in the tables. Cc: stable@vger.kernel.org Fixes: 48166e6ea47d ("y2038: add 64-bit time_t syscalls to all 32-bit architectures") Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390 Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
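
The effect of the missing sign extension can be demonstrated in plain C (on a 64-bit machine):

  #include <stdio.h>

  int main(void)
  {
          /* min_nr = -1 as passed by a 32-bit task in a 64-bit register */
          unsigned long reg = 0xffffffffUL;
          long wrong = (long)reg;         /* 4294967295 */
          long right = (long)(int)reg;    /* -1, what user space meant */

          printf("without sign extension: %ld, with: %ld\n", wrong, right);
          return 0;
  }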
2024-07-05randomize_kstack: Remove non-functional per-arch entropy filteringKees Cook
[ Upstream commit 6db1208bf95b4c091897b597c415e11edeab2e2d ]

An unintended consequence of commit 9c573cd31343 ("randomize_kstack: Improve entropy diffusion") was that the per-architecture entropy size filtering reduced how many bits were being added to the mix, rather than how many bits were being used during the offsetting. All architectures fell back to the existing default of 0x3FF (10 bits), which will consume at most 1KiB of stack space. It seems that this is working just fine, so let's avoid the confusion and update everything to use the default.

The prior intent of the per-architecture limits was:

  arm64: capped at 0x1FF (9 bits), 5 bits effective
  powerpc: uncapped (10 bits), 6 or 7 bits effective
  riscv: uncapped (10 bits), 6 bits effective
  x86: capped at 0xFF (8 bits), 5 (x86_64) or 6 (ia32) bits effective
  s390: capped at 0xFF (8 bits), undocumented effective entropy

Current discussion has led to just dropping the original per-architecture filters. The additional entropy appears to be safe for arm64, x86, and s390. Quoting Arnd, "There is no point pretending that 15.75KB is somehow safe to use while 15.00KB is not."

Co-developed-by: Yuntao Liu <liuyuntao12@huawei.com>
Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>
Fixes: 9c573cd31343 ("randomize_kstack: Improve entropy diffusion")
Link: https://lore.kernel.org/r/20240617133721.377540-1-liuyuntao12@huawei.com
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Link: https://lore.kernel.org/r/20240619214711.work.953-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-05bpf: Take return from set_memory_rox() into account with bpf_jit_binary_lock_ro()Christophe Leroy
[ Upstream commit e60adf513275c3a38e5cb67f7fd12387e43a3ff5 ] set_memory_rox() can fail, leaving memory unprotected. Check return and bail out when bpf_jit_binary_lock_ro() returns an error. Link: https://github.com/KSPP/linux/issues/7 Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: linux-hardening@vger.kernel.org <linux-hardening@vger.kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Puranjay Mohan <puranjay12@gmail.com> Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com> # s390x Acked-by: Tiezhu Yang <yangtiezhu@loongson.cn> # LoongArch Reviewed-by: Johan Almbladh <johan.almbladh@anyfinetworks.com> # MIPS Part Message-ID: <036b6393f23a2032ce75a1c92220b2afcb798d5d.1709850515.git.christophe.leroy@csgroup.eu> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-07-05s390/pci: Add missing virt_to_phys() for directed DIBVNiklas Schnelle
[ Upstream commit 4181b51c38875de9f6f11248fa0bcf3246c19c82 ] In commit 4e4dc65ab578 ("s390/pci: use phys_to_virt() for AIBVs/DIBVs") the setting of dibv_addr was missed when adding virt_to_phys(). This only affects systems with directed interrupt delivery enabled, which are not generally available. Fixes: 4e4dc65ab578 ("s390/pci: use phys_to_virt() for AIBVs/DIBVs") Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-27kprobe/ftrace: bail out if ftrace was killedStephen Brennan
[ Upstream commit 1a7d0890dd4a502a202aaec792a6c04e6e049547 ]

If an error happens in ftrace, ftrace_kill() will prevent disarming kprobes. Eventually, the ftrace_ops associated with the kprobes will be freed, yet the kprobes will still be active, and when triggered, they will use the freed memory, likely resulting in a page fault and panic.

This behavior can be reproduced quite easily, by creating a kprobe and then triggering an ftrace_kill(). For simplicity, we can simulate an ftrace error with a kernel module like [1]:

[1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer

  sudo perf probe --add commit_creds
  sudo perf trace -e probe:commit_creds
  # In another terminal
  make
  sudo insmod ftrace_killer.ko  # calls ftrace_kill(), simulating bug
  # Back to perf terminal
  # ctrl-c
  sudo perf probe --del commit_creds

After a short period, a page fault and panic would occur as the kprobe continues to execute and uses the freed ftrace_ops. While ftrace_kill() is supposed to be used only in extreme circumstances, it is invoked in FTRACE_WARN_ON() and so there are many places where an unexpected bug could be triggered, yet the system may continue operating, possibly without the administrator noticing. If ftrace_kill() does not panic the system, then we should do everything we can to continue operating, rather than leave a ticking time bomb.

Link: https://lore.kernel.org/all/20240501162956.229427-1-stephen.s.brennan@oracle.com/
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
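
The shape of the added bail-out (placement, message and error code are illustrative; ftrace_is_dead() is the existing predicate):

  if (ftrace_is_dead()) {
          pr_warn("kprobes: failed to arm, ftrace is dead\n");
          return -EINVAL;
  }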
2024-06-16s390/cpacf: Make use of invalid opcode produce a link errorHarald Freudenberger
commit 32e8bd6423fc127d2b37bdcf804fd76af3bbec79 upstream. Instead of calling BUG() at runtime introduce and use a prototype for a non-existing function to produce a link error during compile when a not supported opcode is used with the __cpacf_query() or __cpacf_check_opcode() inline functions. Suggested-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Holger Dengler <dengler@linux.ibm.com> Reviewed-by: Juergen Christ <jchrist@linux.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
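
The idea, sketched: declare a function that is defined nowhere and call it on the unsupported-opcode path, so misuse fails at link time instead of at runtime:

  /* intentionally has no definition anywhere */
  void __cpacf_bad_opcode(void);

  /* in __cpacf_check_opcode(), the unsupported case becomes: */
  default:
          __cpacf_bad_opcode();   /* link error instead of BUG() */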
2024-06-16s390/cpacf: Split and rework cpacf query functionsHarald Freudenberger
commit 830999bd7e72f4128b9dfa37090d9fa8120ce323 upstream. Rework the cpacf query functions to use the correct RRE or RRF instruction formats and set register fields within instructions correctly. Fixes: 1afd43e0fbba ("s390/crypto: allow to query all known cpacf functions") Reported-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com> Suggested-by: Heiko Carstens <hca@linux.ibm.com> Suggested-by: Juergen Christ <jchrist@linux.ibm.com> Suggested-by: Holger Dengler <dengler@linux.ibm.com> Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Holger Dengler <dengler@linux.ibm.com> Reviewed-by: Juergen Christ <jchrist@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-06-16mm: fix race between __split_huge_pmd_locked() and GUP-fastRyan Roberts
commit 3a5a8d343e1cf96eb9971b17cbd4b832ab19b8e7 upstream.

__split_huge_pmd_locked() can be called for a present THP, devmap or (non-present) migration entry. It calls pmdp_invalidate() unconditionally on the pmdp and only determines if it is present or not based on the returned old pmd. This is a problem for the migration entry case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be called for a present pmd.

On arm64 at least, pmd_mkinvalid() will mark the pmd such that any future call to pmd_present() will return true. And therefore any lockless pgtable walker could see the migration entry pmd in this state and start interpreting the fields as if it were present, leading to BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.

x86 does not suffer the above problem, but instead pmd_mkinvalid() will corrupt the offset field of the swap entry within the swap pte. See link below for discussion of that problem.

Fix all of this by only calling pmdp_invalidate() for a present pmd. And for good measure let's add a warning to all implementations of pmdp_invalidate[_ad](). I've manually reviewed all other pmdp_invalidate[_ad]() call sites and believe all others to be conformant.

This is a theoretical bug found during code review. I don't have any test case to trigger it in practice.

Link: https://lkml.kernel.org/r/20240501143310.1381675-1-ryan.roberts@arm.com
Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
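
In sketch form, the fixed flow in __split_huge_pmd_locked():

  if (present) {
          old_pmd = pmdp_invalidate(vma, haddr, pmd);
  } else {
          /* migration entry: must never reach pmd_mkinvalid() */
          old_pmd = *pmd;
  }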
2024-06-12s390/boot: Remove alt_stfle_fac_list from decompressorSven Schnelle
[ Upstream commit e7dec0b7926f3cd493c697c4c389df77e8e8a34c ] It is not used anywhere in the decompressor, therefore remove it. Fixes: 17e89e1340a3 ("s390/facilities: move stfl information from lowcore to global data") Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12s390/ipl: Fix incorrect initialization of nvme dump blockAlexander Egorenkov
[ Upstream commit 7faacaeaf6ce12fae78751de5ad869d8f1e1cd7a ] Initialize the correct fields of the nvme dump block. This bug had not been detected before because first, the fcp and nvme fields of struct ipl_parameter_block are part of the same union and, therefore, overlap in memory and second, they are identical in structure and size. Fixes: d70e38cb1dee ("s390: nvme dump support") Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12s390/ipl: Fix incorrect initialization of len fields in nvme reipl blockAlexander Egorenkov
[ Upstream commit 9c922b73acaf39f867668d9cbe5dc69c23511f84 ] Use correct symbolic constants IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN to initialize nvme reipl block when 'scp_data' sysfs attribute is being updated. This bug had not been detected before because the corresponding fcp and nvme symbolic constants are equal. Fixes: 23a457b8d57d ("s390: nvme reipl") Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
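
The fix boils down to using the nvme constants named above (field names approximate the ipl code):

  reipl_block_nvme->hdr.len = IPL_BP_NVME_LEN;
  reipl_block_nvme->nvme.len = IPL_BP0_NVME_LEN;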
2024-06-12s390/vdso: Use standard stack frame layoutHeiko Carstens
[ Upstream commit 185445c7c137822ad856aae91a41e199370cb534 ] By default user space is compiled with standard stack frame layout and not with the packed stack layout. The vdso code however inherited the -mpacked-stack compiler option from the kernel. Remove this option to make sure the vdso is compiled with standard stack frame layout. This makes sure that the stack frame backchain location for vdso generated stack frames is the same like for calling code (if compiled with default options). This allows to manually walk stack frames without DWARF information, like the kernel is doing it e.g. with arch_stack_walk_user(). Fixes: 4bff8cb54502 ("s390: convert to GENERIC_VDSO") Reviewed-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12kbuild: unify vdso_install rulesMasahiro Yamada
[ Upstream commit 56769ba4b297a629148eb24d554aef72d1ddfd9e ]

Currently, there is no standard implementation for vdso_install, leading to various issues:

1. Code duplication

Many architectures duplicate similar code just for copying files to the install destination. Some architectures (arm, sparc, x86) create build-id symlinks, introducing more code duplication.

2. Unintended updates of in-tree build artifacts

The vdso_install rule depends on the vdso files to install. It may update in-tree build artifacts. This can be problematic, as explained in commit 19514fc665ff ("arm, kbuild: make "make install" not depend on vmlinux").

3. Broken code in some architectures

Makefile code is often copied from one architecture to another without proper adaptation. 'make vdso_install' for parisc does not work. 'make vdso_install' for s390 installs vdso64, but not vdso32.

To address these problems, this commit introduces a generic vdso_install rule.

Architectures that support vdso_install need to define vdso-install-y in arch/*/Makefile. vdso-install-y lists the files to install. For example, arch/x86/Makefile looks like this:

  vdso-install-$(CONFIG_X86_64)         += arch/x86/entry/vdso/vdso64.so.dbg
  vdso-install-$(CONFIG_X86_X32_ABI)    += arch/x86/entry/vdso/vdsox32.so.dbg
  vdso-install-$(CONFIG_X86_32)         += arch/x86/entry/vdso/vdso32.so.dbg
  vdso-install-$(CONFIG_IA32_EMULATION) += arch/x86/entry/vdso/vdso32.so.dbg

These files will be installed to $(MODLIB)/vdso/ with the .dbg suffix, if exists, stripped away.

vdso-install-y can optionally take the second field after the colon separator. This is needed because some architectures install a vdso file as a different base name. The following is a snippet from arch/arm64/Makefile.

  vdso-install-$(CONFIG_COMPAT_VDSO) += arch/arm64/kernel/vdso32/vdso.so.dbg:vdso32.so

This will rename vdso.so.dbg to vdso32.so during installation. If such architectures change their implementation so that the base names match, this workaround will go away.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Sven Schnelle <svens@linux.ibm.com> # s390
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
Reviewed-by: Guo Ren <guoren@kernel.org>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Stable-dep-of: fc2f5f10f9bc ("s390/vdso: Create .build-id links for unstripped vdso files")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12s390/vdso: Generate unwind information for C modulesJens Remus
[ Upstream commit 10f70525365146046dddcc3d36bfaea2aee0376a ] GDB fails to unwind vDSO functions with error message "PC not saved", for instance when stepping through gettimeofday(). Add -fasynchronous-unwind-tables to CFLAGS to generate .eh_frame DWARF unwind information for the vDSO C modules. Fixes: 4bff8cb54502 ("s390: convert to GENERIC_VDSO") Signed-off-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12s390/vdso64: filter out munaligned-symbols flag for vdsoSumanth Korikkar
[ Upstream commit 8192a1b3807510d0ed5be1f8988c08f8d41cced9 ]

Gcc recently implemented an optimization [1] for loading symbols without explicit alignment, aligning with the IBM Z ELF ABI. This ABI mandates symbols to reside on a 2-byte boundary, enabling the use of the larl instruction. However, kernel linker scripts may still generate unaligned symbols. To address this, a new -munaligned-symbols option has been introduced [2] in recent gcc versions.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622872.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2023-August/625986.html

However, when -munaligned-symbols is used in vdso code, it leads to the following compilation error:

  `.data.rel.ro.local' referenced in section `.text' of
  arch/s390/kernel/vdso64/vdso64_generic.o: defined in discarded section
  `.data.rel.ro.local' of arch/s390/kernel/vdso64/vdso64_generic.o

The vdso linker script discards the .data section to make the vdso lightweight. However, with -munaligned-symbols, vdso object files reference the literal pool and access _vdso_data. Hence, compile vdso code without -munaligned-symbols. This means that in the future, vdso code should deal with the alignment of newly introduced unaligned linker symbols.

Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Link: https://lore.kernel.org/r/20240219132734.22881-2-sumanthk@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Stable-dep-of: 10f705253651 ("s390/vdso: Generate unwind information for C modules")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-06-12s390/bpf: Emit a barrier for BPF_FETCH instructionsIlya Leoshkevich
[ Upstream commit 68378982f0b21de02ac3c6a11e2420badefcb4bc ]

BPF_ATOMIC_OP() macro documentation states that "BPF_ADD | BPF_FETCH" should be the same as atomic_fetch_add(), which is currently not the case on s390x: the serialization instruction "bcr 14,0" is missing. This applies to "and", "or" and "xor" variants too.

s390x is allowed to reorder stores with subsequent fetches from different addresses, so code relying on BPF_FETCH acting as a barrier, for example:

  stw [%r0], 1
  afadd [%r1], %r2
  ldxw %r3, [%r4]

may be broken. Fix it by emitting "bcr 14,0".

Note that a separate serialization instruction is not needed for BPF_XCHG and BPF_CMPXCHG, because COMPARE AND SWAP performs serialization itself.

Fixes: ba3b86b9cef0 ("s390/bpf: Implement new atomic ops")
Reported-by: Puranjay Mohan <puranjay12@gmail.com>
Closes: https://lore.kernel.org/bpf/mb61p34qvq3wf.fsf@kernel.org/
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240507000557.12048-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
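
In sketch form, in the JIT's atomic-op handling (the emit macro name is an assumption):

  /* after the interlocked-access instruction for
   * BPF_{ADD,AND,OR,XOR} | BPF_FETCH: */
  _EMIT2(0x07e0);         /* bcr 14,0: serialize */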
2024-06-12s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guestsDavid Hildenbrand
[ Upstream commit 06201e00ee3e4beacac48aab2b83eff64ebf0bc0 ]

commit fa41ba0d08de ("s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs") introduced an undesired side effect when combined with memory ballooning and VM migration: memory part of the inflated memory balloon will consume memory.

Assume we have a 100GiB VM and inflated the balloon to 40GiB. Our VM will consume ~60GiB of memory. If we now trigger a VM migration, hypervisors like QEMU will read all VM memory. As s390x does not support the shared zeropage, we'll end up allocating for all previously-inflated memory part of the memory balloon: 50 GiB. So we might easily (unexpectedly) crash the VM on the migration source.

Even worse, hypervisors like QEMU optimize for zeropage migration to not consume memory on the migration destination: when migrating a "page full of zeroes", on the migration destination they check whether the target memory is already zero (by reading the destination memory) and avoid writing to the memory to not allocate memory: however, s390x will also allocate memory here, implying that also on the migration destination, we will end up allocating all previously-inflated memory part of the memory balloon.

This is especially bad if actual memory overcommit was not desired, when memory ballooning is used for dynamic VM memory resizing, setting aside some memory during boot that can be added later on demand. Alternatives like virtio-mem that would avoid this issue are not yet available on s390x.

There could be ways to optimize some cases in user space: before reading memory in an anonymous private mapping on the migration source, check via /proc/self/pagemap if anything is already populated. Similarly check on the migration destination before reading. While that would avoid populating tables full of shared zeropages on all architectures, it's harder to get right and performant, and requires user space changes. Further, with postcopy live migration we must place a page, so there, "avoid touching memory to avoid allocating memory" is not really possible. (Note that previously we would have falsely inserted shared zeropages into processes using UFFDIO_ZEROPAGE where mm_forbids_zeropage() would have actually forbidden it)

PV is currently incompatible with memory ballooning, and in the common case, KVM guests don't make use of storage keys. Instead of zapping zeropages when enabling storage keys / PV, which turned out to be problematic in the past, let's do exactly the same we do with KSM pages: trigger unsharing faults to replace the shared zeropages by proper anonymous folios.

What about added latency when enabling storage keys? Having a lot of zeropages in applicable environments (PV, legacy guests, unittests) is unexpected. Further, KSM could today already unshare the zeropages, and unmerging KSM pages when enabling storage keys would unshare the KSM-placed zeropages in the same way, resulting in the same latency.

[ agordeev: Fixed sparse and checkpatch complaints and error handling ]

Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Fixes: fa41ba0d08de ("s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240411161441.910170-3-david@redhat.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-17s390/vdso: Add CFI for RA register to asm macro vdso_funcJens Remus
[ Upstream commit b961ec10b9f9719987470236feb50c967db5a652 ] The return-address (RA) register r14 is specified as volatile in the s390x ELF ABI [1]. Nevertheless proper CFI directives must be provided for an unwinder to restore the return address, if the RA register value is changed from its value at function entry, as it is the case. [1]: s390x ELF ABI, https://github.com/IBM/s390x-abi/releases Fixes: 4bff8cb54502 ("s390: convert to GENERIC_VDSO") Signed-off-by: Jens Remus <jremus@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-17s390/mm: Fix clearing storage keys for huge pagesClaudio Imbrenda
[ Upstream commit 412050af2ea39407fe43324b0be4ab641530ce88 ] The function __storage_key_init_range() expects the end address to be the first byte outside the range to be initialized. I.e. end - start should be the size of the area to be initialized. The current code works because __storage_key_init_range() will still loop over every page in the range, but it is slower than using sske_frame(). Fixes: 3afdfca69870 ("s390/mm: Clear skeys for newly mapped huge guest pmds") Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Link: https://lore.kernel.org/r/20240416114220.28489-3-imbrenda@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-17s390/mm: Fix storage key clearing for guest huge pagesClaudio Imbrenda
[ Upstream commit 843c3280686fc1a83d89ee1e0b5599c9f6b09d0c ] The function __storage_key_init_range() expects the end address to be the first byte outside the range to be initialized. I.e. end - start should be the size of the area to be initialized. The current code works because __storage_key_init_range() will still loop over every page in the range, but it is slower than using sske_frame(). Fixes: 964c2c05c9f3 ("s390/mm: Clear huge page storage keys on enable_skey") Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Link: https://lore.kernel.org/r/20240416114220.28489-2-imbrenda@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
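
Both this and the previous commit correct the same off-by-one; in sketch form:

  /* "end" is exclusive: the first byte after the range */
  __storage_key_init_range(paddr, paddr + HPAGE_SIZE);    /* was: + HPAGE_SIZE - 1 */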
2024-04-10s390/entry: align system call table on 8 bytesSumanth Korikkar
commit 378ca2d2ad410a1cd5690d06b46c5e2297f4c8c0 upstream. Align system call table on 8 bytes. With sys_call_table entry size of 8 bytes that eliminates the possibility of a system call pointer crossing cache line boundary. Cc: stable@kernel.org Suggested-by: Ulrich Weigand <ulrich.weigand@de.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
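
Why 8-byte alignment is sufficient can be checked with a few lines of C:

  #include <assert.h>

  int main(void)
  {
          /* an 8-byte entry at an 8-byte aligned offset never
           * straddles a 64-byte cache line */
          for (unsigned int off = 0; off < 4096; off += 8)
                  assert(off / 64 == (off + 7) / 64);
          return 0;
  }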
2024-04-10mm/treewide: replace pud_large() with pud_leaf()Peter Xu
[ Upstream commit 0a845e0f6348ccfa2dcc8c450ffd1c9ffe8c4add ] pud_large() is always defined as pud_leaf(). Merge their usages. Chose pud_leaf() because pud_leaf() is a global API, while pud_large() is not. Link: https://lkml.kernel.org/r/20240305043750.93762-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: c567f2948f57 ("Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-10s390/bpf: Fix bpf_plt pointer arithmeticIlya Leoshkevich
[ Upstream commit 7ded842b356d151ece8ac4985940438e6d3998bb ]

Kui-Feng Lee reported a crash on s390x triggered by the dummy_st_ops/dummy_init_ptr_arg test [1]:

  [<0000000000000002>] 0x2
  [<00000000009d5cde>] bpf_struct_ops_test_run+0x156/0x250
  [<000000000033145a>] __sys_bpf+0xa1a/0xd00
  [<00000000003319dc>] __s390x_sys_bpf+0x44/0x50
  [<0000000000c4382c>] __do_syscall+0x244/0x300
  [<0000000000c59a40>] system_call+0x70/0x98

This is caused by GCC moving memcpy() after assignments in bpf_jit_plt(), resulting in NULL pointers being written instead of the return and the target addresses.

Looking at the GCC internals, the reordering is allowed because the alias analysis thinks that the memcpy() destination and the assignments' left-hand-sides are based on different objects: new_plt and bpf_plt_ret/bpf_plt_target respectively, and therefore they cannot alias.

This is in turn due to a violation of the C standard:

  When two pointers are subtracted, both shall point to elements of the
  same array object, or one past the last element of the array object ...

From C's perspective, bpf_plt_ret and bpf_plt are distinct objects and cannot be subtracted. In practical terms, doing so confuses GCC's alias analysis.

The code was written this way in order to let the C side know a few offsets defined in the assembly. While nice, this is by no means necessary. Fix the noncompliance by hardcoding these offsets.

[1] https://lore.kernel.org/bpf/c9923c1d-971d-4022-8dc8-1364e929d34c@gmail.com/

Fixes: f1d5df84cd8c ("s390/bpf: Implement bpf_arch_text_poke()")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Message-ID: <20240320015515.11883-1-iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
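
The problem and the fix, reduced to a sketch (the offset value is illustrative):

  extern char bpf_plt[], bpf_plt_ret[];

  /* before: UB, bpf_plt_ret and bpf_plt are distinct objects, so GCC
   * may reorder the store ahead of the memcpy() */
  memcpy(plt, bpf_plt, plt_size);
  *(void **)(plt + (bpf_plt_ret - bpf_plt)) = ret_target;

  /* after: hardcode the offset that the assembly layout guarantees */
  #define BPF_PLT_RET_OFFSET 16
  memcpy(plt, bpf_plt, plt_size);
  *(void **)(plt + BPF_PLT_RET_OFFSET) = ret_target;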
2024-03-26s390/vtime: fix average steal time calculationMete Durlu
[ Upstream commit 367c50f78451d3bd7ad70bc5c89f9ba6dec46ca9 ]

The current average steal timer calculation produces volatile and inflated values. The only user of this value is KVM so far and it uses it to decide whether or not to yield the vCPU which is seeing steal time. KVM compares the average steal timer to a threshold and if the threshold is exceeded then it does not allow CPU polling and yields it to the host, else it keeps the CPU by polling. Since KVM's steal time threshold is very low by default (10%) it most likely is not affected much by the bloated average steal timer values, because the operating region is pretty small. However there might be new users in the future who might rely on this number.

Fix the average steal timer calculation by changing the formula from:

  avg_steal_timer = avg_steal_timer / 2 + steal_timer;

to the following:

  avg_steal_timer = (avg_steal_timer + steal_timer) / 2;

This ensures that avg_steal_timer is actually a naive average of steal timer values. It now closely follows steal timer values, but of course in a smoother manner.

Fixes: 152e9b8676c6 ("s390/vtime: steal time exponential moving average")
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
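
The difference is easy to demonstrate: with a constant steal time the old formula converges to twice the real value, the new one to the value itself:

  #include <stdio.h>

  int main(void)
  {
          unsigned long old_avg = 0, new_avg = 0, steal = 100;

          for (int i = 0; i < 16; i++) {
                  old_avg = old_avg / 2 + steal;          /* -> ~200 */
                  new_avg = (new_avg + steal) / 2;        /* -> ~100 */
          }
          printf("old=%lu new=%lu steal=%lu\n", old_avg, new_avg, steal);
          return 0;
  }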
2024-03-26s390/cache: prevent rebuild of shared_cpu_listHeiko Carstens
[ Upstream commit cb0cd4ee11142339f2d47eef6db274290b7a482d ]

With commit 36bbc5b4ffab ("cacheinfo: Allow early detection and population of cache attributes") the shared cpu list for each cache level higher than L1 is rebuilt even if the list already has been set up. This is caused by the removal of the cpumask_empty() check within cache_shared_cpu_map_setup(). However architectures can enforce that the shared cpu list is not rebuilt by simply setting cpu_map_populated of the per cpu cache info structure to true, which is also the fix for this problem.

Before:
  $ cat /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list
  0-7

After:
  $ cat /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list
  1

Fixes: 36bbc5b4ffab ("cacheinfo: Allow early detection and population of cache attributes")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
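
The fix, in sketch form, in the s390 cache info setup after the shared masks are filled:

  this_cpu_ci->cpu_map_populated = true; /* tell common code not to rebuild */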
2024-03-26s390/vdso: drop '-fPIC' from LDFLAGSNathan Chancellor
[ Upstream commit 0628c03934187be33942580e10bb9afcc61adeed ]

'-fPIC' as an option to the linker does not do what it seems like it should. With ld.bfd, it is treated as '-f PIC', which does not make sense based on the meaning of '-f':

  -f SHLIB, --auxiliary SHLIB
      Auxiliary filter for shared object symbol table

When building with ld.lld (currently under review in a GitHub pull request), it just errors out because '-f' means nothing and neither does '-fPIC':

  ld.lld: error: unknown argument '-fPIC'

'-fPIC' was blindly copied from CFLAGS when the vDSO stopped being linked with '$(CC)', it should not be needed. Remove it to clear up the build failure with ld.lld.

Fixes: 2b2a25845d53 ("s390/vdso: Use $(LD) instead of $(CC) to link vDSO")
Link: https://github.com/llvm/llvm-project/pull/75643
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Fangrui Song <maskray@google.com>
Link: https://lore.kernel.org/r/20240130-s390-vdso-drop-fpic-from-ldflags-v1-1-094ad104fc55@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26s390/pai: fix attr_event_free upper limit for pai device driversThomas Richter
[ Upstream commit 225d09d6e5f3870560665a1829d2db79330b4c58 ] When the device drivers are initialized, a sysfs directory is created. This contains many attributes which are allocated with kzalloc(). Should it fail, the memory for the attributes already created is freed in attr_event_free(). Its second parameter is the number of attribute elements to delete. This parameter is off by one. When, for example, the 10th attribute fails to get created, attributes numbered 0 to 9 should be deleted. Currently only attributes numbered 0 to 8 are deleted. Fixes: 39d62336f5c1 ("s390/pai: add support for cryptography counters") Reported-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
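
The off-by-one, sketched at the error-handling site:

  /* allocation of attrs[i] failed: attrs[0] .. attrs[i-1] exist,
   * so exactly i entries must be freed */
  attr_event_free(attrs, i);      /* was: i - 1 */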
2024-03-15KVM: s390: vsie: fix race during shadow creationChristian Borntraeger
[ Upstream commit fe752331d4b361d43cfd0b89534b4b2176057c32 ]

Right now it is possible to see gmap->private being zero in kvm_s390_vsie_gmap_notifier, resulting in a crash. This is due to the fact that we only set gmap->private = vcpu->kvm after creation:

  static int acquire_gmap_shadow(struct kvm_vcpu *vcpu,
                                 struct vsie_page *vsie_page)
  {
          [...]
          gmap = gmap_shadow(vcpu->arch.gmap, asce, edat);
          if (IS_ERR(gmap))
                  return PTR_ERR(gmap);
          gmap->private = vcpu->kvm;

Let children inherit the private field of the parent.

Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: a3508fbe9dc6 ("KVM: s390: vsie: initial support for nested virtualization")
Cc: <stable@vger.kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20231220125317.4258-1-borntraeger@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
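
The fix, in sketch form, inside gmap_shadow() right after the shadow gmap is allocated:

  new->mm = parent->mm;
  new->parent = gmap_get(parent);
  new->private = parent->private; /* inherit instead of assigning later */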
2024-03-15KVM: s390: add stat counter for shadow gmap eventsNico Boehr
[ Upstream commit c3235e2dd6956448a562d6b1112205eeebc8ab43 ] The shadow gmap tracks memory of nested guests (guest-3). In certain scenarios, the shadow gmap needs to be rebuilt, which is a costly operation since it involves a SIE exit into guest-1 for every entry in the respective shadow level. Add kvm stat counters when new shadow structures are created at various levels. Also add a counter gmap_shadow_create when a completely fresh shadow gmap is created as well as a counter gmap_shadow_reuse when an existing gmap is being reused. Note that when several levels are shadowed at once, counters on all affected levels will be increased. Also note that not all page table levels need to be present and an ASCE can directly point to e.g. a segment table. In this case, a new segment table will always be equivalent to a new shadow gmap and hence will be counted as gmap_shadow_create and not as gmap_shadow_segment. Signed-off-by: Nico Boehr <nrb@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20231009093304.2555344-2-nrb@linux.ibm.com Message-Id: <20231009093304.2555344-2-nrb@linux.ibm.com> Stable-dep-of: fe752331d4b3 ("KVM: s390: vsie: fix race during shadow creation") Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01s390: use the correct count for __iowrite64_copy()Jason Gunthorpe
[ Upstream commit 723a2cc8d69d4342b47dfddbfe6c19f1b135f09b ] The signature for __iowrite64_copy() requires the number of 64 bit quantities, not bytes. Multiply by 8 to get to a byte length before invoking zpci_memcpy_toio(). Fixes: 87bc359b9822 ("s390/pci: speed up __iowrite64_copy by using pci store block insn") Acked-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/0-v1-9223d11a7662+1d7785-s390_iowrite64_jgg@nvidia.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
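
The fixed override, in sketch form:

  /* count is the number of 64-bit quantities, not bytes */
  void __iowrite64_copy(void __iomem *to, const void *from, size_t count)
  {
          zpci_memcpy_toio(to, from, count * 8);
  }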
2024-02-23work around gcc bugs with 'asm goto' with outputsLinus Torvalds
commit 4356e9f841f7fbb945521cef3577ba394c65f3fc upstream.

We've had issues with gcc and 'asm goto' before, and we created an 'asm_volatile_goto()' macro for that in the past: see commits 3f0116c3238a ("compiler/gcc4: Add quirk for 'asm goto' miscompilation bug") and a9f180345f53 ("compiler/gcc4: Make quirk for asm_volatile_goto() unconditional").

Then, much later, we ended up removing the workaround in commit 43c249ea0b1e ("compiler-gcc.h: remove ancient workaround for gcc PR 58670") because we no longer supported building the kernel with the affected gcc versions, but we left the macro uses around.

Now, Sean Christopherson reports a new version of a very similar problem, which is fixed by re-applying that ancient workaround. But the problem in question is limited to only the 'asm goto with outputs' cases, so instead of re-introducing the old workaround as-is, let's rename and limit the workaround to just that much less common case.

It looks like there are at least two separate issues that all hit in this area:

(a) some versions of gcc don't mark the asm goto as 'volatile' when it has outputs:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98619
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110420

which is easy to work around by just adding the 'volatile' by hand.

(b) Internal compiler errors:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110422

which are worked around by adding the extra empty 'asm' as a barrier, as in the original workaround.

But the problem Sean sees may be a third thing, since it involves bad code generation (not an ICE) even with the manually added 'volatile'. The same old workaround works for this case, even if this feels a bit like voodoo programming and may only be hiding the issue.

Reported-and-tested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/all/20240208220604.140859-1-seanjc@google.com/
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Andrew Pinski <quic_apinski@quicinc.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
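
The restored workaround, now limited to the outputs case (essentially the macro the commit adds to compiler-gcc.h):

  #define asm_goto_output(x...) \
          do { asm volatile goto(x); asm (""); } while (0)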
2024-02-05KVM: s390: fix setting of fpc registerHeiko Carstens
[ Upstream commit b988b1bb0053c0dcd26187d29ef07566a565cf55 ] kvm_arch_vcpu_ioctl_set_fpu() allows to set the floating point control (fpc) register of a guest cpu. The new value is tested for validity by temporarily loading it into the fpc register. This may lead to corruption of the fpc register of the host process: if an interrupt happens while the value is temporarily loaded into the fpc register, and within interrupt context floating point or vector registers are used, the current fp/vx registers are saved with save_fpu_regs() assuming they belong to user space and will be loaded into fp/vx registers when returning to user space. test_fp_ctl() restores the original user space / host process fpc register value, however it will be discarded, when returning to user space. In result the host process will incorrectly continue to run with the value that was supposed to be used for a guest cpu. Fix this by simply removing the test. There is another test right before the SIE context is entered which handles invalid values. This results in a change of behaviour: invalid values will now be accepted instead of that the ioctl fails with -EINVAL. This seems to be acceptable, given that this interface is most likely not used anymore, and this is in addition the same behaviour implemented with the memory mapped interface (replace invalid values with zero) - see sync_regs() in kvm-s390.c. Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-02-05s390/ptrace: handle setting of fpc register correctlyHeiko Carstens
[ Upstream commit 8b13601d19c541158a6e18b278c00ba69ae37829 ] If the content of the floating point control (fpc) register of a traced process is modified with the ptrace interface the new value is tested for validity by temporarily loading it into the fpc register. This may lead to corruption of the fpc register of the tracing process: if an interrupt happens while the value is temporarily loaded into the fpc register, and within interrupt context floating point or vector registers are used, the current fp/vx registers are saved with save_fpu_regs() assuming they belong to user space and will be loaded into fp/vx registers when returning to user space. test_fp_ctl() restores the original user space fpc register value, however it will be discarded, when returning to user space. In result the tracer will incorrectly continue to run with the value that was supposed to be used for the traced process. Fix this by saving fpu register contents with save_fpu_regs() before using test_fp_ctl(). Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
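
The fix, sketched (new_fpc stands for the value written via ptrace):

  save_fpu_regs();        /* flush live fp/vx state to the thread struct */
  if (test_fp_ctl(new_fpc))
          return -EINVAL; /* the fpc value test_fp_ctl() loaded temporarily
                             will now be restored from the saved state */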
2024-02-05arch: consolidate arch_irq_work_raise prototypesArnd Bergmann
[ Upstream commit 64bac5ea17d527872121adddfee869c7a0618f8f ] The prototype was hidden in an #ifdef on x86, which causes a warning: kernel/irq_work.c:72:13: error: no previous prototype for 'arch_irq_work_raise' [-Werror=missing-prototypes] Some architectures have a working prototype, while others don't. Fix this by providing it in only one place that is always visible. Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Acked-by: Guo Ren <guoren@kernel.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
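
The consolidated declaration, as it now lives in the shared header:

  /* include/linux/irq_work.h */
  void arch_irq_work_raise(void);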
2024-02-05s390/boot: always align vmalloc area on segment boundaryAlexander Gordeev
[ Upstream commit 65f8780e2d70257200547b5a7654974aa7c37ce1 ]

The size of the vmalloc area depends on various factors at boot and could be set to:

1. The default size, as determined by the VMALLOC_DEFAULT_SIZE macro;
2. One half of the virtual address space not occupied by modules and fixed mappings;
3. The size provided by the user with the vmalloc= kernel command line parameter.

In cases [1] and [2] the vmalloc area base address is aligned on a Region3 table type boundary, while in case [3] it might get aligned on a page boundary.

Limit the waste of page tables and always align the vmalloc area size and base address on a segment boundary.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
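
Segment alignment (1 MiB on s390) in a self-contained sketch:

  #include <stdio.h>

  #define SEGMENT_SIZE (1UL << 20)        /* s390 segment size */

  int main(void)
  {
          unsigned long end = 0x1d0000000000UL;   /* illustrative */
          unsigned long size = 0x7fe5000UL;       /* e.g. from vmalloc= */

          size = (size + SEGMENT_SIZE - 1) & ~(SEGMENT_SIZE - 1); /* round up */
          printf("vmalloc area: 0x%lx - 0x%lx\n",
                 (end - size) & ~(SEGMENT_SIZE - 1), end);        /* base rounded down */
          return 0;
  }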