summaryrefslogtreecommitdiff
path: root/arch/riscv/kvm
AgeCommit message (Collapse)Author
2024-10-28RISC-V: KVM: Break down the __kvm_riscv_switch_to() into macrosAnup Patel
Break down the __kvm_riscv_switch_to() function into macros so that these macros can be later re-used by SBI NACL extension based low-level switch function. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20241020194734.58686-5-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-10-28RISC-V: KVM: Save/restore SCOUNTEREN in C sourceAnup Patel
The SCOUNTEREN CSR need not be saved/restored in the low-level __kvm_riscv_switch_to() function hence move the SCOUNTEREN CSR save/restore to the kvm_riscv_vcpu_swap_in_guest_state() and kvm_riscv_vcpu_swap_in_host_state() functions in C sources. Also, re-arrange the CSR save/restore and related GPR usage in the low-level __kvm_riscv_switch_to() low-level function. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20241020194734.58686-4-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-10-28RISC-V: KVM: Save/restore HSTATUS in C sourceAnup Patel
We will be optimizing HSTATUS CSR access via shared memory setup using the SBI nested acceleration extension. To facilitate this, we first move HSTATUS save/restore in kvm_riscv_vcpu_enter_exit(). Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20241020194734.58686-3-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-10-28RISC-V: KVM: Order the object files alphabeticallyAnup Patel
Order the object files alphabetically in the Makefile so that it is very predictable inserting new object files in the future. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20241020194734.58686-2-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-10-28riscv: KVM: add basic support for host vs guest profilingQuan Zhou
For the information collected on the host side, we need to identify which data originates from the guest and record these events separately, this can be achieved by having KVM register perf callbacks. Signed-off-by: Quan Zhou <zhouquan@iscas.ac.cn> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/00342d535311eb0629b9ba4f1e457a48e2abee33.1728957131.git.zhouquan@iscas.ac.cn Signed-off-by: Anup Patel <anup@brainfault.org>
2024-10-25KVM: RISC-V: Use kvm_faultin_pfn() when mapping pfns into the guestSean Christopherson
Convert RISC-V to __kvm_faultin_pfn()+kvm_release_faultin_page(), which are new APIs to consolidate arch code and provide consistent behavior across all KVM architectures. Opportunisticaly fix a s/priort/prior typo in the related comment. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Anup Patel <anup@brainfault.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-60-seanjc@google.com>
2024-10-25KVM: RISC-V: Mark "struct page" pfns accessed before dropping mmu_lockSean Christopherson
Mark pages accessed before dropping mmu_lock when faulting in guest memory so that RISC-V can convert to kvm_release_faultin_page() without tripping its lockdep assertion on mmu_lock being held. Marking pages accessed outside of mmu_lock is ok (not great, but safe), but marking pages _dirty_ outside of mmu_lock can make filesystems unhappy (see the link below). Do both under mmu_lock to minimize the chances of doing the wrong thing in the future. Link: https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Anup Patel <anup@brainfault.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-59-seanjc@google.com>
2024-10-25KVM: RISC-V: Mark "struct page" pfns dirty iff a stage-2 PTE is installedSean Christopherson
Don't mark pages dirty if KVM bails from the page fault handler without installing a stage-2 mapping, i.e. if the page is guaranteed to not be written by the guest. In addition to being a (very) minor fix, this paves the way for converting RISC-V to use kvm_release_faultin_page(). Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Anup Patel <anup@brainfault.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-58-seanjc@google.com>
2024-10-24RISC-V: KVM: Allow Smnpm and Ssnpm extensions for guestsSamuel Holland
The interface for controlling pointer masking in VS-mode is henvcfg.PMM, which is part of the Ssnpm extension, even though pointer masking in HS-mode is provided by the Smnpm extension. As a result, emulating Smnpm in the guest requires (only) Ssnpm on the host. The guest configures Smnpm through the SBI Firmware Features extension, which KVM does not yet implement, so currently the ISA extension has no visible effect on the guest, and thus it cannot be disabled. Ssnpm is configured using the senvcfg CSR within the guest, so that extension cannot be hidden from the guest without intercepting writes to the CSR. Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20241016202814.4061541-10-samuel.holland@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-10-20RISCV: KVM: use raw_spinlock for critical section in imsicCyan Yang
For the external interrupt updating procedure in imsic, there was a spinlock to protect it already. But since it should not be preempted in any cases, we should turn to use raw_spinlock to prevent any preemption in case PREEMPT_RT was enabled. Signed-off-by: Cyan Yang <cyan.yang@sifive.com> Reviewed-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Message-ID: <20240919160126.44487-1-cyan.yang@sifive.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-17Merge branch 'kvm-redo-enable-virt' into HEADPaolo Bonzini
Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed. The primary motivation for this series is to simplify dealing with enabling virtualization for Intel's TDX, which needs to enable virtualization when kvm-intel.ko is loaded, i.e. long before the first VM is created. That said, this is a nice cleanup on its own. By registering the callbacks on-demand, the callbacks themselves don't need to check kvm_usage_count, because their very existence implies a non-zero count. Patch 1 (re)adds a dedicated lock for kvm_usage_count. This avoids a lock ordering issue between cpus_read_lock() and kvm_lock. The lock ordering issue still exist in very rare cases, and will be fixed for good by switching vm_list to an (S)RCU-protected list. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04KVM: Rename arch hooks related to per-CPU virtualization enablingSean Christopherson
Rename the per-CPU hooks used to enable virtualization in hardware to align with the KVM-wide helpers in kvm_main.c, and to better capture that the callbacks are invoked on every online CPU. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240830043600.127750-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-19RISC-V: KVM: Don't zero-out PMU snapshot area before freeing dataAnup Patel
With the latest Linux-6.11-rc3, the below NULL pointer crash is observed when SBI PMU snapshot is enabled for the guest and the guest is forcefully powered-off. Unable to handle kernel NULL pointer dereference at virtual address 0000000000000508 Oops [#1] Modules linked in: kvm CPU: 0 UID: 0 PID: 61 Comm: term-poll Not tainted 6.11.0-rc3-00018-g44d7178dd77a #3 Hardware name: riscv-virtio,qemu (DT) epc : __kvm_write_guest_page+0x94/0xa6 [kvm] ra : __kvm_write_guest_page+0x54/0xa6 [kvm] epc : ffffffff01590e98 ra : ffffffff01590e58 sp : ffff8f80001f39b0 gp : ffffffff81512a60 tp : ffffaf80024872c0 t0 : ffffaf800247e000 t1 : 00000000000007e0 t2 : 0000000000000000 s0 : ffff8f80001f39f0 s1 : 00007fff89ac4000 a0 : ffffffff015dd7e8 a1 : 0000000000000086 a2 : 0000000000000000 a3 : ffffaf8000000000 a4 : ffffaf80024882c0 a5 : 0000000000000000 a6 : ffffaf800328d780 a7 : 00000000000001cc s2 : ffffaf800197bd00 s3 : 00000000000828c4 s4 : ffffaf800248c000 s5 : ffffaf800247d000 s6 : 0000000000001000 s7 : 0000000000001000 s8 : 0000000000000000 s9 : 00007fff861fd500 s10: 0000000000000001 s11: 0000000000800000 t3 : 00000000000004d3 t4 : 00000000000004d3 t5 : ffffffff814126e0 t6 : ffffffff81412700 status: 0000000200000120 badaddr: 0000000000000508 cause: 000000000000000d [<ffffffff01590e98>] __kvm_write_guest_page+0x94/0xa6 [kvm] [<ffffffff015943a6>] kvm_vcpu_write_guest+0x56/0x90 [kvm] [<ffffffff015a175c>] kvm_pmu_clear_snapshot_area+0x42/0x7e [kvm] [<ffffffff015a1972>] kvm_riscv_vcpu_pmu_deinit.part.0+0xe0/0x14e [kvm] [<ffffffff015a2ad0>] kvm_riscv_vcpu_pmu_deinit+0x1a/0x24 [kvm] [<ffffffff0159b344>] kvm_arch_vcpu_destroy+0x28/0x4c [kvm] [<ffffffff0158e420>] kvm_destroy_vcpus+0x5a/0xda [kvm] [<ffffffff0159930c>] kvm_arch_destroy_vm+0x14/0x28 [kvm] [<ffffffff01593260>] kvm_destroy_vm+0x168/0x2a0 [kvm] [<ffffffff015933d4>] kvm_put_kvm+0x3c/0x58 [kvm] [<ffffffff01593412>] kvm_vm_release+0x22/0x2e [kvm] Clearly, the kvm_vcpu_write_guest() function is crashing because it is being called from kvm_pmu_clear_snapshot_area() upon guest tear down. To address the above issue, simplify the kvm_pmu_clear_snapshot_area() to not zero-out PMU snapshot area from kvm_pmu_clear_snapshot_area() because the guest is anyway being tore down. The kvm_pmu_clear_snapshot_area() is also called when guest changes PMU snapshot area of a VCPU but even in this case the previous PMU snaphsot area must not be zeroed-out because the guest might have reclaimed the pervious PMU snapshot area for some other purpose. Fixes: c2f41ddbcdd7 ("RISC-V: KVM: Implement SBI PMU Snapshot feature") Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20240815170907.2792229-1-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-08-19RISC-V: KVM: Fix sbiret init before forwarding to userspaceAndrew Jones
When forwarding SBI calls to userspace ensure sbiret.error is initialized to SBI_ERR_NOT_SUPPORTED first, in case userspace neglects to set it to anything. If userspace neglects it then we can't be sure it did anything else either, so we just report it didn't do or try anything. Just init sbiret.value to zero, which is the preferred value to return when nothing special is specified. KVM was already initializing both sbiret.error and sbiret.value, but the values used appear to come from a copy+paste of the __sbi_ecall() implementation, i.e. a0 and a1, which don't apply prior to the call being executed, nor at all when forwarding to userspace. Fixes: dea8ee31a039 ("RISC-V: KVM: Add SBI v0.1 support") Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20240807154943.150540-2-ajones@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-07-20Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM: - Initial infrastructure for shadow stage-2 MMUs, as part of nested virtualization enablement - Support for userspace changes to the guest CTR_EL0 value, enabling (in part) migration of VMs between heterogenous hardware - Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of the protocol - FPSIMD/SVE support for nested, including merged trap configuration and exception routing - New command-line parameter to control the WFx trap behavior under KVM - Introduce kCFI hardening in the EL2 hypervisor - Fixes + cleanups for handling presence/absence of FEAT_TCRX - Miscellaneous fixes + documentation updates LoongArch: - Add paravirt steal time support - Add support for KVM_DIRTY_LOG_INITIALLY_SET - Add perf kvm-stat support for loongarch RISC-V: - Redirect AMO load/store access fault traps to guest - perf kvm stat support - Use guest files for IMSIC virtualization, when available s390: - Assortment of tiny fixes which are not time critical x86: - Fixes for Xen emulation - Add a global struct to consolidate tracking of host values, e.g. EFER - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX - Print the name of the APICv/AVIC inhibits in the relevant tracepoint - Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor - Drop MTRR virtualization, and instead always honor guest PAT on CPUs that support self-snoop - Update to the newfangled Intel CPU FMS infrastructure - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored - Misc cleanups x86 - MMU: - Small cleanups, renames and refactoring extracted from the upcoming Intel TDX support - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't hold leafs SPTEs - Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager page splitting, to avoid stalling vCPUs when splitting huge pages - Bug the VM instead of simply warning if KVM tries to split a SPTE that is non-present or not-huge. KVM is guaranteed to end up in a broken state because the callers fully expect a valid SPTE, it's all but dangerous to let more MMU changes happen afterwards x86 - AMD: - Make per-CPU save_area allocations NUMA-aware - Force sev_es_host_save_area() to be inlined to avoid calling into an instrumentable function from noinstr code - Base support for running SEV-SNP guests. API-wise, this includes a new KVM_X86_SNP_VM type, encrypting/measure the initial image into guest memory, and finalizing it before launching it. Internally, there are some gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges This includes basic support for attestation guest requests, enough to say that KVM supports the GHCB 2.0 specification There is no support yet for loading into the firmware those signing keys to be used for attestation requests, and therefore no need yet for the host to provide certificate data for those keys. To support fetching certificate data from userspace, a new KVM exit type will be needed to handle fetching the certificate from userspace. An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS exit type to handle this was introduced in v1 of this patchset, but is still being discussed by community, so for now this patchset only implements a stub version of SNP Extended Guest Requests that does not provide certificate data x86 - Intel: - Remove an unnecessary EPT TLB flush when enabling hardware - Fix a series of bugs that cause KVM to fail to detect nested pending posted interrupts as valid wake eents for a vCPU executing HLT in L2 (with HLT-exiting disable by L1) - KVM: x86: Suppress MMIO that is triggered during task switch emulation Explicitly suppress userspace emulated MMIO exits that are triggered when emulating a task switch as KVM doesn't support userspace MMIO during complex (multi-step) emulation Silently ignoring the exit request can result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace for some other reason prior to purging mmio_needed See commit 0dc902267cb3 ("KVM: x86: Suppress pending MMIO write exits if emulator detects exception") for more details on KVM's limitations with respect to emulated MMIO during complex emulator flows Generic: - Rename the AS_UNMOVABLE flag that was introduced for KVM to AS_INACCESSIBLE, because the special casing needed by these pages is not due to just unmovability (and in fact they are only unmovable because the CPU cannot access them) - New ioctl to populate the KVM page tables in advance, which is useful to mitigate KVM page faults during guest boot or after live migration. The code will also be used by TDX, but (probably) not through the ioctl - Enable halt poll shrinking by default, as Intel found it to be a clear win - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86 - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out() - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout Selftests: - Remove dead code in the memslot modification stress test - Treat "branch instructions retired" as supported on all AMD Family 17h+ CPUs - Print the guest pseudo-RNG seed only when it changes, to avoid spamming the log for tests that create lots of VMs - Make the PMU counters test less flaky when counting LLC cache misses by doing CLFLUSH{OPT} in every loop iteration" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits) crypto: ccp: Add the SNP_VLEK_LOAD command KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops KVM: x86: Replace static_call_cond() with static_call() KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event x86/sev: Move sev_guest.h into common SEV header KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event KVM: x86: Suppress MMIO that is triggered during task switch emulation KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory() KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault" KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory KVM: Document KVM_PRE_FAULT_MEMORY ioctl mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE perf kvm: Add kvm-stat for loongarch64 LoongArch: KVM: Add PV steal time support in guest side ...
2024-07-20Merge tag 'riscv-for-linus-6.11-mw1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for various new ISA extensions: * The Zve32[xf] and Zve64[xfd] sub-extensios of the vector extension * Zimop and Zcmop for may-be-operations * The Zca, Zcf, Zcd and Zcb sub-extensions of the C extension * Zawrs - riscv,cpu-intc is now dtschema - A handful of performance improvements and cleanups to text patching - Support for memory hot{,un}plug - The highest user-allocatable virtual address is now visible in hwprobe * tag 'riscv-for-linus-6.11-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (58 commits) riscv: lib: relax assembly constraints in hweight riscv: set trap vector earlier KVM: riscv: selftests: Add Zawrs extension to get-reg-list test KVM: riscv: Support guest wrs.nto riscv: hwprobe: export Zawrs ISA extension riscv: Add Zawrs support for spinlocks dt-bindings: riscv: Add Zawrs ISA extension description riscv: Provide a definition for 'pause' riscv: hwprobe: export highest virtual userspace address riscv: Improve sbi_ecall() code generation by reordering arguments riscv: Add tracepoints for SBI calls and returns riscv: Optimize crc32 with Zbc extension riscv: Enable DAX VMEMMAP optimization riscv: mm: Add support for ZONE_DEVICE virtio-mem: Enable virtio-mem for RISC-V riscv: Enable memory hotplugging for RISC-V riscv: mm: Take memory hotplug read-lock during kernel page table dump riscv: mm: Add memory hotplugging support riscv: mm: Add pfn_to_kaddr() implementation riscv: mm: Refactor create_linear_mapping_range() for memory hot add ...
2024-07-16Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups
2024-07-12Merge patch series "riscv: Apply Zawrs when available"Palmer Dabbelt
Andrew Jones <ajones@ventanamicro.com> says: Zawrs provides two instructions (wrs.nto and wrs.sto), where both are meant to allow the hart to enter a low-power state while waiting on a store to a memory location. The instructions also both wait an implementation-defined "short" duration (unless the implementation terminates the stall for another reason). The difference is that while wrs.sto will terminate when the duration elapses, wrs.nto, depending on configuration, will either just keep waiting or an ILL exception will be raised. Linux will use wrs.nto, so if platforms have an implementation which falls in the "just keep waiting" category (which is not expected), then it should _not_ advertise Zawrs in the hardware description. Like wfi (and with the same {m,h}status bits to configure it), when wrs.nto is configured to raise exceptions it's expected that the higher privilege level will see the instruction was a wait instruction, do something, and then resume execution following the instruction. For example, KVM does configure exceptions for wfi (hstatus.VTW=1) and therefore also for wrs.nto. KVM does this for wfi since it's better to allow other tasks to be scheduled while a VCPU waits for an interrupt. For waits such as those where wrs.nto/sto would be used, which are typically locks, it is also a good idea for KVM to be involved, as it can attempt to schedule the lock holding VCPU. This series starts with Christoph's addition of the riscv smp_cond_load_relaxed function which applies wrs.sto when available. That patch has been reworked to use wrs.nto and to use the same approach as Arm for the wait loop, since we can't have arbitrary C code between the load-reserved and the wrs. Then, hwprobe support is added (since the instructions are also usable from usermode), and finally KVM is taught about wrs.nto, allowing guests to see and use the Zawrs extension. We still don't have test results from hardware, and it's not possible to prove that using Zawrs is a win when testing on QEMU, not even when oversubscribing VCPUs to guests. However, it is possible to use KVM selftests to force a scenario where we can prove Zawrs does its job and does it well. [4] is a test which does this and, on my machine, without Zawrs it takes 16 seconds to complete and with Zawrs it takes 0.25 seconds. This series is also available here [1]. In order to use QEMU for testing a build with [2] is needed. In order to enable guests to use Zawrs with KVM using kvmtool, the branch at [3] may be used. [1] https://github.com/jones-drew/linux/commits/riscv/zawrs-v3/ [2] https://lore.kernel.org/all/20240312152901.512001-2-ajones@ventanamicro.com/ [3] https://github.com/jones-drew/kvmtool/commits/riscv/zawrs/ [4] https://github.com/jones-drew/linux/commit/cb2beccebcece10881db842ed69bdd5715cfab5d Link: https://lore.kernel.org/r/20240426100820.14762-8-ajones@ventanamicro.com * b4-shazam-merge: KVM: riscv: selftests: Add Zawrs extension to get-reg-list test KVM: riscv: Support guest wrs.nto riscv: hwprobe: export Zawrs ISA extension riscv: Add Zawrs support for spinlocks dt-bindings: riscv: Add Zawrs ISA extension description riscv: Provide a definition for 'pause' Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-12KVM: riscv: Support guest wrs.ntoAndrew Jones
When a guest traps on wrs.nto, call kvm_vcpu_on_spin() to attempt to yield to the lock holding VCPU. Also extend the KVM ISA extension ONE_REG interface to allow KVM userspace to detect and enable the Zawrs extension for the Guest/VM. Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Acked-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240426100820.14762-13-ajones@ventanamicro.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-12Merge tag 'loongarch-kvm-6.11' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.11 1. Add ParaVirt steal time support. 2. Add some VM migration enhancement. 3. Add perf kvm-stat support for loongarch.
2024-07-03Merge patch series "Assorted fixes in RISC-V PMU driver"Palmer Dabbelt
Atish Patra <atishp@rivosinc.com> says: This series contains 3 fixes out of which the first one is a new fix for invalid event data reported in lkml[2]. The last two are v3 of Samuel's patch[1]. I added the RB/TB/Fixes tag and moved 1 unrelated change to its own patch. I also changed an error message in kvm vcpu_pmu from pr_err to pr_debug to avoid redundant failure error messages generated due to the boot time quering of events implemented in the patch[1] Here is the original cover letter for the patch[1] Before this patch: $ perf list hw List of pre-defined events (to be used in -e or -M): branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] cache-misses [Hardware event] cache-references [Hardware event] cpu-cycles OR cycles [Hardware event] instructions [Hardware event] ref-cycles [Hardware event] stalled-cycles-backend OR idle-cycles-backend [Hardware event] stalled-cycles-frontend OR idle-cycles-frontend [Hardware event] $ perf stat -ddd true Performance counter stats for 'true': 4.36 msec task-clock # 0.744 CPUs utilized 1 context-switches # 229.325 /sec 0 cpu-migrations # 0.000 /sec 38 page-faults # 8.714 K/sec 4,375,694 cycles # 1.003 GHz (60.64%) 728,945 instructions # 0.17 insn per cycle 79,199 branches # 18.162 M/sec 17,709 branch-misses # 22.36% of all branches 181,734 L1-dcache-loads # 41.676 M/sec 5,547 L1-dcache-load-misses # 3.05% of all L1-dcache accesses <not counted> LLC-loads (0.00%) <not counted> LLC-load-misses (0.00%) <not counted> L1-icache-loads (0.00%) <not counted> L1-icache-load-misses (0.00%) <not counted> dTLB-loads (0.00%) <not counted> dTLB-load-misses (0.00%) <not counted> iTLB-loads (0.00%) <not counted> iTLB-load-misses (0.00%) <not counted> L1-dcache-prefetches (0.00%) <not counted> L1-dcache-prefetch-misses (0.00%) 0.005860375 seconds time elapsed 0.000000000 seconds user 0.010383000 seconds sys After this patch: $ perf list hw List of pre-defined events (to be used in -e or -M): branch-instructions OR branches [Hardware event] branch-misses [Hardware event] cache-misses [Hardware event] cache-references [Hardware event] cpu-cycles OR cycles [Hardware event] instructions [Hardware event] $ perf stat -ddd true Performance counter stats for 'true': 5.16 msec task-clock # 0.848 CPUs utilized 1 context-switches # 193.817 /sec 0 cpu-migrations # 0.000 /sec 37 page-faults # 7.171 K/sec 5,183,625 cycles # 1.005 GHz 961,696 instructions # 0.19 insn per cycle 85,853 branches # 16.640 M/sec 20,462 branch-misses # 23.83% of all branches 243,545 L1-dcache-loads # 47.203 M/sec 5,974 L1-dcache-load-misses # 2.45% of all L1-dcache accesses <not supported> LLC-loads <not supported> LLC-load-misses <not supported> L1-icache-loads <not supported> L1-icache-load-misses <not supported> dTLB-loads 19,619 dTLB-load-misses <not supported> iTLB-loads 6,831 iTLB-load-misses <not supported> L1-dcache-prefetches <not supported> L1-dcache-prefetch-misses 0.006085625 seconds time elapsed 0.000000000 seconds user 0.013022000 seconds sys [1] https://lore.kernel.org/linux-riscv/20240418014652.1143466-1-samuel.holland@sifive.com/ [2] https://lore.kernel.org/all/CC51D53B-846C-4D81-86FC-FBF969D0A0D6@pku.edu.cn/ * b4-shazam-merge: perf: RISC-V: Check standard event availability drivers/perf: riscv: Reset the counter to hpmevent mapping while starting cpus drivers/perf: riscv: Do not update the event data if uptodate Link: https://lore.kernel.org/r/20240628-misc_perf_fixes-v4-0-e01cfddcf035@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-03perf: RISC-V: Check standard event availabilitySamuel Holland
The RISC-V SBI PMU specification defines several standard hardware and cache events. Currently, all of these events are exposed to userspace, even when not actually implemented. They appear in the `perf list` output, and commands like `perf stat` try to use them. This is more than just a cosmetic issue, because the PMU driver's .add function fails for these events, which causes pmu_groups_sched_in() to prematurely stop scheduling in other (possibly valid) hardware events. Add logic to check which events are supported by the hardware (i.e. can be mapped to some counter), so only usable events are reported to userspace. Since the kernel does not know the mapping between events and possible counters, this check must happen during boot, when no counters are in use. Make the check asynchronous to minimize impact on boot time. Fixes: e9991434596f ("RISC-V: Add perf platform driver based on SBI PMU extension") Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240628-misc_perf_fixes-v4-3-e01cfddcf035@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26RISC-V: KVM: Allow Zcmop extension for Guest/VMClément Léger
Extend the KVM ISA extension ONE_REG interface to allow KVM user space to detect and enable Zcmop extension for Guest/VM. Signed-off-by: Clément Léger <cleger@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Acked-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240619113529.676940-16-cleger@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26RISC-V: KVM: Allow Zca, Zcf, Zcd and Zcb extensions for Guest/VMClément Léger
Extend the KVM ISA extension ONE_REG interface to allow KVM user space to detect and enable Zca, Zcf, Zcd and Zcb extensions for Guest/VM. Signed-off-by: Clément Léger <cleger@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Acked-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240619113529.676940-11-cleger@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26RISC-V: KVM: Allow Zimop extension for Guest/VMClément Léger
Extend the KVM ISA extension ONE_REG interface to allow KVM user space to detect and enable Zimop extension for Guest/VM. Signed-off-by: Clément Léger <cleger@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Acked-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240619113529.676940-5-cleger@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-06-26RISC-V: KVM: Redirect AMO load/store access fault traps to guestYu-Wei Hsu
The KVM RISC-V does not delegate AMO load/store access fault traps to VS-mode (hedeleg) so typically M-mode takes these traps and redirects them back to HS-mode. However, upon returning from M-mode, the KVM RISC-V running in HS-mode terminates VS-mode software. The KVM RISC-V should redirect AMO load/store access fault traps back to VS-mode and let the VS-mode trap handler determine the next steps. Signed-off-by: Yu-Wei Hsu <betterman5240@gmail.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240429092113.70695-1-betterman5240@gmail.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-06-26RISCV: KVM: add tracepoints for entry and exit eventsShenlin Liang
Like other architectures, RISCV KVM also needs to add these event tracepoints to count the number of times kvm guest entry/exit. Signed-off-by: Shenlin Liang <liangshenlin@eswincomputing.com> Reviewed-by: Anup Patel <anup@brainfault.org> Tested-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240422080833.8745-2-liangshenlin@eswincomputing.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-06-26RISC-V: KVM: Use IMSIC guest files when availableAnup Patel
Let us discover and use IMSIC guest files from the IMSIC global config provided by the IMSIC irqchip driver. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20240411090639.237119-3-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-06-26RISC-V: KVM: Share APLIC and IMSIC defines with irqchip driversAnup Patel
We have common APLIC and IMSIC headers available under include/linux/irqchip/ directory which are used by APLIC and IMSIC irqchip drivers. Let us replace the use of kvm_aia_*.h headers with include/linux/irqchip/riscv-*.h headers. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20240411090639.237119-2-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-06-18KVM: Introduce vcpu->wants_to_runDavid Matlack
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit was not set. Replace all references to vcpu->run->immediate_exit with !vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a malicious userspace could invoked KVM_RUN with immediate_exit=true and then after KVM reads it to set wants_to_run=false, flip it to false. This would result in the vCPU running in KVM_RUN with wants_to_run=false. This wouldn't cause any real bugs today but is a dangerous landmine. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-31RISC-V: KVM: Fix incorrect reg_subtype labels in ↵Quan Zhou
kvm_riscv_vcpu_set_reg_isa_ext function In the function kvm_riscv_vcpu_set_reg_isa_ext, the original code used incorrect reg_subtype labels KVM_REG_RISCV_SBI_MULTI_EN/DIS. These have been corrected to KVM_REG_RISCV_ISA_MULTI_EN/DIS respectively. Although they are numerically equivalent, the actual processing will not result in errors, but it may lead to ambiguous code semantics. Fixes: 613029442a4b ("RISC-V: KVM: Extend ONE_REG to enable/disable multiple ISA extensions") Signed-off-by: Quan Zhou <zhouquan@iscas.ac.cn> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/ff1c6771a67d660db94372ac9aaa40f51e5e0090.1716429371.git.zhouquan@iscas.ac.cn Signed-off-by: Anup Patel <anup@brainfault.org>
2024-05-31RISC-V: KVM: No need to use mask when hart-index-bit is 0Yong-Xuan Wang
When the maximum hart number within groups is 1, hart-index-bit is set to 0. Consequently, there is no need to restore the hart ID from IMSIC addresses and hart-index-bit settings. Currently, QEMU and kvmtool do not pass correct hart-index-bit values when the maximum hart number is a power of 2, thereby avoiding this issue. Corresponding patches for QEMU and kvmtool will also be dispatched. Fixes: 89d01306e34d ("RISC-V: KVM: Implement device interface for AIA irqchip") Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20240415064905.25184-1-yongxuan.wang@sifive.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-05-18Merge tag 'kbuild-v6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Avoid 'constexpr', which is a keyword in C23 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of 'dt_binding_check' - Fix weak references to avoid GOT entries in position-independent code generation - Convert the last use of 'optional' property in arch/sh/Kconfig - Remove support for the 'optional' property in Kconfig - Remove support for Clang's ThinLTO caching, which does not work with the .incbin directive - Change the semantics of $(src) so it always points to the source directory, which fixes Makefile inconsistencies between upstream and downstream - Fix 'make tar-pkg' for RISC-V to produce a consistent package - Provide reasonable default coverage for objtool, sanitizers, and profilers - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc. - Remove the last use of tristate choice in drivers/rapidio/Kconfig - Various cleanups and fixes in Kconfig * tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits) kconfig: use sym_get_choice_menu() in sym_check_prop() rapidio: remove choice for enumeration kconfig: lxdialog: remove initialization with A_NORMAL kconfig: m/nconf: merge two item_add_str() calls kconfig: m/nconf: remove dead code to display value of bool choice kconfig: m/nconf: remove dead code to display children of choice members kconfig: gconf: show checkbox for choice correctly kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal Makefile: remove redundant tool coverage variables kbuild: provide reasonable defaults for tool coverage modules: Drop the .export_symbol section from the final modules kconfig: use menu_list_for_each_sym() in sym_check_choice_deps() kconfig: use sym_get_choice_menu() in conf_write_defconfig() kconfig: add sym_get_choice_menu() helper kconfig: turn defaults and additional prompt for choice members into error kconfig: turn missing prompt for choice members into error kconfig: turn conf_choice() into void function kconfig: use linked list in sym_set_changed() kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED kconfig: gconf: remove debug code ...
2024-05-10kbuild: use $(src) instead of $(srctree)/$(src) for source directoryMasahiro Yamada
Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for checked-in source files. It is merely a convention without any functional difference. In fact, $(obj) and $(src) are exactly the same, as defined in scripts/Makefile.build: src := $(obj) When the kernel is built in a separate output directory, $(src) does not accurately reflect the source directory location. While Kbuild resolves this discrepancy by specifying VPATH=$(srctree) to search for source files, it does not cover all cases. For example, when adding a header search path for local headers, -I$(srctree)/$(src) is typically passed to the compiler. This introduces inconsistency between upstream and downstream Makefiles because $(src) is used instead of $(srctree)/$(src) for the latter. To address this inconsistency, this commit changes the semantics of $(src) so that it always points to the directory in the source tree. Going forward, the variables used in Makefiles will have the following meanings: $(obj) - directory in the object tree $(src) - directory in the source tree (changed by this commit) $(objtree) - the top of the kernel object tree $(srctree) - the top of the kernel source tree Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced with $(src). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2024-05-07Merge tag 'kvm-riscv-6.10-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini
KVM/riscv changes for 6.10 - Support guest breakpoints using ebreak - Introduce per-VCPU mp_state_lock and reset_cntx_lock - Virtualize SBI PMU snapshot and counter overflow interrupts - New selftests for SBI PMU and Guest ebreak
2024-04-26RISC-V: KVM: Improve firmware counter read functionAtish Patra
Rename the function to indicate that it is meant for firmware counter read. While at it, add a range sanity check for it as well. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240420151741.962500-17-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: Support 64 bit firmware counters on RV32Atish Patra
The SBI v2.0 introduced a fw_read_hi function to read 64 bit firmware counters for RV32 based systems. Add infrastructure to support that. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-16-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: Add perf sampling support for guestsAtish Patra
KVM enables perf for guest via counter virtualization. However, the sampling can not be supported as there is no mechanism to enabled trap/emulate scountovf in ISA yet. Rely on the SBI PMU snapshot to provide the counter overflow data via the shared memory. In case of sampling event, the host first sets the guest's LCOFI interrupt and injects to the guest via irq filtering mechanism defined in AIA specification. Thus, ssaia must be enabled in the host in order to use perf sampling in the guest. No other AIA dependency w.r.t kernel is required. Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-15-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: Implement SBI PMU Snapshot featureAtish Patra
PMU Snapshot function allows to minimize the number of traps when the guest access configures/access the hpmcounters. If the snapshot feature is enabled, the hypervisor updates the shared memory with counter data and state of overflown counters. The guest can just read the shared memory instead of trap & emulate done by the hypervisor. This patch doesn't implement the counter overflow yet. Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-14-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: No need to exit to the user space if perf event failedAtish Patra
Currently, we return a linux error code if creating a perf event failed in kvm. That shouldn't be necessary as guest can continue to operate without perf profiling or profiling with firmware counters. Return appropriate SBI error code to indicate that PMU configuration failed. An error message in kvm already describes the reason for failure. Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling") Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-13-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: No need to update the counter value during resetAtish Patra
The virtual counter value is updated during pmu_ctr_read. There is no need to update it in reset case. Otherwise, it will be counted twice which is incorrect. Fixes: 0cb74b65d2e5 ("RISC-V: KVM: Implement perf support without sampling") Reviewed-by: Anup Patel <anup@brainfault.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-12-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-26RISC-V: KVM: Fix the initial sample period valueAtish Patra
The initial sample period value when counter value is not assigned should be set to maximum value supported by the counter width. Otherwise, it may result in spurious interrupts. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240420151741.962500-11-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22RISC-V: KVM: Rename the SBI_STA_SHMEM_DISABLE to a generic nameAtish Patra
SBI_STA_SHMEM_DISABLE is a macro to invoke disable shared memory commands. As this can be invoked from other SBI extension context as well, rename it to more generic name as SBI_SHMEM_DISABLE. Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240420151741.962500-7-atishp@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22RISCV: KVM: Introduce vcpu->reset_cntx_lockYong-Xuan Wang
Originally, the use of kvm->lock in SBI_EXT_HSM_HART_START also avoids the simultaneous updates to the reset context of target VCPU. Since this lock has been replace with vcpu->mp_state_lock, and this new lock also protects the vcpu->mp_state. We have to add a separate lock for vcpu->reset_cntx. Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240417074528.16506-3-yongxuan.wang@sifive.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-22RISCV: KVM: Introduce mp_state_lock to avoid lock inversionYong-Xuan Wang
Documentation/virt/kvm/locking.rst advises that kvm->lock should be acquired outside vcpu->mutex and kvm->srcu. However, when KVM/RISC-V handling SBI_EXT_HSM_HART_START, the lock ordering is vcpu->mutex, kvm->srcu then kvm->lock. Although the lockdep checking no longer complains about this after commit f0f44752f5f6 ("rcu: Annotate SRCU's update-side lockdep dependencies"), it's necessary to replace kvm->lock with a new dedicated lock to ensure only one hart can execute the SBI_EXT_HSM_HART_START call for the target hart simultaneously. Additionally, this patch also rename "power_off" to "mp_state" with two possible values. The vcpu->mp_state_lock also protects the access of vcpu->mp_state. Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240417074528.16506-2-yongxuan.wang@sifive.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-11KVM: delete .change_pte MMU notifier callbackPaolo Bonzini
The .change_pte() MMU notifier callback was intended as an optimization. The original point of it was that KSM could tell KVM to flip its secondary PTE to a new location without having to first zap it. At the time there was also an .invalidate_page() callback; both of them were *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), and .invalidate_page() also doubled as a fallback implementation of .change_pte(). Later on, however, both callbacks were changed to occur within an invalidate_range_start/end() block. In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end", 2012-10-09) did so to remove the fallback from .invalidate_page() to .change_pte() and allow sleepable .invalidate_page() hooks. This however made KVM's usage of the .change_pte() callback completely moot, because KVM unmaps the sPTEs during .invalidate_range_start() and therefore .change_pte() has no hope of finding a sPTE to change. Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as well as all the architecture specific implementations. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Bibo Mao <maobibo@loongson.cn> Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-08RISC-V: KVM: Handle breakpoint exits for VCPUChao Du
Exit to userspace for breakpoint traps. Set the exit_reason as KVM_EXIT_DEBUG before exit. Signed-off-by: Chao Du <duchao@eswincomputing.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240402062628.5425-3-duchao@eswincomputing.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-04-08RISC-V: KVM: Implement kvm_arch_vcpu_ioctl_set_guest_debug()Chao Du
kvm_vm_ioctl_check_extension(): Return 1 if KVM_CAP_SET_GUEST_DEBUG is been checked. kvm_arch_vcpu_ioctl_set_guest_debug(): Update the guest_debug flags from userspace accordingly. Route the breakpoint exceptions to HS mode if the VCPU is being debugged by userspace, by clearing the corresponding bit in hedeleg. Initialize the hedeleg configuration in kvm_riscv_vcpu_setup_config(). Write the actual CSR in kvm_arch_vcpu_load(). Signed-off-by: Chao Du <duchao@eswincomputing.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240402062628.5425-2-duchao@eswincomputing.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-03-26RISC-V: KVM: Fix APLIC in_clrip[x] read emulationAnup Patel
The reads to APLIC in_clrip[x] registers returns rectified input values of the interrupt sources. A rectified input value of an interrupt source is defined by the section "4.5.2 Source configurations (sourcecfg[1]–sourcecfg[1023])" of the RISC-V AIA specification as: rectified input value = (incoming wire value) XOR (source is inverted) Update the riscv_aplic_input() implementation to match the above. Cc: stable@vger.kernel.org Fixes: 74967aa208e2 ("RISC-V: KVM: Add in-kernel emulation of AIA APLIC") Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240321085041.1955293-3-apatel@ventanamicro.com
2024-03-25RISC-V: KVM: Fix APLIC setipnum_le/be write emulationAnup Patel
The writes to setipnum_le/be register for APLIC in MSI-mode have special consideration for level-triggered interrupts as-per the section "4.9.2 Special consideration for level-sensitive interrupt sources" of the RISC-V AIA specification. Particularly, the below text from the RISC-V AIA specification defines the behaviour of writes to setipnum_le/be register for level-triggered interrupts: "A second option is for the interrupt service routine to write the APLIC’s source identity number for the interrupt to the domain’s setipnum register just before exiting. This will cause the interrupt’s pending bit to be set to one again if the source is still asserting an interrupt, but not if the source is not asserting an interrupt." Fix setipnum_le/be write emulation for in-kernel APLIC by implementing the above behaviour in aplic_write_pending() function. Cc: stable@vger.kernel.org Fixes: 74967aa208e2 ("RISC-V: KVM: Add in-kernel emulation of AIA APLIC") Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20240321085041.1955293-2-apatel@ventanamicro.com