summaryrefslogtreecommitdiff
path: root/arch/arm64/include/asm/kvm_host.h
AgeCommit message (Collapse)Author
21 hoursMerge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: "A quick summary: perf support for Branch Record Buffer Extensions (BRBE), typical PMU hardware updates, small additions to MTE for store-only tag checking and exposing non-address bits to signal handlers, HAVE_LIVEPATCH enabled on arm64, VMAP_STACK forced on. There is also a TLBI optimisation on hardware that does not require break-before-make when changing the user PTEs between contiguous and non-contiguous. More details: Perf and PMU updates: - Add support for new (v3) Hisilicon SLLC and DDRC PMUs - Add support for Arm-NI PMU integrations that share interrupts between clock domains within a given instance - Allow SPE to be configured with a lower sample period than the minimum recommendation advertised by PMSIDR_EL1.Interval - Add suppport for Arm's "Branch Record Buffer Extension" (BRBE) - Adjust the perf watchdog period according to cpu frequency changes - Minor driver fixes and cleanups Hardware features: - Support for MTE store-only checking (FEAT_MTE_STORE_ONLY) - Support for reporting the non-address bits during a synchronous MTE tag check fault (FEAT_MTE_TAGGED_FAR) - Optimise the TLBI when folding/unfolding contiguous PTEs on hardware with FEAT_BBM (break-before-make) level 2 and no TLB conflict aborts Software features: - Enable HAVE_LIVEPATCH after implementing arch_stack_walk_reliable() and using the text-poke API for late module relocations - Force VMAP_STACK always on and change arm64_efi_rt_init() to use arch_alloc_vmap_stack() in order to avoid KASAN false positives ACPI: - Improve SPCR handling and messaging on systems lacking an SPCR table Debug: - Simplify the debug exception entry path - Drop redundant DBG_MDSCR_* macros Kselftests: - Cleanups and improvements for SME, SVE and FPSIMD tests Miscellaneous: - Optimise loop to reduce redundant operations in contpte_ptep_get() - Remove ISB when resetting POR_EL0 during signal handling - Mark the kernel as tainted on SEA and SError panic - Remove redundant gcs_free() call" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits) arm64/gcs: task_gcs_el0_enable() should use passed task arm64: Kconfig: Keep selects somewhat alphabetically ordered arm64: signal: Remove ISB when resetting POR_EL0 kselftest/arm64: Handle attempts to disable SM on SME only systems kselftest/arm64: Fix SVE write data generation for SME only systems kselftest/arm64: Test SME on SME only systems in fp-ptrace kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace kselftest/arm64: Allow sve-ptrace to run on SME only systems arm64/mm: Drop redundant addr increment in set_huge_pte_at() kselftest/arm4: Provide local defines for AT_HWCAP3 arm64: Mark kernel as tainted on SAE and SError panic arm64/gcs: Don't call gcs_free() when releasing task_struct drivers/perf: hisi: Support PMUs with no interrupt drivers/perf: hisi: Relax the event number check of v2 PMUs drivers/perf: hisi: Add support for HiSilicon SLLC v3 PMU driver drivers/perf: hisi: Use ACPI driver_data to retrieve SLLC PMU information drivers/perf: hisi: Add support for HiSilicon DDRC v3 PMU driver drivers/perf: hisi: Simplify the probe process for each DDRC version perf/arm-ni: Support sharing IRQs within an NI instance perf/arm-ni: Consolidate CPU affinity handling ...
2025-07-08KVM: arm64: nvhe: Disable branch generation in nVHE guestsAnshuman Khandual
While BRBE can record branches within guests, the host recording branches in guests is not supported by perf (though events are). Support for BRBE in guests will supported by providing direct access to BRBE within the guests. That is how x86 LBR works for guests. Therefore, BRBE needs to be disabled on guest entry and restored on exit. For nVHE, this requires explicit handling for guests. Before entering a guest, save the BRBE state and disable the it. When returning to the host, restore the state. For VHE, it is not necessary. We initialize BRBCR_EL1.{E1BRE,E0BRE}=={0,0} at boot time, and HCR_EL2.TGE==1 while running in the host. We configure BRBCR_EL2.{E2BRE,E0HBRE} to enable branch recording in the host. When entering the guest, we set HCR_EL2.TGE==0 which means BRBCR_EL1 is used instead of BRBCR_EL2. Consequently for VHE, BRBE recording is disabled at EL1 and EL0 when running a guest. Should recording in guests (by the host) ever be desired, the perf ABI will need to be extended to distinguish guest addresses (struct perf_branch_entry.priv) for starters. BRBE records would also need to be invalidated on guest entry/exit as guest/host EL1 and EL0 records can't be distinguished. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Co-developed-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Tested-by: James Clark <james.clark@linaro.org> Reviewed-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20250611-arm-brbe-v19-v23-3-e7775563036e@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2025-07-03KVM: arm64: Remove kvm_arch_vcpu_run_map_fp()Mark Rutland
Historically KVM hyp code saved the host's FPSIMD state into the hosts's fpsimd_state memory, and so it was necessary to map this into the hyp Stage-1 mappings before running a vCPU. This is no longer necessary as of commits: * fbc7e61195e2 ("KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state") * 8eca7f6d5100 ("KVM: arm64: Remove host FPSIMD saving for non-protected KVM") Since those commits, we eagerly save the host's FPSIMD state before calling into hyp to run a vCPU, and hyp code never reads nor writes the host's fpsimd_state memory. There's no longer any need to map the host's fpsimd_state memory into the hyp Stage-1, and kvm_arch_vcpu_run_map_fp() is unnecessary but benign. Remove kvm_arch_vcpu_run_map_fp(). Currently there is no code to perform a corresponding unmap, and we never mapped the host's SVE or SME state into the hyp Stage-1, so no other code needs to be removed. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Cc: kvmarm@lists.linux.dev Reviewed-by: Mark Brown <broonie@kernel.org> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20250619134817.4075340-1-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-19KVM: arm64: VHE: Centralize ISBs when returning to hostMark Rutland
The VHE hyp code has recently gained a few ISBs. Simplify this to one unconditional ISB in __kvm_vcpu_run_vhe(), and remove the unnecessary ISB from the kvm_call_hyp_ret() macro. While kvm_call_hyp_ret() is also used to invoke __vgic_v3_get_gic_config(), but no ISB is necessary in that case either. For the moment, an ISB is left in kvm_call_hyp(), as there are many more users, and removing the ISB would require a more thorough audit. Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250617133718.4014181-8-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-11Merge tag 'kvmarm-fixes-6.16-2' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.16, take #2 - Rework of system register accessors for system registers that are directly writen to memory, so that sanitisation of the in-memory value happens at the correct time (after the read, or before the write). For convenience, RMW-style accessors are also provided. - Multiple fixes for the so-called "arch-timer-edge-cases' selftest, which was always broken.
2025-06-05KVM: arm64: Make __vcpu_sys_reg() a pure rvalue operandMarc Zyngier
Now that we don't have any use of __vcpu_sys_reg() as a lvalue, remove the in-place update, and directly return the sanitised value. Reviewed-by: Miguel Luis <miguel.luis@oracle.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250603070824.1192795-5-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-05KVM: arm64: Add RMW specific sysreg accessorMarc Zyngier
In a number of cases, we perform a Read-Modify-Write operation on a system register, meaning that we would apply the RESx masks twice. Instead, provide a new accessor that performs this RMW operation, allowing the masks to be applied exactly once per operation. Reviewed-by: Miguel Luis <miguel.luis@oracle.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250603070824.1192795-3-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-05KVM: arm64: Add assignment-specific sysreg accessorMarc Zyngier
Assigning a value to a system register doesn't do what it is supposed to be doing if that register is one that has RESx bits. The main problem is that we use __vcpu_sys_reg(), which can be used both as a lvalue and rvalue. When used as a lvalue, the bit masking occurs *before* the new value is assigned, meaning that we (1) do pointless work on the old cvalue, and (2) potentially assign an invalid value as we fail to apply the masks to it. Fix this by providing a new __vcpu_assign_sys_reg() that does what it says on the tin, and sanitises the *new* value instead of the old one. This comes with a significant amount of churn. Reviewed-by: Miguel Luis <miguel.luis@oracle.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250603070824.1192795-2-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-27KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUsMaxim Levitsky
Use kvm_trylock_all_vcpus instead of a custom implementation when locking all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs. This fixes the following false lockdep warning: [ 328.171264] BUG: MAX_LOCK_DEPTH too low! [ 328.175227] turning off the locking correctness validator. [ 328.180726] Please attach the output of /proc/lock_stat to the bug report [ 328.187531] depth: 48 max: 48! [ 328.190678] 48 locks held by qemu-kvm/11664: [ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0 [ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-ID: <20250512180407.659015-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-23Merge branch kvm-arm64/nv-nv into kvmarm-master/nextMarc Zyngier
* kvm-arm64/nv-nv: : . : Flick the switch on the NV support by adding the missing piece : in the form of the VNCR page management. From the cover letter: : : "This is probably the most interesting bit of the whole NV adventure. : So far, everything else has been a walk in the park, but this one is : where the real fun takes place. : : With FEAT_NV2, most of the NV support revolves around tricking a guest : into accessing memory while it tries to access system registers. The : hypervisor's job is to handle the context switch of the actual : registers with the state in memory as needed." : . KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating KVM: arm64: Document NV caps and vcpu flags KVM: arm64: Allow userspace to request KVM_ARM_VCPU_EL2* KVM: arm64: nv: Remove dead code from ERET handling KVM: arm64: nv: Plumb TLBI S1E2 into system instruction dispatch KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2 KVM: arm64: nv: Program host's VNCR_EL2 to the fixmap address KVM: arm64: nv: Handle VNCR_EL2 invalidation from MMU notifiers KVM: arm64: nv: Handle mapping of VNCR_EL2 at EL2 KVM: arm64: nv: Handle VNCR_EL2-triggered faults KVM: arm64: nv: Add userspace and guest handling of VNCR_EL2 KVM: arm64: nv: Add pseudo-TLB backing VNCR_EL2 KVM: arm64: nv: Don't adjust PSTATE.M when L2 is nesting KVM: arm64: nv: Move TLBI range decoding to a helper KVM: arm64: nv: Snapshot S1 ASID tagging information during walk KVM: arm64: nv: Extract translation helper from the AT code KVM: arm64: nv: Allocate VNCR page when required arm64: sysreg: Add layout for VNCR_EL2 Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23Merge branch kvm-arm64/fgt-masks into kvmarm-master/nextMarc Zyngier
* kvm-arm64/fgt-masks: (43 commits) : . : Large rework of the way KVM deals with trap bits in conjunction with : the CPU feature registers. It now draws a direct link between which : the feature set, the system registers that need to UNDEF to match : the configuration and bits that need to behave as RES0 or RES1 in : the trap registers that are visible to the guest. : : Best of all, these definitions are mostly automatically generated : from the JSON description published by ARM under a permissive : license. : . KVM: arm64: Handle TSB CSYNC traps KVM: arm64: Add FGT descriptors for FEAT_FGT2 KVM: arm64: Allow sysreg ranges for FGT descriptors KVM: arm64: Add context-switch for FEAT_FGT2 registers KVM: arm64: Add trap routing for FEAT_FGT2 registers KVM: arm64: Add sanitisation for FEAT_FGT2 registers KVM: arm64: Add FEAT_FGT2 registers to the VNCR page KVM: arm64: Use HCR_EL2 feature map to drive fixed-value bits KVM: arm64: Use HCRX_EL2 feature map to drive fixed-value bits KVM: arm64: Allow kvm_has_feat() to take variable arguments KVM: arm64: Use FGT feature maps to drive RES0 bits KVM: arm64: Validate FGT register descriptions against RES0 masks KVM: arm64: Switch to table-driven FGU configuration KVM: arm64: Handle PSB CSYNC traps KVM: arm64: Use KVM-specific HCRX_EL2 RES0 mask KVM: arm64: Remove hand-crafted masks for FGT registers KVM: arm64: Use computed FGT masks to setup FGT registers KVM: arm64: Propagate FGT masks to the nVHE hypervisor KVM: arm64: Unconditionally configure fine-grain traps KVM: arm64: Use computed masks as sanitisers for FGT registers ... Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23Merge branch kvm-arm64/pkvm-np-thp-6.16 into kvmarm-master/nextMarc Zyngier
* kvm-arm64/pkvm-np-thp-6.16: (21 commits) : . : Large mapping support for non-protected pKVM guests, courtesy of : Vincent Donnefort. From the cover letter: : : "This series adds support for stage-2 huge mappings (PMD_SIZE) to pKVM : np-guests, that is installing PMD-level mappings in the stage-2, : whenever the stage-1 is backed by either Hugetlbfs or THPs." : . KVM: arm64: np-guest CMOs with PMD_SIZE fixmap KVM: arm64: Stage-2 huge mappings for np-guests KVM: arm64: Add a range to pkvm_mappings KVM: arm64: Convert pkvm_mappings to interval tree KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest() KVM: arm64: Add a range to __pkvm_host_wrprotect_guest() KVM: arm64: Add a range to __pkvm_host_unshare_guest() KVM: arm64: Add a range to __pkvm_host_share_guest() KVM: arm64: Introduce for_each_hyp_page KVM: arm64: Handle huge mappings for np-guest CMOs KVM: arm64: Extend pKVM selftest for np-guests KVM: arm64: Selftest for pKVM transitions KVM: arm64: Don't WARN from __pkvm_host_share_guest() KVM: arm64: Add .hyp.data section KVM: arm64: Unconditionally cross check hyp state KVM: arm64: Defer EL2 stage-1 mapping on share KVM: arm64: Move hyp state to hyp_vmemmap KVM: arm64: Introduce {get,set}_host_state() helpers KVM: arm64: Use 0b11 for encoding PKVM_NOPAGE KVM: arm64: Fix pKVM page-tracking comments ... Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19KVM: arm64: Add sanitisation for FEAT_FGT2 registersMarc Zyngier
Just like the FEAT_FGT registers, treat the FGT2 variant the same way. THis is a large update, but a fairly mechanical one. The config dependencies are extracted from the 2025-03 JSON drop. Reviewed-by: Joey Gouly <joey.gouly@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19KVM: arm64: Add FEAT_FGT2 registers to the VNCR pageMarc Zyngier
The FEAT_FGT2 registers are part of the VNCR page. Describe the corresponding offsets and add them to the vcpu sysreg enumeration. Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19KVM: arm64: Allow kvm_has_feat() to take variable argumentsMarc Zyngier
In order to be able to write more compact (and easier to read) code, let kvm_has_feat() and co take variable arguments. This enables constructs such as: #define FEAT_SME ID_AA64PFR1_EL1, SME, IMP if (kvm_has_feat(kvm, FEAT_SME)) [...] which is admitedly more readable. Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19KVM: arm64: Use FGT feature maps to drive RES0 bitsMarc Zyngier
Another benefit of mapping bits to features is that it becomes trivial to define which bits should be handled as RES0. Let's apply this principle to the guest's view of the FGT registers. Reviewed-by: Joey Gouly <joey.gouly@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19KVM: arm64: Allow userspace to request KVM_ARM_VCPU_EL2*Marc Zyngier
Since we're (almost) feature complete, let's allow userspace to request KVM_ARM_VCPU_EL2* by bumping KVM_VCPU_MAX_FEATURES up. We also now advertise the features to userspace with new capabilities. It's going to be great... Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> Link: https://lore.kernel.org/r/20250514103501.2225951-17-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2Marc Zyngier
A TLBI by VA for S1 must take effect on our pseudo-TLB for VNCR and potentially knock the fixmap mapping. Even worse, that TLBI must be able to work cross-vcpu. For that, we track on a per-VM basis if any VNCR is mapped, using an atomic counter. Whenever a TLBI S1E2 occurs and that this counter is non-zero, we take the long road all the way back to the core code. There, we iterate over all vcpus and check whether this particular invalidation has any damaging effect. If it does, we nuke the pseudo TLB and the corresponding fixmap. Yes, this is costly. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250514103501.2225951-14-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19KVM: arm64: nv: Handle mapping of VNCR_EL2 at EL2Marc Zyngier
Now that we can handle faults triggered through VNCR_EL2, we need to map the corresponding page at EL2. But where, you'll ask? Since each CPU in the system can run a vcpu, we need a per-CPU mapping. For that, we carve a NR_CPUS range in the fixmap, giving us a per-CPU va at which to map the guest's VNCR's page. The mapping occurs both on vcpu load and on the back of a fault, both generating a request that will take care of the mapping. That mapping will also get dropped on vcpu put. Yes, this is a bit heavy handed, but it is simple. Eventually, we may want to have a per-VM, per-CPU mapping, which would avoid all the TLBI overhead. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250514103501.2225951-11-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19KVM: arm64: nv: Handle VNCR_EL2-triggered faultsMarc Zyngier
As VNCR_EL2.BADDR contains a VA, it is bound to trigger faults. These faults can have multiple source: - We haven't mapped anything on the host: we need to compute the resulting translation, populate a TLB, and eventually map the corresponding page - The permissions are out of whack: we need to tell the guest about this state of affairs Note that the kernel doesn't support S1POE for itself yet, so the particular case of a VNCR page mapped with no permissions or with write-only permissions is not correctly handled yet. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250514103501.2225951-10-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19KVM: arm64: nv: Add userspace and guest handling of VNCR_EL2Marc Zyngier
Plug VNCR_EL2 in the vcpu_sysreg enum, define its RES0/RES1 bits, and make it accessible to userspace when the VM is configured to support FEAT_NV2. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250514103501.2225951-9-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19KVM: arm64: nv: Add pseudo-TLB backing VNCR_EL2Marc Zyngier
FEAT_NV2 introduces an interesting problem for NV, as VNCR_EL2.BADDR is a virtual address in the EL2&0 (or EL2, but we thankfully ignore this) translation regime. As we need to replicate such mapping in the real EL2, it means that we need to remember that there is such a translation, and that any TLBI affecting EL2 can possibly affect this translation. It also means that any invalidation driven by an MMU notifier must be able to shoot down any such mapping. All in all, we need a data structure that represents this mapping, and that is extremely close to a TLB. Given that we can only use one of those per vcpu at any given time, we only allocate one. No effort is made to keep that structure small. If we need to start caching multiple of them, we may want to revisit that design point. But for now, it is kept simple so that we can reason about it. Oh, and add a braindump of how things are supposed to work, because I will definitely page this out at some point. Yes, pun intended. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250514103501.2225951-8-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19KVM: arm64: nv: Don't adjust PSTATE.M when L2 is nestingMarc Zyngier
We currently check for HCR_EL2.NV being set to decide whether we need to repaint PSTATE.M to say EL2 instead of EL1 on exit. However, this isn't correct when L2 is itself a hypervisor, and that L1 as set its own HCR_EL2.NV. That's because we "flatten" the state and inherit parts of the guest's own setup. In that case, we shouldn't adjust PSTATE.M, as this is really EL1 for both us and the guest. Instead of trying to try and work out how we ended-up with HCR_EL2.NV being set by introspecting both the host and guest states, use a per-CPU flag to remember the context (HYP or not), and use that information to decide whether PSTATE needs tweaking. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250514103501.2225951-7-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-10KVM: arm64: Validate FGT register descriptions against RES0 masksMarc Zyngier
In order to point out to the unsuspecting KVM hacker that they are missing something somewhere, validate that the known FGT bits do not intersect with the corresponding RES0 mask, as computed at boot time. THis check is also performed at boot time, ensuring that there is no runtime overhead. Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-10KVM: arm64: Switch to table-driven FGU configurationMarc Zyngier
Defining the FGU behaviour is extremely tedious. It relies on matching each set of bits from FGT registers with am architectural feature, and adding them to the FGU list if the corresponding feature isn't advertised to the guest. It is however relatively easy to dump most of that information from the architecture JSON description, and use that to control the FGU bits. Let's introduce a new set of tables descripbing the mapping between FGT bits and features. Most of the time, this is only a lookup in an idreg field, with a few more complex exceptions. While this is obviously many more lines in a new file, this is mostly generated, and is pretty easy to maintain. Reviewed-by: Joey Gouly <joey.gouly@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06KVM: arm64: Propagate FGT masks to the nVHE hypervisorMarc Zyngier
The nVHE hypervisor needs to have access to its own view of the FGT masks, which unfortunately results in a bit of data duplication. Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06KVM: arm64: Compute FGT masks from KVM's own FGT tablesMarc Zyngier
In the process of decoupling KVM's view of the FGT bits from the wider architectural state, use KVM's own FGT tables to build a synthetic view of what is actually known. This allows for some checking along the way. Reviewed-by: Joey Gouly <joey.gouly@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06arm64: sysreg: Replace HFGxTR_EL2 with HFG{R,W}TR_EL2Marc Zyngier
Treating HFGRTR_EL2 and HFGWTR_EL2 identically was a mistake. It makes things hard to reason about, has the potential to introduce bugs by giving a meaning to bits that are really reserved, and is in general a bad description of the architecture. Given that #defines are cheap, let's describe both registers as intended by the architecture, and repaint all the existing uses. Yes, this is painful. The registers themselves are generated from the JSON file in an automated way. Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-29Merge branch kvm-arm64/nv-pmu-fixes into kvmarm-master/nextMarc Zyngier
* kvm-arm64/nv-pmu-fixes: : . : Fixes for NV PMU emulation. From the cover letter: : : "Joey reports that some of his PMU tests do not behave quite as : expected: : : - MDCR_EL2.HPMN is set to 0 out of reset : : - PMCR_EL0.P should reset all the counters when written from EL2 : : Oliver points out that setting PMCR_EL0.N from userspace by writing to : the register is silly with NV, and that we need a new PMU attribute : instead. : : On top of that, I figured out that we had a number of little gotchas: : : - It is possible for a guest to write an HPMN value that is out of : bound, and it seems valuable to limit it : : - PMCR_EL0.N should be the maximum number of counters when read from : EL2, and MDCR_EL2.HPMN when read from EL0/EL1 : : - Prevent userspace from updating PMCR_EL0.N when EL2 is available" : . KVM: arm64: Let kvm_vcpu_read_pmcr() return an EL-dependent value for PMCR_EL0.N KVM: arm64: Handle out-of-bound write to MDCR_EL2.HPMN KVM: arm64: Don't let userspace write to PMCR_EL0.N when the vcpu has EL2 KVM: arm64: Allow userspace to limit the number of PMU counters for EL2 VMs KVM: arm64: Contextualise the handling of PMCR_EL0.P writes KVM: arm64: Fix MDCR_EL2.HPMN reset value KVM: arm64: Repaint pmcr_n into nr_pmu_counters Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-28KVM: arm64: Track SVE state in the hypervisor vcpu structureFuad Tabba
When dealing with a guest with SVE enabled, make sure the host SVE state is pinned at EL2 S1, and that the hypervisor vCPU state is correctly initialised (and then unpinned on teardown). Co-authored-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20250416152648.2982950-2-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-24KVM: arm64, x86: make kvm_arch_has_irq_bypass() inlinePaolo Bonzini
kvm_arch_has_irq_bypass() is a small function and even though it does not appear in any *really* hot paths, it's also not entirely rare. Make it inline---it also works out nicely in preparation for using it in kvm-intel.ko and kvm-amd.ko, since the function is not currently exported. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-11KVM: arm64: Repaint pmcr_n into nr_pmu_countersMarc Zyngier
The pmcr_n field obviously refers to PMCR_EL0.N, but is generally used as the number of counters seen by the guest. Rename it accordingly. Suggested-by: Oliver upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-03-20Merge branch 'kvm-nvmx-and-vm-teardown' into HEADPaolo Bonzini
The immediate issue being fixed here is a nVMX bug where KVM fails to detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI). However, checking for a pending interrupt accesses the legacy PIC, and x86's kvm_arch_destroy_vm() currently frees the PIC before destroying vCPUs, i.e. checking for IRQs during the forced nested VM-Exit results in a NULL pointer deref; that's a prerequisite for the nVMX fix. The remaining patches attempt to bring a bit of sanity to x86's VM teardown code, which has accumulated a lot of cruft over the years. E.g. KVM currently unloads each vCPU's MMUs in a separate operation from destroying vCPUs, all because when guest SMP support was added, KVM had a kludgy MMU teardown flow that broke when a VM had more than one 1 vCPU. And that oddity lived on, for 18 years... Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-19Merge branch 'kvm-arm64/pkvm-6.15' into kvmarm/nextOliver Upton
* kvm-arm64/pkvm-6.15: : pKVM updates for 6.15 : : - SecPageTable stats for stage-2 table pages allocated by the protected : hypervisor (Vincent Donnefort) : : - HCRX_EL2 trap + vCPU initialization fixes for pKVM (Fuad Tabba) KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu KVM: arm64: Factor out pKVM hyp vcpu creation to separate function KVM: arm64: Initialize HCRX_EL2 traps in pKVM KVM: arm64: Factor out setting HCRX_EL2 traps into separate function KVM: arm64: Count pKVM stage-2 usage in secondary pagetable stats KVM: arm64: Distinct pKVM teardown memcache for stage-2 KVM: arm64: Add flags to kvm_hyp_memcache Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19Merge branch 'kvm-arm64/writable-midr' into kvmarm/nextOliver Upton
* kvm-arm64/writable-midr: : Writable implementation ID registers, courtesy of Sebastian Ott : : Introduce a new capability that allows userspace to set the : ID registers that identify a CPU implementation: MIDR_EL1, REVIDR_EL1, : and AIDR_EL1. Also plug a hole in KVM's trap configuration where : SMIDR_EL1 was readable at EL1, despite the fact that KVM does not : support SME. KVM: arm64: Fix documentation for KVM_CAP_ARM_WRITABLE_IMP_ID_REGS KVM: arm64: Copy MIDR_EL1 into hyp VM when it is writable KVM: arm64: Copy guest CTR_EL0 into hyp VM KVM: selftests: arm64: Test writes to MIDR,REVIDR,AIDR KVM: arm64: Allow userspace to change the implementation ID registers KVM: arm64: Load VPIDR_EL2 with the VM's MIDR_EL1 value KVM: arm64: Maintain per-VM copy of implementation ID regs KVM: arm64: Set HCR_EL2.TID1 unconditionally Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19Merge branch 'kvm-arm64/pv-cpuid' into kvmarm/nextOliver Upton
* kvm-arm64/pv-cpuid: : Paravirtualized implementation ID, courtesy of Shameer Kolothum : : Big-little has historically been a pain in the ass to virtualize. The : implementation ID (MIDR, REVIDR, AIDR) of a vCPU can change at the whim : of vCPU scheduling. This can be particularly annoying when the guest : needs to know the underlying implementation to mitigate errata. : : "Hyperscalers" face a similar scheduling problem, where VMs may freely : migrate between hosts in a pool of heterogenous hardware. And yes, our : server-class friends are equally riddled with errata too. : : In absence of an architected solution to this wart on the ecosystem, : introduce support for paravirtualizing the implementation exposed : to a VM, allowing the VMM to describe the pool of implementations that a : VM may be exposed to due to scheduling/migration. : : Userspace is expected to intercept and handle these hypercalls using the : SMCCC filter UAPI, should it choose to do so. smccc: kvm_guest: Fix kernel builds for 32 bit arm KVM: selftests: Add test for KVM_REG_ARM_VENDOR_HYP_BMAP_2 smccc/kvm_guest: Enable errata based on implementation CPUs arm64: Make  _midr_in_range_list() an exported function KVM: arm64: Introduce KVM_REG_ARM_VENDOR_HYP_BMAP_2 KVM: arm64: Specify hypercall ABI for retrieving target implementations arm64: Modify _midr_range() functions to read MIDR/REVIDR internally Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19Merge branch 'kvm-arm64/nv-vgic' into kvmarm/nextOliver Upton
* kvm-arm64/nv-vgic: : NV VGICv3 support, courtesy of Marc Zyngier : : Support for emulating the GIC hypervisor controls and managing shadow : VGICv3 state for the L1 hypervisor. As part of it, bring in support for : taking IRQs to the L1 and UAPI to manage the VGIC maintenance interrupt. KVM: arm64: nv: Fail KVM init if asking for NV without GICv3 KVM: arm64: nv: Allow userland to set VGIC maintenance IRQ KVM: arm64: nv: Fold GICv3 host trapping requirements into guest setup KVM: arm64: nv: Propagate used_lrs between L1 and L0 contexts KVM: arm64: nv: Request vPE doorbell upon nested ERET to L2 KVM: arm64: nv: Respect virtual HCR_EL2.TWx setting KVM: arm64: nv: Add Maintenance Interrupt emulation KVM: arm64: nv: Handle L2->L1 transition on interrupt injection KVM: arm64: nv: Nested GICv3 emulation KVM: arm64: nv: Sanitise ICH_HCR_EL2 accesses KVM: arm64: nv: Plumb handling of GICv3 EL2 accesses KVM: arm64: nv: Add ICH_*_EL2 registers to vpcu_sysreg KVM: arm64: nv: Load timer before the GIC arm64: sysreg: Add layout for ICH_MISR_EL2 arm64: sysreg: Add layout for ICH_VTR_EL2 arm64: sysreg: Add layout for ICH_HCR_EL2 Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpuFuad Tabba
Instead of creating and initializing _all_ hyp vcpus in pKVM when the first host vcpu runs for the first time, initialize _each_ hyp vcpu in conjunction with its corresponding host vcpu. Some of the host vcpu state (e.g., system registers and traps values) is not initialized until the first time the host vcpu is run. Therefore, initializing a hyp vcpu before its corresponding host vcpu has run for the first time might not view the complete host state of these vcpus. Additionally, this behavior is inline with non-protected modes. Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20250314111832.4137161-5-tabba@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14KVM: arm64: Count pKVM stage-2 usage in secondary pagetable statsVincent Donnefort
Count the pages used by pKVM for the guest stage-2 in memory stats under secondary pagetable, similarly to what the VHE mode does. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250313114038.1502357-4-vdonnefort@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14KVM: arm64: Distinct pKVM teardown memcache for stage-2Vincent Donnefort
In order to account for memory dedicated to the stage-2 page-tables, use a separated memcache when tearing down the VM. Meanwhile rename reclaim_guest_pages to reflect the fact it only reclaim page-table pages. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250313114038.1502357-3-vdonnefort@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14KVM: arm64: Add flags to kvm_hyp_memcacheVincent Donnefort
Add flags to kvm_hyp_memcache and propagate the latter to the allocation and free callbacks. This will later allow to account for memory, based on the memcache configuration. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250313114038.1502357-2-vdonnefort@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03KVM: arm64: nv: Request vPE doorbell upon nested ERET to L2Oliver Upton
Running an L2 guest with GICv4 enabled goes absolutely nowhere, and gets into a vicious cycle of nested ERET followed by nested exception entry into the L1. When KVM does a put on a runnable vCPU, it marks the vPE as nonresident but does not request a doorbell IRQ. Behind the scenes in the ITS driver's view of the vCPU, its_vpe::pending_last gets set to true to indicate that context is still runnable. This comes to a head when doing the nested ERET into L2. The vPE doesn't get scheduled on the redistributor as it is exclusively part of the L1's VGIC context. kvm_vgic_vcpu_pending_irq() returns true because the vPE appears runnable, and KVM does a nested exception entry into the L1 before L2 ever gets off the ground. This issue can be papered over by requesting a doorbell IRQ when descheduling a vPE as part of a nested ERET. KVM needs this anyway to kick the vCPU out of the L2 when an IRQ becomes pending for the L1. Link: https://lore.kernel.org/r/20240823212703.3576061-4-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250225172930.1850838-13-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03KVM: arm64: nv: Handle L2->L1 transition on interrupt injectionMarc Zyngier
An interrupt being delivered to L1 while running L2 must result in the correct exception being delivered to L1. This means that if, on entry to L2, we found ourselves with pending interrupts in the L1 distributor, we need to take immediate action. This is done by posting a request which will prevent the entry in L2, and deliver an IRQ exception to L1, forcing the switch. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250225172930.1850838-10-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03KVM: arm64: nv: Add ICH_*_EL2 registers to vpcu_sysregMarc Zyngier
FEAT_NV2 comes with a bunch of register-to-memory redirection involving the ICH_*_EL2 registers (LRs, APRs, VMCR, HCR). Adds them to the vcpu_sysreg enumeration. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250225172930.1850838-6-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26KVM: arm64: Introduce KVM_REG_ARM_VENDOR_HYP_BMAP_2Shameer Kolothum
The vendor_hyp_bmap bitmap holds the information about the Vendor Hyp services available to the user space and can be get/set using {G, S}ET_ONE_REG interfaces. This is done using the pseudo-firmware bitmap register KVM_REG_ARM_VENDOR_HYP_BMAP. At present, this bitmap is a 64 bit one and since the function numbers for newly added DISCOVER_IPML_* hypercalls are 64-65, introduce another pseudo-firmware bitmap register KVM_REG_ARM_VENDOR_HYP_BMAP_2. Reviewed-by: Sebastian Ott <sebott@redhat.com> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20250221140229.12588-4-shameerali.kolothum.thodi@huawei.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26KVM: Drop kvm_arch_sync_events() now that all implementations are nopsSean Christopherson
Remove kvm_arch_sync_events() now that x86 no longer uses it (no other arch has ever used it). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Bibo Mao <maobibo@loongson.cn> Message-ID: <20250224235542.2562848-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26KVM: arm64: Allow userspace to change the implementation ID registersSebastian Ott
KVM's treatment of the ID registers that describe the implementation (MIDR, REVIDR, and AIDR) is interesting, to say the least. On the userspace-facing end of it, KVM presents the values of the boot CPU on all vCPUs and treats them as invariant. On the guest side of things KVM presents the hardware values of the local CPU, which can change during CPU migration in a big-little system. While one may call this fragile, there is at least some degree of predictability around it. For example, if a VMM wanted to present big-little to a guest, it could affine vCPUs accordingly to the correct clusters. All of this makes a giant mess out of adding support for making these implementation ID registers writable. Avoid breaking the rather subtle ABI around the old way of doing things by requiring opt-in from userspace to make the registers writable. When the cap is enabled, allow userspace to set MIDR, REVIDR, and AIDR to any non-reserved value and present those values consistently across all vCPUs. Signed-off-by: Sebastian Ott <sebott@redhat.com> [oliver: changelog, capability] Link: https://lore.kernel.org/r/20250225005401.679536-5-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26KVM: arm64: Maintain per-VM copy of implementation ID regsSebastian Ott
Get ready to allow changes to the implementation ID registers by tracking the VM-wide values. Signed-off-by: Sebastian Ott <sebott@redhat.com> Link: https://lore.kernel.org/r/20250225005401.679536-3-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-20KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2Oliver Upton
Vladimir reports that a race condition to attach a VMID to a stage-2 MMU sometimes results in a vCPU entering the guest with a VMID of 0: | CPU1 | CPU2 | | | | kvm_arch_vcpu_ioctl_run | | vcpu_load <= load VTTBR_EL2 | | kvm_vmid->id = 0 | | | kvm_arch_vcpu_ioctl_run | | vcpu_load <= load VTTBR_EL2 | | with kvm_vmid->id = 0| | kvm_arm_vmid_update <= allocates fresh | | kvm_vmid->id and | | reload VTTBR_EL2 | | | | | kvm_arm_vmid_update <= observes that kvm_vmid->id | | already allocated, | | skips reload VTTBR_EL2 Oh yeah, it's as bad as it looks. Remember that VHE loads the stage-2 MMU eagerly but a VMID only gets attached to the MMU later on in the KVM_RUN loop. Even in the "best case" where VTTBR_EL2 correctly gets reprogrammed before entering the EL1&0 regime, there is a period of time where hardware is configured with VMID 0. That's completely insane. So, rather than decorating the 'late' binding with another hack, just allocate the damn thing up front. Attaching a VMID from vcpu_load() is still rollover safe since (surprise!) it'll always get called after a vCPU was preempted. Excuse me while I go find a brown paper bag. Cc: stable@vger.kernel.org Fixes: 934bf871f011 ("KVM: arm64: Load the stage-2 MMU context in kvm_vcpu_load_vhe()") Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250219220737.130842-1-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13KVM: arm64: Fix alignment of kvm_hyp_memcache allocationsQuentin Perret
When allocating guest stage-2 page-table pages at EL2, pKVM can consume pages from the host-provided kvm_hyp_memcache. As pgtable.c expects zeroed pages, guest_s2_zalloc_page() actively implements this zeroing with a PAGE_SIZE memset. Unfortunately, we don't check the page alignment of the host-provided address before doing so, which could lead to the memset overrunning the page if the host was malicious. Fix this by simply force-aligning all kvm_hyp_memcache allocations to page boundaries. Fixes: 60dfe093ec13 ("KVM: arm64: Instantiate guest stage-2 page-tables at EL2") Reported-by: Ben Simner <ben.simner@cl.cam.ac.uk> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20250213153615.3642515-1-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>