summaryrefslogtreecommitdiff
path: root/arch/arm64/kvm
AgeCommit message (Collapse)Author
2023-06-01KVM: arm64: Handle FFA_MEM_LEND calls from the hostWill Deacon
Handle FFA_MEM_LEND calls from the host by treating them identically to FFA_MEM_SHARE calls for the purposes of the host stage-2 page-table, but forwarding on the original request to EL3. Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230523101828.7328-9-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01KVM: arm64: Handle FFA_MEM_RECLAIM calls from the hostWill Deacon
Intecept FFA_MEM_RECLAIM calls from the host and transition the host stage-2 page-table entries from the SHARED_OWNED state back to the OWNED state once EL3 has confirmed that the secure mapping has been reclaimed. Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230523101828.7328-8-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01KVM: arm64: Handle FFA_MEM_SHARE calls from the hostWill Deacon
Intercept FFA_MEM_SHARE/FFA_FN64_MEM_SHARE calls from the host and transition the host stage-2 page-table entries from the OWNED state to the SHARED_OWNED state prior to forwarding the call onto EL3. Co-developed-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230523101828.7328-7-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01KVM: arm64: Add FF-A helpers to share/unshare memory with secure worldWill Deacon
Extend pKVM's memory protection code so that we can update the host's stage-2 page-table to track pages shared with secure world by the host using FF-A and prevent those pages from being mapped into a guest. Co-developed-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230523101828.7328-6-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01KVM: arm64: Handle FFA_RXTX_MAP and FFA_RXTX_UNMAP calls from the hostWill Deacon
Handle FFA_RXTX_MAP and FFA_RXTX_UNMAP calls from the host by sharing the host's mailbox memory with the hypervisor and establishing a separate pair of mailboxes between the hypervisor and the SPMD at EL3. Co-developed-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230523101828.7328-5-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01KVM: arm64: Allocate pages for hypervisor FF-A mailboxesWill Deacon
The FF-A proxy code needs to allocate its own buffer pair for communication with EL3 and for forwarding calls from the host at EL1. Reserve a couple of pages for this purpose and use them to initialise the hypervisor's FF-A buffer structure. Co-developed-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230523101828.7328-4-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01KVM: arm64: Probe FF-A version and host/hyp partition ID during initWill Deacon
Probe FF-A during pKVM initialisation so that we can detect any inconsistencies in the version or partition ID early on. Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230523101828.7328-3-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-01KVM: arm64: Block unsafe FF-A calls from the hostWill Deacon
When KVM is initialised in protected mode, we must take care to filter certain FFA calls from the host kernel so that the integrity of guest and hypervisor memory is maintained and is not made available to the secure world. As a first step, intercept and block all memory-related FF-A SMC calls from the host to EL3 and don't advertise any FF-A features. This puts the framework in place for handling them properly. Co-developed-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Andrew Walbran <qwandor@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230523101828.7328-2-will@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-31KVM: arm64: Document default vPMU behavior on heterogeneous systemsOliver Upton
KVM maintains a mask of supported CPUs when a vPMU type is explicitly selected by userspace and is used to reject any attempt to run the vCPU on an unsupported CPU. This is great, but we're still beholden to the default behavior where vCPUs can be scheduled anywhere and guest counters may silently stop working. Avoid confusing the next poor sod to look at this code and document the intended behavior. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230525212723.3361524-3-oliver.upton@linux.dev
2023-05-31KVM: arm64: Iterate arm_pmus list to probe for default PMUOliver Upton
To date KVM has relied on using a perf event to probe the core PMU at the time of vPMU initialization. Behind the scenes perf_event_init() would iteratively walk the PMUs of the system and return the PMU that could handle the event. However, an upcoming change in perf core will drop the iterative walk, thereby breaking the fragile dance we do on the KVM side. Avoid the problem altogether by iterating over the list of supported PMUs maintained in KVM, returning the core PMU that matches the CPU we were called on. Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230525212723.3361524-2-oliver.upton@linux.dev
2023-05-31KVM: arm64: Drop last page ref in kvm_pgtable_stage2_free_removed()Oliver Upton
The reference count on page table allocations is increased for every 'counted' PTE (valid or donated) in the table in addition to the initial reference from ->zalloc_page(). kvm_pgtable_stage2_free_removed() fails to drop the last reference on the root of the table walk, meaning we leak memory. Fix it by dropping the last reference after the free walker returns, at which point all references for 'counted' PTEs have been released. Cc: stable@vger.kernel.org Fixes: 5c359cca1faf ("KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make") Reported-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Tested-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230530193213.1663411-1-oliver.upton@linux.dev
2023-05-30KVM: arm64: Use BTI for nvheMostafa Saleh
CONFIG_ARM64_BTI_KERNEL compiles the kernel to support ARMv8.5-BTI. However, the nvhe code doesn't make use of it as it doesn't map any pages with Guarded Page(GP) bit. kvm pgtable code is modified to map executable pages with GP bit if BTI is enabled for the kernel. At hyp init, SCTLR_EL2.BT is set to 1 to match EL1 configuration (SCTLR_EL1.BT1) set in bti_enable(). One difference between kernel and nvhe code, is that the kernel maps .text with GP while nvhe maps all the executable pages, this makes nvhe code need to deal with special initialization code coming from other executable sections (.idmap.text). For this we need to add bti instruction at the beginning of __kvm_handle_stub_hvc as it can be called by __host_hvc through branch instruction(br) and unlike SYM_FUNC_START, SYM_CODE_START doesn’t add bti instruction at the beginning, and it can’t be modified to add it as it is used with vector tables. Another solution which is more intrusive is to convert __kvm_handle_stub_hvc to a function and inject “bti jc” instead of “bti c” in SYM_FUNC_START Signed-off-by: Mostafa Saleh <smostafa@google.com> Link: https://lore.kernel.org/r/20230530150845.2856828-1-smostafa@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-30KVM: arm64: Populate fault info for watchpointAkihiko Odaki
When handling ESR_ELx_EC_WATCHPT_LOW, far_el2 member of struct kvm_vcpu_fault_info will be copied to far member of struct kvm_debug_exit_arch and exposed to the userspace. The userspace will see stale values from older faults if the fault info does not get populated. Fixes: 8fb2046180a0 ("KVM: arm64: Move early handlers to per-EC handlers") Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230530024651.10014-1-akihiko.odaki@daynix.com Cc: stable@vger.kernel.org
2023-05-24KVM: arm64: Reload PTE after invoking walker callback on preorder traversalFuad Tabba
The preorder callback on the kvm_pgtable_stage2_map() path can replace a table with a block, then recursively free the detached table. The higher-level walking logic stashes the old page table entry and then walks the freed table, invoking the leaf callback and potentially freeing pgtable pages prematurely. In normal operation, the call to tear down the detached stage-2 is indirected and uses an RCU callback to trigger the freeing. RCU is not available to pKVM, which is where this bug is triggered. Change the behavior of the walker to reload the page table entry after invoking the walker callback on preorder traversal, as it does for leaf entries. Tested on Pixel 6. Fixes: 5c359cca1faf ("KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make") Suggested-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230522103258.402272-1-tabba@google.com
2023-05-24KVM: arm64: Handle trap of tagged Set/Way CMOsMarc Zyngier
We appear to have missed the Set/Way CMOs when adding MTE support. Not that we really expect anyone to use them, but you never know what stupidity some people can come up with... Treat these mostly like we deal with the classic S/W CMOs, only with an additional check that MTE really is enabled. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20230515204601.1270428-3-maz@kernel.org
2023-05-19KVM: arm64: Prevent unconditional donation of unmapped regions from the hostWill Deacon
Since host stage-2 mappings are created lazily, we cannot rely solely on the pte in order to recover the target physical address when checking a host-initiated memory transition as this permits donation of unmapped regions corresponding to MMIO or "no-map" memory. Instead of inspecting the pte, move the addr_is_allowed_memory() check into the host callback function where it is passed the physical address directly from the walker. Cc: Quentin Perret <qperret@google.com> Fixes: e82edcc75c4e ("KVM: arm64: Implement do_share() helper for sharing memory") Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230518095844.1178-1-will@kernel.org
2023-05-19KVM: arm64: vgic: Fix a commentJean-Philippe Brucker
It is host userspace, not the guest, that issues KVM_DEV_ARM_VGIC_GRP_CTRL Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230518100914.2837292-5-jean-philippe@linaro.org
2023-05-19KVM: arm64: vgic: Fix locking commentJean-Philippe Brucker
It is now config_lock that must be held, not kvm lock. Replace the comment with a lockdep annotation. Fixes: f00327731131 ("KVM: arm64: Use config_lock to protect vgic state") Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230518100914.2837292-4-jean-philippe@linaro.org
2023-05-19KVM: arm64: vgic: Wrap vgic_its_create() with config_lockJean-Philippe Brucker
vgic_its_create() changes the vgic state without holding the config_lock, which triggers a lockdep warning in vgic_v4_init(): [ 358.667941] WARNING: CPU: 3 PID: 178 at arch/arm64/kvm/vgic/vgic-v4.c:245 vgic_v4_init+0x15c/0x7a8 ... [ 358.707410] vgic_v4_init+0x15c/0x7a8 [ 358.708550] vgic_its_create+0x37c/0x4a4 [ 358.709640] kvm_vm_ioctl+0x1518/0x2d80 [ 358.710688] __arm64_sys_ioctl+0x7ac/0x1ba8 [ 358.711960] invoke_syscall.constprop.0+0x70/0x1e0 [ 358.713245] do_el0_svc+0xe4/0x2d4 [ 358.714289] el0_svc+0x44/0x8c [ 358.715329] el0t_64_sync_handler+0xf4/0x120 [ 358.716615] el0t_64_sync+0x190/0x194 Wrap the whole of vgic_its_create() with config_lock since, in addition to calling vgic_v4_init(), it also modifies the global kvm->arch.vgic state. Fixes: f00327731131 ("KVM: arm64: Use config_lock to protect vgic state") Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230518100914.2837292-3-jean-philippe@linaro.org
2023-05-19KVM: arm64: vgic: Fix a circular locking issueJean-Philippe Brucker
Lockdep reports a circular lock dependency between the srcu and the config_lock: [ 262.179917] -> #1 (&kvm->srcu){.+.+}-{0:0}: [ 262.182010] __synchronize_srcu+0xb0/0x224 [ 262.183422] synchronize_srcu_expedited+0x24/0x34 [ 262.184554] kvm_io_bus_register_dev+0x324/0x50c [ 262.185650] vgic_register_redist_iodev+0x254/0x398 [ 262.186740] vgic_v3_set_redist_base+0x3b0/0x724 [ 262.188087] kvm_vgic_addr+0x364/0x600 [ 262.189189] vgic_set_common_attr+0x90/0x544 [ 262.190278] vgic_v3_set_attr+0x74/0x9c [ 262.191432] kvm_device_ioctl+0x2a0/0x4e4 [ 262.192515] __arm64_sys_ioctl+0x7ac/0x1ba8 [ 262.193612] invoke_syscall.constprop.0+0x70/0x1e0 [ 262.195006] do_el0_svc+0xe4/0x2d4 [ 262.195929] el0_svc+0x44/0x8c [ 262.196917] el0t_64_sync_handler+0xf4/0x120 [ 262.198238] el0t_64_sync+0x190/0x194 [ 262.199224] [ 262.199224] -> #0 (&kvm->arch.config_lock){+.+.}-{3:3}: [ 262.201094] __lock_acquire+0x2b70/0x626c [ 262.202245] lock_acquire+0x454/0x778 [ 262.203132] __mutex_lock+0x190/0x8b4 [ 262.204023] mutex_lock_nested+0x24/0x30 [ 262.205100] vgic_mmio_write_v3_misc+0x5c/0x2a0 [ 262.206178] dispatch_mmio_write+0xd8/0x258 [ 262.207498] __kvm_io_bus_write+0x1e0/0x350 [ 262.208582] kvm_io_bus_write+0xe0/0x1cc [ 262.209653] io_mem_abort+0x2ac/0x6d8 [ 262.210569] kvm_handle_guest_abort+0x9b8/0x1f88 [ 262.211937] handle_exit+0xc4/0x39c [ 262.212971] kvm_arch_vcpu_ioctl_run+0x90c/0x1c04 [ 262.214154] kvm_vcpu_ioctl+0x450/0x12f8 [ 262.215233] __arm64_sys_ioctl+0x7ac/0x1ba8 [ 262.216402] invoke_syscall.constprop.0+0x70/0x1e0 [ 262.217774] do_el0_svc+0xe4/0x2d4 [ 262.218758] el0_svc+0x44/0x8c [ 262.219941] el0t_64_sync_handler+0xf4/0x120 [ 262.221110] el0t_64_sync+0x190/0x194 Note that the current report, which can be triggered by the vgic_irq kselftest, is a triple chain that includes slots_lock, but after inverting the slots_lock/config_lock dependency, the actual problem reported above remains. In several places, the vgic code calls kvm_io_bus_register_dev(), which synchronizes the srcu, while holding config_lock (#1). And the MMIO handler takes the config_lock while holding the srcu read lock (#0). Break dependency #1, by registering the distributor and redistributors without holding config_lock. The ITS also uses kvm_io_bus_register_dev() but already relies on slots_lock to serialize calls. The distributor iodev is created on the first KVM_RUN call. Multiple threads will race for vgic initialization, and only the first one will see !vgic_ready() under the lock. To serialize those threads, rely on slots_lock rather than config_lock. Redistributors are created earlier, through KVM_DEV_ARM_VGIC_GRP_ADDR ioctls and vCPU creation. Similarly, serialize the iodev creation with slots_lock, and the rest with config_lock. Fixes: f00327731131 ("KVM: arm64: Use config_lock to protect vgic state") Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230518100914.2837292-2-jean-philippe@linaro.org
2023-05-18arm64: kvm: avoid overflow in integer divisionArnd Bergmann
The newly added kvm_mmu_split_nr_page_tables() function uses DIV_ROUND_DOWN_ULL() to divide 64-bit addresses, but this requires a 32-bit divisior, and PUD_SIZE may exceed that when 64KB pages are used: arch/arm64/kvm/mmu.c: In function 'kvm_mmu_split_nr_page_tables': include/linux/math.h:42:64: error: conversion from 'long unsigned int' to 'u32' {aka 'unsigned int'} changes value from '68719476736' to '0' [-Werror=overflow] 42 | DIV_ROUND_DOWN_ULL((unsigned long long)(ll) + (d) - 1, (d)) | ^~~ include/linux/math.h:39:47: note: in definition of macro 'DIV_ROUND_DOWN_ULL' 39 | #define DIV_ROUND_DOWN_ULL(ll, d) div_u64(ll, d) | ^ arch/arm64/kvm/mmu.c:95:22: note: in expansion of macro 'DIV_ROUND_UP_ULL' 95 | n += DIV_ROUND_UP_ULL(range, PUD_SIZE); | ^~~~~~~~~~~~~~~~ Since this code is only used on 64-bit targets, DIV_ROUND_UP() can deal with this more easily, as it already takes 64-bit arguments. Fixes: e7bf7a490c68 ("KVM: arm64: Split huge pages when dirty logging is enabled") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20230517202352.793673-1-arnd@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16KVM: arm64: Use local TLBI on permission relaxationMarc Zyngier
Broadcast TLB invalidations (TLBIs) targeting the Inner Shareable Domain are usually less performant than their non-shareable variant. In particular, we observed some implementations that take millliseconds to complete parallel broadcasted TLBIs. It's safe to use non-shareable TLBIs when relaxing permissions on a PTE in the KVM case. According to the ARM ARM (0487I.a) section D8.13.1 "Using break-before-make when updating translation table entries", permission relaxation does not need break-before-make. Specifically, R_WHZWS states that these are the only changes that require a break-before-make sequence: changes of memory type (Shareability or Cacheability), address changes, or changing the block size. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-13-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOGRicardo Koller
This is the arm64 counterpart of commit cb00a70bd4b7 ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG"), which has the benefit of splitting the cost of splitting a memslot across multiple ioctls. Split huge pages on the range specified using KVM_CLEAR_DIRTY_LOG. And do not split when enabling dirty logging if KVM_DIRTY_LOG_INITIALLY_SET is set. Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-12-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()Ricardo Koller
Move the functionality of kvm_mmu_write_protect_pt_masked() into its caller, kvm_arch_mmu_enable_log_dirty_pt_masked(). This will be used in a subsequent commit in order to share some of the code in kvm_arch_mmu_enable_log_dirty_pt_masked(). Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-11-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16KVM: arm64: Split huge pages when dirty logging is enabledRicardo Koller
Split huge pages eagerly when enabling dirty logging. The goal is to avoid doing it while faulting on write-protected pages, which negatively impacts guest performance. A memslot marked for dirty logging is split in 1GB pieces at a time. This is in order to release the mmu_lock and give other kernel threads the opportunity to run, and also in order to allocate enough pages to split a 1GB range worth of huge pages (or a single 1GB huge page). Note that these page allocations can fail, so eager page splitting is best-effort. This is not a correctness issue though, as huge pages can still be split on write-faults. Eager page splitting only takes effect when the huge page mapping has been existing in the stage-2 page table. Otherwise, the huge page will be mapped to multiple non-huge pages on page fault. The benefits of eager page splitting are the same as in x86, added with commit a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled"). For example, when running dirty_log_perf_test with 64 virtual CPUs (Ampere Altra), 1GB per vCPU, 50% reads, and 2MB HugeTLB memory, the time it takes vCPUs to access all of their memory after dirty logging is enabled decreased by 44% from 2.58s to 1.42s. Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-10-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16KVM: arm64: Add kvm_uninit_stage2_mmu()Ricardo Koller
Add kvm_uninit_stage2_mmu() and move kvm_free_stage2_pgd() into it. A future commit will add some more things to do inside of kvm_uninit_stage2_mmu(). Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-9-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16KVM: arm64: Refactor kvm_arch_commit_memory_region()Ricardo Koller
Refactor kvm_arch_commit_memory_region() as a preparation for a future commit to look cleaner and more understandable. Also, it looks more like its x86 counterpart (in kvm_mmu_slot_apply_flags()). Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-8-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16KVM: arm64: Add kvm_pgtable_stage2_split()Ricardo Koller
Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a range of huge pages. This will be used for eager-splitting huge pages into PAGE_SIZE pages. The goal is to avoid having to split huge pages on write-protection faults, and instead use this function to do it ahead of time for large ranges (e.g., all guest memory in 1G chunks at a time). Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-7-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZERicardo Koller
Add a capability for userspace to specify the eager split chunk size. The chunk size specifies how many pages to break at a time, using a single allocation. Bigger the chunk size, more pages need to be allocated ahead of time. Suggested-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-6-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16KVM: arm64: Add helper for creating unlinked stage2 subtreesRicardo Koller
Add a stage2 helper, kvm_pgtable_stage2_create_unlinked(), for creating unlinked tables (which is the opposite of kvm_pgtable_stage2_free_unlinked()). Creating an unlinked table is useful for splitting level 1 and 2 entries into subtrees of PAGE_SIZE PTEs. For example, a level 1 entry can be split into PAGE_SIZE PTEs by first creating a fully populated tree, and then use it to replace the level 1 entry in a single step. This will be used in a subsequent commit for eager huge-page splitting (a dirty-logging optimization). Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-4-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIsRicardo Koller
Add two flags to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_SKIP_BBM_TLBI and KVM_PGTABLE_WALK_SKIP_CMO, to indicate that the walk should not perform TLB invalidations (TLBIs) in break-before-make (BBM) nor cache maintenance operations (CMO). This will be used by a future commit to create unlinked tables not accessible to the HW page-table walker. Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-3-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-16KVM: arm64: Rename free_removed to free_unlinkedRicardo Koller
Normalize on referring to tables outside of an active paging structure as 'unlinked'. A subsequent change to KVM will add support for building page tables that are not part of an active paging structure. The existing 'removed_table' terminology is quite clunky when applied in this context. Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-2-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-11Merge branch kvm-arm64/pgtable-fixes-6.4 into kvmarm-master/fixesMarc Zyngier
* kvm-arm64/pgtable-fixes-6.4: : . : Fixes for concurrent S2 mapping race from Oliver: : : "So it appears that there is a race between two parallel stage-2 map : walkers that could lead to mapping the incorrect PA for a given IPA, as : the IPA -> PA relationship picks up an unintended offset. This series : eliminates the problem by using the current IPA of the walk as the : source-of-truth regarding where we are in a map operation." : . KVM: arm64: Constify start/end/phys fields of the pgtable walker data KVM: arm64: Infer PA offset from VA in hyp map walker KVM: arm64: Infer the PA offset from IPA in stage-2 map walker Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-05-11Merge branch kvm-arm64/misc-6.4 into kvmarm-master/fixesMarc Zyngier
* kvm-arm64/misc-6.4: : . : Minor changes for 6.4: : : - Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...) : : - FP/SVE/SME documentation update, in the hope that this field : becomes clearer... : : - Add workaround for the usual Apple SEIS brokenness : : - Random comment fixes : . KVM: arm64: vgic: Add Apple M2 PRO/MAX cpus to the list of broken SEIS implementations KVM: arm64: Clarify host SME state management KVM: arm64: Restructure check for SVE support in FP trap handler KVM: arm64: Document check for TIF_FOREIGN_FPSTATE KVM: arm64: Fix repeated words in comments KVM: arm64: Use the bitmap API to allocate bitmaps KVM: arm64: Slightly optimize flush_context() Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-05-11KVM: arm64: vgic: Add Apple M2 PRO/MAX cpus to the list of broken SEIS ↵Marc Zyngier
implementations Unsurprisingly, the M2 PRO is also affected by the SEIS bug, so add it to the naughty list. And since M2 MAX is likely to be of the same ilk, flag it as well. Tested on a M2 PRO mini machine. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Zenghui Yu <yuzenghui@huawei.com> Link: https://lore.kernel.org/r/20230501182141.39770-1-maz@kernel.org
2023-05-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "s390: - More phys_to_virt conversions - Improvement of AP management for VSIE (nested virtualization) ARM64: - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements. x86: - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled, and by giving the guest control of CR0.WP when EPT is enabled on VMX (VMX-only because SVM doesn't support per-bit controls) - Add CR0/CR4 helpers to query single bits, and clean up related code where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return as a bool - Move AMD_PSFD to cpufeatures.h and purge KVM's definition - Avoid unnecessary writes+flushes when the guest is only adding new PTEs - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations when emulating invalidations - Clean up the range-based flushing APIs - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry - Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite expensive if the guest is spamming fork() - Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features - Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES - Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest - AMD SVM: - Add support for virtual NMIs - Fixes for edge cases related to virtual interrupts - Intel AMX: - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is not being reported due to userspace not opting in via prctl() - Fix a bug in emulation of ENCLS in compatibility mode - Allow emulation of NOP and PAUSE for L2 - AMX selftests improvements - Misc cleanups MIPS: - Constify MIPS's internal callbacks (a leftover from the hardware enabling rework that landed in 6.3) Generic: - Drop unnecessary casts from "void *" throughout kvm_main.c - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct size by 8 bytes on 64-bit kernels by utilizing a padding hole Documentation: - Fix goof introduced by the conversion to rST" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits) KVM: s390: pci: fix virtual-physical confusion on module unload/load KVM: s390: vsie: clarifications on setting the APCB KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init() KVM: selftests: Test the PMU event "Instructions retired" KVM: selftests: Copy full counter values from guest in PMU event filter test KVM: selftests: Use error codes to signal errors in PMU event filter test KVM: selftests: Print detailed info in PMU event filter asserts KVM: selftests: Add helpers for PMC asserts in PMU event filter test KVM: selftests: Add a common helper for the PMU event filter guest code KVM: selftests: Fix spelling mistake "perrmited" -> "permitted" KVM: arm64: vhe: Drop extra isb() on guest exit KVM: arm64: vhe: Synchronise with page table walker on MMU update KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc() KVM: arm64: nvhe: Synchronise with page table walker on TLBI KVM: arm64: Handle 32bit CNTPCTSS traps KVM: arm64: nvhe: Synchronise with page table walker on vcpu run KVM: arm64: vgic: Don't acquire its_lock before config_lock KVM: selftests: Add test to verify KVM's supported XCR0 ...
2023-04-27Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-26Merge tag 'kvmarm-6.4' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.4 - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements.
2023-04-25Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "ACPI: - Improve error reporting when failing to manage SDEI on AGDI device removal Assembly routines: - Improve register constraints so that the compiler can make use of the zero register instead of moving an immediate #0 into a GPR - Allow the compiler to allocate the registers used for CAS instructions CPU features and system registers: - Cleanups to the way in which CPU features are identified from the ID register fields - Extend system register definition generation to handle Enum types when defining shared register fields - Generate definitions for new _EL2 registers and add new fields for ID_AA64PFR1_EL1 - Allow SVE to be disabled separately from SME on the kernel command-line Tracing: - Support for "direct calls" in ftrace, which enables BPF tracing for arm64 Kdump: - Don't bother unmapping the crashkernel from the linear mapping, which then allows us to use huge (block) mappings and reduce TLB pressure when a crashkernel is loaded. Memory management: - Try again to remove data cache invalidation from the coherent DMA allocation path - Simplify the fixmap code by mapping at page granularity - Allow the kfence pool to be allocated early, preventing the rest of the linear mapping from being forced to page granularity Perf and PMU: - Move CPU PMU code out to drivers/perf/ where it can be reused by the 32-bit ARM architecture when running on ARMv8 CPUs - Fix race between CPU PMU probing and pKVM host de-privilege - Add support for Apple M2 CPU PMU - Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event dynamically, depending on what the CPU actually supports - Minor fixes and cleanups to system PMU drivers Stack tracing: - Use the XPACLRI instruction to strip PAC from pointers, rather than rolling our own function in C - Remove redundant PAC removal for toolchains that handle this in their builtins - Make backtracing more resilient in the face of instrumentation Miscellaneous: - Fix single-step with KGDB - Remove harmless warning when 'nokaslr' is passed on the kernel command-line - Minor fixes and cleanups across the board" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits) KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege arm64: kexec: include reboot.h arm64: delete dead code in this_cpu_set_vectors() arm64/cpufeature: Use helper macro to specify ID register for capabilites drivers/perf: hisi: add NULL check for name drivers/perf: hisi: Remove redundant initialized of pmu->name arm64/cpufeature: Consistently use symbolic constants for min_field_value arm64/cpufeature: Pull out helper for CPUID register definitions arm64/sysreg: Convert HFGITR_EL2 to automatic generation ACPI: AGDI: Improve error reporting for problems during .remove() arm64: kernel: Fix kernel warning when nokaslr is passed to commandline perf/arm-cmn: Fix port detection for CMN-700 arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step arm64: move PAC masks to <asm/pointer_auth.h> arm64: use XPACLRI to strip PAC arm64: avoid redundant PAC stripping in __builtin_return_address() arm64/sme: Fix some comments of ARM SME arm64/signal: Alloc tpidr2 sigframe after checking system_supports_tpidr2() arm64/signal: Use system_supports_tpidr2() to check TPIDR2 arm64/idreg: Don't disable SME when disabling SVE ...
2023-04-24Merge tag 'rcu.6.4.april5.2023.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux Pull RCU updates from Joel Fernandes: - Updates and additions to MAINTAINERS files, with Boqun being added to the RCU entry and Zqiang being added as an RCU reviewer. I have also transitioned from reviewer to maintainer; however, Paul will be taking over sending RCU pull-requests for the next merge window. - Resolution of hotplug warning in nohz code, achieved by fixing cpu_is_hotpluggable() through interaction with the nohz subsystem. Tick dependency modifications by Zqiang, focusing on fixing usage of the TICK_DEP_BIT_RCU_EXP bitmask. - Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n kernels, fixed by Zqiang. - Improvements to rcu-tasks stall reporting by Neeraj. - Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for increased robustness, affecting several components like mac802154, drbd, vmw_vmci, tracing, and more. A report by Eric Dumazet showed that the API could be unknowingly used in an atomic context, so we'd rather make sure they know what they're asking for by being explicit: https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/ - Documentation updates, including corrections to spelling, clarifications in comments, and improvements to the srcu_size_state comments. - Better srcu_struct cache locality for readers, by adjusting the size of srcu_struct in support of SRCU usage by Christoph Hellwig. - Teach lockdep to detect deadlocks between srcu_read_lock() vs synchronize_srcu() contributed by Boqun. Previously lockdep could not detect such deadlocks, now it can. - Integration of rcutorture and rcu-related tools, targeted for v6.4 from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis module parameter, and more - Miscellaneous changes, various code cleanups and comment improvements * tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits) checkpatch: Error out if deprecated RCU API used mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep() rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep() ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep() net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep() net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep() lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep() tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep() misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep() drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep() rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan() rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early rcu: Remove never-set needwake assignment from rcu_report_qs_rdp() rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race rcu/trace: use strscpy() to instead of strncpy() tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem ...
2023-04-21Merge tag 'kvmarm-fixes-6.3-4' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.3, part #4 - Plug a buffer overflow due to the use of the user-provided register width for firmware regs. Outright reject accesses where the user register width does not match the kernel representation. - Protect non-atomic RMW operations on vCPU flags against preemption, as an update to the flags by an intervening preemption could be lost.
2023-04-21KVM: arm64: Clarify host SME state managementMark Brown
Normally when running a guest we do not touch the floating point register state until first use of floating point by the guest, saving the current state and loading the guest state at that point. This has been found to offer a performance benefit in common cases. However currently if SME is active when switching to a guest then we exit streaming mode, disable ZA and invalidate the floating point register state prior to starting the guest. The exit from streaming mode is required for correct guest operation, if we leave streaming mode enabled then many non-SME operations can generate SME traps (eg, SVE operations will become streaming SVE operations). If EL1 leaves CPACR_EL1.SMEN disabled then the host is unable to intercept these traps. This will mean that a SME unaware guest will see SME exceptions which will confuse it. Disabling streaming mode also avoids creating spurious indications of usage of the SME hardware which could impact system performance, especially with shared SME implementations. Document the requirement to exit streaming mode clearly. There is no issue with guest operation caused by PSTATE.ZA so we can defer handling for that until first floating point usage, do so if the register state is not that of the current task and hence has already been saved. We could also do this for the case where the register state is that for the current task however this is very unlikely to happen and would require disproportionate effort so continue to save the state in that case. Saving this state on first use would require that we map and unmap storage for the host version of these registers for use by the hypervisor, taking care to deal with protected KVM and the fact that the host can free or reallocate the backing storage. Given that the strong recommendation is that applications should only keep PSTATE.ZA enabled when the state it enables is in active use it is difficult to see a case where a VMM would wish to do this, it would need to not only be using SME but also running the guest in the middle of SME usage. This can be revisited in the future if a use case does arises, in the interim such tasks will work but experience a performance overhead. This brings our handling of SME more into line with our handling of other floating point state and documents more clearly the constraints we have, especially around streaming mode. Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221214-kvm-arm64-sme-context-switch-v2-3-57ba0082e9ff@kernel.org
2023-04-21KVM: arm64: Restructure check for SVE support in FP trap handlerMark Brown
We share the same handler for general floating point and SVE traps with a check to make sure we don't handle any SVE traps if the system doesn't have SVE support. Since we will be adding SME support and wishing to handle that along with other FP related traps rewrite the check to be more scalable and a bit clearer too, ensuring we don't misidentify SME traps as SVE ones. Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221214-kvm-arm64-sme-context-switch-v2-2-57ba0082e9ff@kernel.org
2023-04-21KVM: arm64: Document check for TIF_FOREIGN_FPSTATEMark Brown
In kvm_arch_vcpu_load_fp() we unconditionally set the current FP state to FP_STATE_HOST_OWNED, this will be overridden to FP_STATE_NONE if TIF_FOREIGN_FPSTATE is set but the check is deferred until kvm_arch_vcpu_ctxflush_fp() where we are no longer preemptable. Add a comment to this effect to help avoid people being concerned about the lack of a check and discover where the check is done. Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221214-kvm-arm64-sme-context-switch-v2-1-57ba0082e9ff@kernel.org
2023-04-21KVM: arm64: Fix repeated words in commentsJingyu Wang
Delete the redundant word 'to'. Signed-off-by: Jingyu Wang <jingyuwang_vip@163.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230309075919.169518-1-jingyuwang_vip@163.com
2023-04-21KVM: arm64: Constify start/end/phys fields of the pgtable walker dataMarc Zyngier
As we are revamping the way the pgtable walker evaluates some of the data, make it clear that we rely on somew of the fields to be constant across the lifetime of a walk. For this, flag the start, end and phys fields of the walk data as 'const', which will generate an error if we were to accidentally update these fields again. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21KVM: arm64: Infer PA offset from VA in hyp map walkerOliver Upton
Similar to the recently fixed stage-2 walker, the hyp map walker increments the PA and VA of a walk separately. Unlike stage-2, there is no bug here as the map walker has exclusive access to the stage-1 page tables. Nonetheless, in the interest of continuity throughout the page table code, tweak the hyp map walker to avoid incrementing the PA and instead use the VA as the authoritative source of how far along a table walk has gotten. Calculate the PA to use for a leaf PTE by adding the offset of the VA from the start of the walk to the starting PA. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230421071606.1603916-3-oliver.upton@linux.dev
2023-04-21KVM: arm64: Infer the PA offset from IPA in stage-2 map walkerOliver Upton
Until now, the page table walker counted increments to the PA and IPA of a walk in two separate places. While the PA is incremented as soon as a leaf PTE is installed in stage2_map_walker_try_leaf(), the IPA is actually bumped in the generic table walker context. Critically, __kvm_pgtable_visit() rereads the PTE after the LEAF callback returns to work out if a table or leaf was installed, and only bumps the IPA for a leaf PTE. This arrangement worked fine when we handled faults behind the write lock, as the walker had exclusive access to the stage-2 page tables. However, commit 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel") started handling all stage-2 faults behind the read lock, opening up a race where a walker could increment the PA but not the IPA of a walk. Nothing good ensues, as the walker starts mapping with the incorrect IPA -> PA relationship. For example, assume that two vCPUs took a data abort on the same IPA. One observes that dirty logging is disabled, and the other observed that it is enabled: vCPU attempting PMD mapping vCPU attempting PTE mapping ====================================== ===================================== /* install PMD */ stage2_make_pte(ctx, leaf); data->phys += granule; /* replace PMD with a table */ stage2_try_break_pte(ctx, data->mmu); stage2_make_pte(ctx, table); /* table is observed */ ctx.old = READ_ONCE(*ptep); table = kvm_pte_table(ctx.old, level); /* * map walk continues w/o incrementing * IPA. */ __kvm_pgtable_walk(..., level + 1); Bring an end to the whole mess by using the IPA as the single source of truth for how far along a walk has gotten. Work out the correct PA to map by calculating the IPA offset from the beginning of the walk and add that to the starting physical address. Cc: stable@vger.kernel.org Fixes: 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel") Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230421071606.1603916-2-oliver.upton@linux.dev
2023-04-21Merge branch kvm-arm64/spec-ptw into kvmarm-master/nextMarc Zyngier
* kvm-arm64/spec-ptw: : . : On taking an exception from EL1&0 to EL2(&0), the page table walker is : allowed to carry on with speculative walks started from EL1&0 while : running at EL2 (see R_LFHQG). Given that the PTW may be actively using : the EL1&0 system registers, the only safe way to deal with it is to : issue a DSB before changing any of it. : : We already did the right thing for SPE and TRBE, but ignored the PTW : for unknown reasons (probably because the architecture wasn't crystal : clear at the time). : : This requires a bit of surgery in the nvhe code, though most of these : patches are comments so that my future self can understand the purpose : of these barriers. The VHE code is largely unaffected, thanks to the : DSB in the context switch. : . KVM: arm64: vhe: Drop extra isb() on guest exit KVM: arm64: vhe: Synchronise with page table walker on MMU update KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc() KVM: arm64: nvhe: Synchronise with page table walker on TLBI KVM: arm64: nvhe: Synchronise with page table walker on vcpu run Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21Merge branch kvm-arm64/smccc-filtering into kvmarm-master/nextMarc Zyngier
* kvm-arm64/smccc-filtering: : . : SMCCC call filtering and forwarding to userspace, courtesy of : Oliver Upton. From the cover letter: : : "The Arm SMCCC is rather prescriptive in regards to the allocation of : SMCCC function ID ranges. Many of the hypercall ranges have an : associated specification from Arm (FF-A, PSCI, SDEI, etc.) with some : room for vendor-specific implementations. : : The ever-expanding SMCCC surface leaves a lot of work within KVM for : providing new features. Furthermore, KVM implements its own : vendor-specific ABI, with little room for other implementations (like : Hyper-V, for example). Rather than cramming it all into the kernel we : should provide a way for userspace to handle hypercalls." : . KVM: selftests: Fix spelling mistake "KVM_HYPERCAL_EXIT_SMC" -> "KVM_HYPERCALL_EXIT_SMC" KVM: arm64: Test that SMC64 arch calls are reserved KVM: arm64: Prevent userspace from handling SMC64 arch range KVM: arm64: Expose SMC/HVC width to userspace KVM: selftests: Add test for SMCCC filter KVM: selftests: Add a helper for SMCCC calls with SMC instruction KVM: arm64: Let errors from SMCCC emulation to reach userspace KVM: arm64: Return NOT_SUPPORTED to guest for unknown PSCI version KVM: arm64: Introduce support for userspace SMCCC filtering KVM: arm64: Add support for KVM_EXIT_HYPERCALL KVM: arm64: Use a maple tree to represent the SMCCC filter KVM: arm64: Refactor hvc filtering to support different actions KVM: arm64: Start handling SMCs from EL1 KVM: arm64: Rename SMC/HVC call handler to reflect reality KVM: arm64: Add vm fd device attribute accessors KVM: arm64: Add a helper to check if a VM has ran once KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALL Signed-off-by: Marc Zyngier <maz@kernel.org>