linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2023-08-31	KVM: x86/mmu: Don't rely on page-track mechanism to flush on memslot change	Sean Christopherson
	Call kvm_mmu_zap_all_fast() directly when flushing a memslot instead of bouncing through the page-track mechanism. KVM (unfortunately) needs to zap and flush all page tables on memslot DELETE/MOVE irrespective of whether KVM is shadowing guest page tables. This will allow changing KVM to register a page-track notifier on the first shadow root allocation, and will also allow deleting the misguided kvm_page_track_flush_slot() hook itself once KVM-GT also moves to a different method for reacting to memslot changes. No functional change intended. Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20221110014821.1548347-2-seanjc@google.com Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Move kvm_arch_flush_shadow_{all,memslot}() to mmu.c	Sean Christopherson
	Move x86's implementation of kvm_arch_flush_shadow_{all,memslot}() into mmu.c, and make kvm_mmu_zap_all() static as it was globally visible only for kvm_arch_flush_shadow_all(). This will allow refactoring kvm_arch_flush_shadow_memslot() to call kvm_mmu_zap_all() directly without having to expose kvm_mmu_zap_all_fast() outside of mmu.c. Keeping everything in mmu.c will also likely simplify supporting TDX, which intends to do zap only relevant SPTEs on memslot updates. No functional change intended. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	drm/i915/gvt: Protect gfn hash table with vgpu_lock	Sean Christopherson
	Use vgpu_lock instead of KVM's mmu_lock to protect accesses to the hash table used to track which gfns are write-protected when shadowing the guest's GTT, and hoist the acquisition of vgpu_lock from intel_vgpu_page_track_handler() out to its sole caller, kvmgt_page_track_write(). This fixes a bug where kvmgt_page_track_write(), which doesn't hold kvm->mmu_lock, could race with intel_gvt_page_track_remove() and trigger a use-after-free. Fixing kvmgt_page_track_write() by taking kvm->mmu_lock is not an option as mmu_lock is a r/w spinlock, and intel_vgpu_page_track_handler() might sleep when acquiring vgpu->cache_lock deep down the callstack: intel_vgpu_page_track_handler() \| \|-> page_track->handler / ppgtt_write_protection_handler() \| \|-> ppgtt_handle_guest_write_page_table_bytes() \| \|-> ppgtt_handle_guest_write_page_table() \| \|-> ppgtt_handle_guest_entry_removal() \| \|-> ppgtt_invalidate_pte() \| \|-> intel_gvt_dma_unmap_guest_page() \| \|-> mutex_lock(&vgpu->cache_lock); Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	drm/i915/gvt: Drop unused helper intel_vgpu_reset_gtt()	Sean Christopherson
	Drop intel_vgpu_reset_gtt() as it no longer has any callers. In addition to eliminating dead code, this eliminates the last possible scenario where __kvmgt_protect_table_find() can be reached without holding vgpu_lock. Requiring vgpu_lock to be held when calling __kvmgt_protect_table_find() will allow a protecting the gfn hash with vgpu_lock without too much fuss. No functional change intended. Fixes: ba25d977571e ("drm/i915/gvt: Do not destroy ppgtt_mm during vGPU D3->D0.") Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	drm/i915/gvt: Use an "unsigned long" to iterate over memslot gfns	Sean Christopherson
	Use an "unsigned long" instead of an "int" when iterating over the gfns in a memslot. The number of pages in the memslot is tracked as an "unsigned long", e.g. KVMGT could theoretically break if a KVM memslot larger than 16TiB were deleted (2^32 * 4KiB). Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	drm/i915/gvt: Don't rely on KVM's gfn_to_pfn() to query possible 2M GTT	Sean Christopherson
	Now that gvt_pin_guest_page() explicitly verifies the pinned PFN is a transparent hugepage page, don't use KVM's gfn_to_pfn() to pre-check if a 2MiB GTT entry is possible and instead just try to map the GFN with a 2MiB entry. Using KVM to query pfn that is ultimately managed through VFIO is odd, and KVM's gfn_to_pfn() is not intended for non-KVM consumption; it's exported only because of KVM vendor modules (x86 and PPC). Open code the check on 2MiB support instead of keeping is_2MB_gtt_possible() around for a single line of code. Move the call to intel_gvt_dma_map_guest_page() for a 4KiB entry into its case statement, i.e. fork the common path into the 4KiB and 2MiB "direct" shadow paths. Keeping the call in the "common" path is arguably more in the spirit of "one change per patch", but retaining the local "page_size" variable is silly, i.e. the call site will be changed either way, and jumping around the no-longer-common code is more subtle and rather odd, i.e. would just need to be immediately cleaned up. Drop the error message from gvt_pin_guest_page() when KVMGT attempts to shadow a 2MiB guest page that isn't backed by a compatible hugepage in the host. Dropping the pre-check on a THP makes it much more likely that the "error" will be encountered in normal operation. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	drm/i915/gvt: Error out on an attempt to shadowing an unknown GTT entry type	Sean Christopherson
	Bail from ppgtt_populate_shadow_entry() if an unexpected GTT entry type is encountered instead of subtly falling through to the common "direct shadow" path. Eliminating the default/error path's reliance on the common handling will allow hoisting intel_gvt_dma_map_guest_page() into the case statements so that the 2MiB case can try intel_gvt_dma_map_guest_page() and fallback to splitting the entry on failure. Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	drm/i915/gvt: Explicitly check that vGPU is attached before shadowing	Sean Christopherson
	Move the check that a vGPU is attached from is_2MB_gtt_possible() all the way up to shadow_ppgtt_mm() to avoid unnecessary work, and to make it more obvious that a future cleanup of is_2MB_gtt_possible() isn't introducing a bug. is_2MB_gtt_possible() has only one caller, ppgtt_populate_shadow_entry(), and all paths in ppgtt_populate_shadow_entry() eventually check for attachment by way of intel_gvt_dma_map_guest_page(). And of the paths that lead to ppgtt_populate_shadow_entry(), shadow_ppgtt_mm() is the only one that doesn't already check for INTEL_VGPU_STATUS_ACTIVE or INTEL_VGPU_STATUS_ATTACHED. workload_thread() <= pick_next_workload() => INTEL_VGPU_STATUS_ACTIVE \| -> dispatch_workload() \| \|-> prepare_workload() \| -> intel_vgpu_sync_oos_pages() \| \| \| \|-> ppgtt_set_guest_page_sync() \| \| \| \|-> sync_oos_page() \| \| \| \|-> ppgtt_populate_shadow_entry() \| \|-> intel_vgpu_flush_post_shadow() \| 1: \|-> ppgtt_handle_guest_write_page_table() \| \|-> ppgtt_handle_guest_entry_add() \| 2: \| -> ppgtt_populate_spt_by_guest_entry() \| \| \| \|-> ppgtt_populate_spt() \| \| \| \|-> ppgtt_populate_shadow_entry() \| \| \| \|-> ppgtt_populate_spt_by_guest_entry() [see 2] \| \|-> ppgtt_populate_shadow_entry() kvmgt_page_track_write() <= KVM callback => INTEL_VGPU_STATUS_ATTACHED \| \|-> intel_vgpu_page_track_handler() \| \|-> ppgtt_write_protection_handler() \| \|-> ppgtt_handle_guest_write_page_table_bytes() \| \|-> ppgtt_handle_guest_write_page_table() [see 1] Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	drm/i915/gvt: Put the page reference obtained by KVM's gfn_to_pfn()	Sean Christopherson
	Put the struct page reference acquired by gfn_to_pfn(), KVM's API is that the caller is ultimately responsible for dropping any reference. Note, kvm_release_pfn_clean() ensures the pfn is actually a refcounted struct page before trying to put any references. Fixes: b901b252b6cf ("drm/i915/gvt: Add 2M huge gtt support") Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	drm/i915/gvt: Don't try to unpin an empty page range	Yan Zhao
	Attempt to unpin pages in the error path of gvt_pin_guest_page() if and only if at least one page was successfully pinned. Unpinning doesn't cause functional problems, but vfio_device_container_unpin_pages() rightfully warns about being asked to unpin zero pages. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> [sean: write changelog] Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	drm/i915/gvt: Verify hugepages are contiguous in physical address space	Sean Christopherson
	When shadowing a GTT entry with a 2M page, verify that the pfns are contiguous, not just that the struct page pointers are contiguous. The memory map is virtual contiguous if "CONFIG_FLATMEM=y \|\| CONFIG_SPARSEMEM_VMEMMAP=y", but not for "CONFIG_SPARSEMEM=y && CONFIG_SPARSEMEM_VMEMMAP=n", so theoretically KVMGT could encounter struct pages that are virtually contiguous, but not physically contiguous. In practice, this flaw is likely a non-issue as it would cause functional problems iff a section isn't 2M aligned _and_ is directly adjacent to another section with discontiguous pfns. Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	drm/i915/gvt: remove interface intel_gvt_is_valid_gfn	Yan Zhao
	Currently intel_gvt_is_valid_gfn() is called in two places: (1) shadowing guest GGTT entry (2) shadowing guest PPGTT leaf entry, which was introduced in commit cc753fbe1ac4 ("drm/i915/gvt: validate gfn before set shadow page entry"). However, now it's not necessary to call this interface any more, because a. GGTT partial write issue has been fixed by commit bc0686ff5fad ("drm/i915/gvt: support inconsecutive partial gtt entry write") commit 510fe10b6180 ("drm/i915/gvt: fix a bug of partially write ggtt enties") b. PPGTT resides in normal guest RAM and we only treat 8-byte writes as valid page table writes. Any invalid GPA found is regarded as an error, either due to guest misbehavior/attack or bug in host shadow code. So,rather than do GFN pre-checking and replace invalid GFNs with scratch GFN and continue silently, just remove the pre-checking and abort PPGTT shadowing on error detected. c. GFN validity check is still performed in intel_gvt_dma_map_guest_page() --> gvt_pin_guest_page(). It's more desirable to call VFIO interface to do both validity check and mapping. Calling intel_gvt_is_valid_gfn() to do GFN validity check from KVM side while later mapping the GFN through VFIO interface is unnecessarily fragile and confusing for unaware readers. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> [sean: remove now-unused local variables] Acked-by: Zhi Wang <zhi.a.wang@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	drm/i915/gvt: Verify pfn is "valid" before dereferencing "struct page"	Sean Christopherson
	Check that the pfn found by gfn_to_pfn() is actually backed by "struct page" memory prior to retrieving and dereferencing the page. KVM supports backing guest memory with VM_PFNMAP, VM_IO, etc., and so there is no guarantee the pfn returned by gfn_to_pfn() has an associated "struct page". Fixes: b901b252b6cf ("drm/i915/gvt: Add 2M huge gtt support") Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: BUG() in rmap helpers iff CONFIG_BUG_ON_DATA_CORRUPTION=y	Sean Christopherson
	Introduce KVM_BUG_ON_DATA_CORRUPTION() and use it in the low-level rmap helpers to convert the existing BUG()s to WARN_ON_ONCE() when the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() on corruption of host kernel data structures. Environments that don't have infrastructure to automatically capture crash dumps, i.e. aren't likely to enable CONFIG_BUG_ON_DATA_CORRUPTION=y, are typically better served overall by WARN-and-continue behavior (for the kernel, the VM is dead regardless), as a BUG() while holding mmu_lock all but guarantees the _best_ case scenario is a panic(). Make the BUG()s conditional instead of removing/replacing them entirely as there's a non-zero chance (though by no means a guarantee) that the damage isn't contained to the target VM, e.g. if no rmap is found for a SPTE then KVM may be double-zapping the SPTE, i.e. has already freed the memory the SPTE pointed at and thus KVM is reading/writing memory that KVM no longer owns. Link: https://lore.kernel.org/all/20221129191237.31447-1-mizhang@google.com Suggested-by: Mingwei Zhang <mizhang@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20230729004722.1056172-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Plumb "struct kvm" all the way to pte_list_remove()	Mingwei Zhang
	Plumb "struct kvm" all the way to pte_list_remove() to allow the usage of KVM_BUG() and/or KVM_BUG_ON(). This will allow killing only the offending VM instead of doing BUG() if the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() if KVM's data structures (rmaps) appear to be corrupted. Signed-off-by: Mingwei Zhang <mizhang@google.com> [sean: tweak changelog] Link: https://lore.kernel.org/r/20230729004722.1056172-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Use BUILD_BUG_ON_INVALID() for KVM_MMU_WARN_ON() stub	Sean Christopherson
	Use BUILD_BUG_ON_INVALID() instead of an empty do-while loop to stub out KVM_MMU_WARN_ON() when CONFIG_KVM_PROVE_MMU=n, that way _some_ build issues with the usage of KVM_MMU_WARN_ON() will be dected even if the kernel is using the stubs, e.g. basic syntax errors will be detected. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Replace MMU_DEBUG with proper KVM_PROVE_MMU Kconfig	Sean Christopherson
	Replace MMU_DEBUG, which requires manually modifying KVM to enable the macro, with a proper Kconfig, KVM_PROVE_MMU. Now that pgprintk() and rmap_printk() are gone, i.e. the macro guards only KVM_MMU_WARN_ON() and won't flood the kernel logs, enabling the option for debug kernels is both desirable and feasible. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Bug the VM if a vCPU ends up in long mode without PAE enabled	Sean Christopherson
	Promote the ASSERT(), which is quite dead code in KVM, into a KVM_BUG_ON() for KVM's sanity check that CR4.PAE=1 if the vCPU is in long mode when performing a walk of guest page tables. The sanity is quite cheap since neither EFER nor CR4.PAE requires a VMREAD, especially relative to the cost of walking the guest page tables. More importantly, the sanity check would have prevented the true badness fixed by commit 112e66017bff ("KVM: nVMX: add missing consistency checks for CR0 and CR4"). The missed consistency check resulted in some versions of KVM corrupting the on-stack guest_walker structure due to KVM thinking there are 4/5 levels of page tables, but wiring up the MMU hooks to point at the paging32 implementation, which only allocates space for two levels of page tables in "struct guest_walker32". Queue a page fault for injection if the assertion fails, as both callers, FNAME(gva_to_gpa) and FNAME(walk_addr_generic), assume that walker.fault contains sane info on a walk failure. E.g. not populating the fault info could result in KVM consuming and/or exposing uninitialized stack data before the vCPU is kicked out to userspace, which doesn't happen until KVM checks for KVM_REQ_VM_DEAD on the next enter. Move the check below the initialization of "pte_access" so that the aforementioned to-be-injected page fault doesn't consume uninitialized stack data. The information _shouldn't_ reach the guest or userspace, but there's zero downside to being paranoid in this case. Link: https://lore.kernel.org/r/20230729004722.1056172-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()	Sean Christopherson
	Convert all "runtime" assertions, i.e. assertions that can be triggered while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the MMU that is tied to running vCPUs, i.e. not contained to loading and initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if KVM ends up with a bug that causes a root to be invalidated before the page fault handler is invoked, pretty much _every_ page fault VM-Exit triggers the WARN. If a WARN is triggered frequently, the resulting spam usually causes a lot of damage of its own, e.g. consumes resources to log the WARN and pollutes the kernel log, often to the point where other useful information can be lost. In many case, the damage caused by the spam is actually worse than the bug itself, e.g. KVM can almost always recover from an unexpectedly invalid root. On the flip side, warning every time is rarely helpful for debug and triage, i.e. a single splat is usually sufficient to point a debugger in the right direction, and automated testing, e.g. syzkaller, typically runs with warn_on_panic=1, i.e. will never get past the first WARN anyways. Lastly, when an assertions fails multiple times, the stack traces in KVM are almost always identical, i.e. the full splat only needs to be captured once. And _if_ there is value in captruing information about the failed assert, a ratelimited printk() is sufficient and less likely to rack up a large amount of collateral damage. Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Rename MMU_WARN_ON() to KVM_MMU_WARN_ON()	Sean Christopherson
	Rename MMU_WARN_ON() to make it super obvious that the assertions are all about KVM's MMU, not the primary MMU. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Cleanup sanity check of SPTEs at SP free	Sean Christopherson
	Massage the error message for the sanity check on SPTEs when freeing a shadow page to be more verbose, and to print out all shadow-present SPTEs, not just the first SPTE encountered. Printing all SPTEs can be quite valuable for debug, e.g. highlights whether the leak is a one-off or widepsread, or possibly the result of memory corruption (something else in the kernel stomping on KVM's SPTEs). Opportunistically move the MMU_WARN_ON() into the helper itself, which will allow a future cleanup to use BUILD_BUG_ON_INVALID() as the stub for MMU_WARN_ON(). BUILD_BUG_ON_INVALID() works as intended and results in the compiler complaining about is_empty_shadow_page() not being declared. Link: https://lore.kernel.org/r/20230729004722.1056172-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Avoid pointer arithmetic when iterating over SPTEs	Sean Christopherson
	Replace the pointer arithmetic used to iterate over SPTEs in is_empty_shadow_page() with more standard interger-based iteration. No functional change intended. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Delete the "dbg" module param	Sean Christopherson
	Delete KVM's "dbg" module param now that its usage in KVM is gone (it used to guard pgprintk() and rmap_printk()). Link: https://lore.kernel.org/r/20230729004722.1056172-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Delete rmap_printk() and all its usage	Sean Christopherson
	Delete rmap_printk() so that MMU_WARN_ON() and MMU_DEBUG can be morphed into something that can be regularly enabled for debug kernels. The information provided by rmap_printk() isn't all that useful now that the rmap and unsync code is mature, as the prints are simultaneously too verbose (_lots_ of message) and yet not verbose enough to be helpful for debug (most instances print just the SPTE pointer/value, which is rarely sufficient to root cause anything but trivial bugs). Alternatively, rmap_printk() could be reworked to into tracepoints, but it's not clear there is a real need as rmap bugs rarely escape initial development, and when bugs do escape to production, they are often edge cases and/or reside in code that isn't directly related to the rmaps. In other words, the problems with rmap_printk() being unhelpful also apply to tracepoints. And deleting rmap_printk() doesn't preclude adding tracepoints in the future. Link: https://lore.kernel.org/r/20230729004722.1056172-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Delete pgprintk() and all its usage	Sean Christopherson
	Delete KVM's pgprintk() and all its usage, as the code is very prone to bitrot due to being buried behind MMU_DEBUG, and the functionality has been rendered almost entirely obsolete by the tracepoints KVM has gained over the years. And for the situations where the information provided by KVM's tracepoints is insufficient, pgprintk() rarely fills in the gaps, and is almost always far too noisy, i.e. developers end up implementing custom prints anyways. Link: https://lore.kernel.org/r/20230729004722.1056172-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Guard against collision with KVM-defined PFERR_IMPLICIT_ACCESS	Sean Christopherson
	Add an assertion in kvm_mmu_page_fault() to ensure the error code provided by hardware doesn't conflict with KVM's software-defined IMPLICIT_ACCESS flag. In the unlikely scenario that future hardware starts using bit 48 for a hardware-defined flag, preserving the bit could result in KVM incorrectly interpreting the unknown flag as KVM's IMPLICIT_ACCESS flag. WARN so that any such conflict can be surfaced to KVM developers and resolved, but otherwise ignore the bit as KVM can't possibly rely on a flag it knows nothing about. Fixes: 4f4aa80e3b88 ("KVM: X86: Handle implicit supervisor access with SMAP") Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230721223711.2334426-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	KVM: x86/mmu: Move the lockdep_assert of mmu_lock to inside ↵	Like Xu
	clear_dirty_pt_masked() Move the lockdep_assert_held_write(&kvm->mmu_lock) from the only one caller kvm_tdp_mmu_clear_dirty_pt_masked() to inside clear_dirty_pt_masked(). This change makes it more obvious why it's safe for clear_dirty_pt_masked() to use the non-atomic (for non-volatile SPTEs) tdp_mmu_clear_spte_bits() helper. for_each_tdp_mmu_root() does its own lockdep, so the only "loss" in lockdep coverage is if the list is completely empty. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230627042639.12636-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31	Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM x86 changes for 6.6: - Misc cleanups - Retry APIC optimized recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR can diverge from the default iff TSC scaling is enabled, and clean up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID
2023-08-31	Merge tag 'kvm-x86-svm-6.6' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM: x86: SVM changes for 6.6: - Add support for SEV-ES DebugSwap, i.e. allow SEV-ES guests to use debug registers and generate/handle #DBs - Clean up LBR virtualization code - Fix a bug where KVM fails to set the target pCPU during an IRTE update - Fix fatal bugs in SEV-ES intrahost migration - Fix a bug where the recent (architecturally correct) change to reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to skip it)
2023-08-31	Merge tag 'kvm-x86-vmx-6.6' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM: x86: VMX changes for 6.6: - Misc cleanups - Fix a bug where KVM reads a stale vmcs.IDT_VECTORING_INFO_FIELD when trying to handle NMI VM-Exits
2023-08-31	Merge tag 'kvm-x86-pmu-6.6' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM x86 PMU changes for 6.6: - Clean up KVM's handling of Intel architectural events
2023-08-31	Merge tag 'kvm-riscv-6.6-1' of https://github.com/kvm-riscv/linux into HEAD	Paolo Bonzini
	KVM/riscv changes for 6.6 - Zba, Zbs, Zicntr, Zicsr, Zifencei, and Zihpm support for Guest/VM - Added ONE_REG interface for SATP mode - Added ONE_REG interface to enable/disable multiple ISA extensions - Improved error codes returned by ONE_REG interfaces - Added KVM_GET_REG_LIST ioctl() implementation for KVM RISC-V - Added get-reg-list selftest for KVM RISC-V
2023-08-31	Merge tag 'kvm-s390-next-6.6-1' of ↵	Paolo Bonzini
	https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD - PV crypto passthrough enablement (Tony, Steffen, Viktor, Janosch) Allows a PV guest to use crypto cards. Card access is governed by the firmware and once a crypto queue is "bound" to a PV VM every other entity (PV or not) looses access until it is not bound anymore. Enablement is done via flags when creating the PV VM. - Guest debug fixes (Ilya)
2023-08-31	Merge tag 'kvm-x86-selftests-6.6' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM: x86: Selftests changes for 6.6: - Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs - Add support for printf() in guest code and covert all guest asserts to use printf-based reporting - Clean up the PMU event filter test and add new testcases - Include x86 selftests in the KVM x86 MAINTAINERS entry
2023-08-31	Merge tag 'kvm-x86-generic-6.6' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	Common KVM changes for 6.6: - Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass action specific data without needing to constantly update the main handlers. - Drop unused function declarations
2023-08-31	Merge tag 'kvmarm-6.6' of ↵	Paolo Bonzini
	git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 6.6 - Add support for TLB range invalidation of Stage-2 page tables, avoiding unnecessary invalidations. Systems that do not implement range invalidation still rely on a full invalidation when dealing with large ranges. - Add infrastructure for forwarding traps taken from a L2 guest to the L1 guest, with L0 acting as the dispatcher, another baby step towards the full nested support. - Simplify the way we deal with the (long deprecated) 'CPU target', resulting in a much needed cleanup. - Fix another set of PMU bugs, both on the guest and host sides, as we seem to never have any shortage of those... - Relax the alignment requirements of EL2 VA allocations for non-stack allocations, as we were otherwise wasting a lot of that precious VA space. - The usual set of non-functional cleanups, although I note the lack of spelling fixes...
2023-08-31	nls: Hide new NLS_UCS2_UTILS	Dr. David Alan Gilbert
	NLS_UCS2_UTILS is an option selected by filesystems that need it, don't expose it to users. Fixes: 089f7f591348 ("fs/smb: Swing unicode common code from smb->NLS") Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-31	selftests/bpf: Fix d_path test	Jiri Olsa
	Recent commit [1] broke d_path test, because now filp_close is not called directly from sys_close, but eventually later when the file is finally released. As suggested by Hou Tao we don't need to re-hook the bpf program, but just instead we can use sys_close_range to trigger filp_close synchronously. [1] 021a160abf62 ("fs: use __fput_sync in close(2)") Suggested-by: Hou Tao <houtao@huaweicloud.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230831141103.359810-1-jolsa@kernel.org
2023-08-31	smb3: allow controlling length of time directory entries are cached with dir ↵	Steve French
	leases Currently with directory leases we cache directory contents for a fixed period of time (default 30 seconds) but for many workloads this is too short. Allow configuring the maximum amount of time directory entries are cached when a directory lease is held on that directory. Add module load parm "max_dir_cache" For example to set the timeout to 10 minutes you would do: echo 600 > /sys/module/cifs/parameters/dir_cache_timeout or to disable caching directory contents: echo 0 > /sys/module/cifs/parameters/dir_cache_timeout Reviewed-by: Bharath SM <bharathsm@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-08-31	block: don't add or resize partition on the disk with GENHD_FL_NO_PART	Li Lingfeng
	Commit a33df75c6328 ("block: use an xarray for disk->part_tbl") remove disk_expand_part_tbl() in add_partition(), which means all kinds of devices will support extended dynamic `dev_t`. However, some devices with GENHD_FL_NO_PART are not expected to add or resize partition. Fix this by adding check of GENHD_FL_NO_PART before add or resize partition. Fixes: a33df75c6328 ("block: use an xarray for disk->part_tbl") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230831075900.1725842-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-31	block: remove the call to file_remove_privs in blkdev_write_iter	Christoph Hellwig
	file_remove_privs instantly returns 0 when not called for regular files, so don't bother. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20230831121911.280155-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-31	ceph: make num_fwd and num_retry to __u32	Xiubo Li
	The num_fwd in MClientRequestForward is int32_t, while the num_fwd in ceph_mds_request_head is __u8. This is buggy when the num_fwd is larger than 256 it will always be truncate to 0 again. But the client couldn't recoginize this. This will make them to __u32 instead. Because the old cephs will directly copy the raw memories when decoding the reqeust's head, so we need to make sure this kclient will be compatible with old cephs. For newer cephs they will decode the requests depending the version, which will be much simpler and easier to extend new members. Link: https://tracker.ceph.com/issues/62145 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-31	ceph: make members in struct ceph_mds_request_args_ext a union	Xiubo Li
	In ceph mainline it will allow to set the btime in the setattr request and just add a 'btime' member in the union 'ceph_mds_request_args' and then bump up the header version to 4. That means the total size of union 'ceph_mds_request_args' will increase sizeof(struct ceph_timespec) bytes, but in kclient it will increase the sizeof(setattr_ext) bytes for each request. Since the MDS will always depend on the header's vesion and front_len members to decode the 'ceph_mds_request_head' struct, at the same time kclient hasn't supported the 'btime' feature yet in setattr request, so it's safe to do this change here. This will save 48 bytes memories for each request. Fixes: 4f1ddb1ea874 ("ceph: implement updated ceph_mds_request_head structure") Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-08-31	ALSA: hda/tas2781: Use standard clamp() macro	Takashi Iwai
	Instead of the home-made clamp() function, use the standard macro(). Fixes: 5be27f1e3ec9 ("ALSA: hda/tas2781: Add tas2781 HDA driver") Link: https://lore.kernel.org/r/20230831123620.23064-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2023-08-31	dma-pool: remove a __maybe_unused label in atomic_pool_expand	Christoph Hellwig
	Move the #endif a line so that free_page label is only seen by the compile pass when actually used. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chunhui He <hchunhui@mail.ustc.edu.cn> Reviewed-by: Robin Murphy <roin.murphy@arm.com>
2023-08-31	bpf, docs: Fix invalid escape sequence warnings in bpf_doc.py	Vishal Chourasia
	The script bpf_doc.py generates multiple SyntaxWarnings related to invalid escape sequences when executed with Python 3.12. These warnings do not appear in Python 3.10 and 3.11 and do not affect the kernel build, which completes successfully. This patch resolves these SyntaxWarnings by converting the relevant string literals to raw strings or by escaping backslashes. This ensures that backslashes are interpreted as literal characters, eliminating the warnings. Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20230829074931.2511204-1-vishalc@linux.ibm.com
2023-08-31	macintosh/ams: linux/platform_device.h is needed	Randy Dunlap
	ams.h uses struct platform_device, so the header should be used to prevent build errors: drivers/macintosh/ams/ams-input.c: In function 'ams_input_enable': drivers/macintosh/ams/ams-input.c:68:45: error: invalid use of undefined type 'struct platform_device' 68 \| input->dev.parent = &ams_info.of_dev->dev; drivers/macintosh/ams/ams-input.c: In function 'ams_input_init': drivers/macintosh/ams/ams-input.c:146:51: error: invalid use of undefined type 'struct platform_device' 146 \| return device_create_file(&ams_info.of_dev->dev, &dev_attr_joystick); drivers/macintosh/ams/ams-input.c: In function 'ams_input_exit': drivers/macintosh/ams/ams-input.c:151:44: error: invalid use of undefined type 'struct platform_device' 151 \| device_remove_file(&ams_info.of_dev->dev, &dev_attr_joystick); drivers/macintosh/ams/ams-input.c: In function 'ams_input_init': drivers/macintosh/ams/ams-input.c:147:1: error: control reaches end of non-void function [-Werror=return-type] 147 \| } Fixes: 233d687d1b78 ("macintosh: Explicitly include correct DT includes") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230829225837.15520-1-rdunlap@infradead.org
2023-08-31	xsk: Fix xsk_diag use-after-free error during socket cleanup	Magnus Karlsson
	Fix a use-after-free error that is possible if the xsk_diag interface is used after the socket has been unbound from the device. This can happen either due to the socket being closed or the device disappearing. In the early days of AF_XDP, the way we tested that a socket was not bound to a device was to simply check if the netdevice pointer in the xsk socket structure was NULL. Later, a better system was introduced by having an explicit state variable in the xsk socket struct. For example, the state of a socket that is on the way to being closed and has been unbound from the device is XSK_UNBOUND. The commit in the Fixes tag below deleted the old way of signalling that a socket is unbound, setting dev to NULL. This in the belief that all code using the old way had been exterminated. That was unfortunately not true as the xsk diagnostics code was still using the old way and thus does not work as intended when a socket is going down. Fix this by introducing a test against the state variable. If the socket is in the state XSK_UNBOUND, simply abort the diagnostic's netlink operation. Fixes: 18b1ab7aa76b ("xsk: Fix race at socket teardown") Reported-by: syzbot+822d1359297e2694f873@syzkaller.appspotmail.com Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: syzbot+822d1359297e2694f873@syzkaller.appspotmail.com Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230831100119.17408-1-magnus.karlsson@gmail.com
2023-08-31	NFS: switch back to using kill_anon_super	Christoph Hellwig
	NFS switch to open coding kill_anon_super in 7b14a213890a ("nfs: don't call bdi_unregister") to avoid the extra bdi_unregister call. At that point bdi_destroy was called in nfs_free_server and thus it required a later freeing of the anon dev_t. But since 0db10944a76b ("nfs: Convert to separately allocated bdi") the bdi has been free implicitly by the sb destruction, so this isn't needed anymore. By not open coding kill_anon_super, nfs now inherits the fix in dc3216b14160 ("super: ensure valid info"), and we remove the only open coded version of kill_anon_super. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230831052940.256193-1-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-31	mtd: key superblock by device number	Christian Brauner
	The mtd driver has similar problems than the one that was fixed in commit dc3216b14160 ("super: ensure valid info"). The kill_mtd_super() helper calls shuts the superblock down but leaves the superblock on fs_supers as the devices are still in use but puts the mtd device and cleans out the superblock's s_mtd field. This means another mounter can find the superblock on the list accessing its s_mtd field while it is curently in the process of being freed or already freed. Prevent that from happening by keying superblock by dev_t just as we do in the generic code. Link: https://lore.kernel.org/linux-fsdevel/20230829-weitab-lauwarm-49c40fc85863@brauner Acked-by: Richard Weinberger <richard@nod.at> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230829-vfs-super-mtd-v1-2-fecb572e5df3@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>