linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
4 days	Merge tag 'kvm-x86-mmu-6.17' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM x86 MMU changes for 6.17 - Exempt nested EPT from the the !USER + CR0.WP logic, as EPT doesn't interact with CR0.WP. - Move the TDX hardware setup code to tdx.c to better co-locate TDX code and eliminate a few global symbols. - Dynamically allocation the shadow MMU's hashed page list, and defer allocating the hashed list until it's actually needed (the TDP MMU doesn't use the list).
4 days	Merge tag 'kvm-x86-misc-6.17' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM x86 misc changes for 6.17 - Prevert the host's DEBUGCTL.FREEZE_IN_SMM (Intel only) when running the guest. Failure to honor FREEZE_IN_SMM can bleed host state into the guest. - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter (Intel only) to prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF. - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the vCPU's CPUID model. - Rework the MSR interception code so that the SVM and VMX APIs are more or less identical. - Recalculate all MSR intercepts from the "source" on MSR filter changes, and drop the dedicated "shadow" bitmaps (and their awful "max" size defines). - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the nested SVM MSRPM offsets tracker can't handle an MSR. - Advertise support for LKGS (Load Kernel GS base), a new instruction that's loosely related to FRED, but is supported and enumerated independently. - Fix a user-triggerable WARN that syzkaller found by stuffing INIT_RECEIVED, a.k.a. WFS, and then putting the vCPU into VMX Root Mode (post-VMXON). Use the same approach KVM uses for dealing with "impossible" emulation when running a !URG guest, and simply wait until KVM_RUN to detect that the vCPU has architecturally impossible state. - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of APERF/MPERF reads, so that a "properly" configured VM can "virtualize" APERF/MPERF (with many caveats). - Reject KVM_SET_TSC_KHZ if vCPUs have been created, as changing the "default" frequency is unsupported for VMs with a "secure" TSC, and there's no known use case for changing the default frequency for other VM types.
2025-07-17	Merge tag 'kvm-x86-fixes-6.16-rc7' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini
	KVM TDX fixes for 6.16 - Fix a formatting goof in the TDX documentation. - Reject KVM_SET_TSC_KHZ for guests with a protected TSC (currently only TDX). - Ensure struct kvm_tdx_capabilities fields that are not explicitly set by KVM are zeroed.
2025-07-17	KVM: TDX: Don't report base TDVMCALLs	Xiaoyao Li
	Remove TDVMCALLINFO_GET_QUOTE from user_tdvmcallinfo_1_r11 reported to userspace to align with the direction of the GHCI spec. Recently, concern was raised about a gap in the GHCI spec that left ambiguity in how to expose to the guest that only a subset of GHCI TDVMCalls were supported. During the back and forth on the spec details[0], <GetQuote> was moved from an individually enumerable TDVMCall, to one that is part of the 'base spec', meaning it doesn't have a specific bit in the <GetTDVMCallInfo> return values. Although the spec[1] is still in draft form, the GetQoute part has been agreed by the major TDX VMMs. Unfortunately the commits that were upstreamed still treat <GetQuote> as individually enumerable. They set bit 0 in the user_tdvmcallinfo_1_r11 which is reported to userspace to tell supported optional TDVMCalls, intending to say that <GetQuote> is supported. So stop reporting <GetQute> in user_tdvmcallinfo_1_r11 to align with the direction of the spec, and allow some future TDVMCall to use that bit. [0] https://lore.kernel.org/all/aEmuKII8FGU4eQZz@google.com/ [1] https://cdrdv2.intel.com/v1/dl/getContent/858626 Fixes: 28224ef02b56 ("KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20250717022010.677645-1-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-07-15	KVM: VMX: Ensure unused kvm_tdx_capabilities fields are zeroed out	Sean Christopherson
	Zero-allocate the kernel's kvm_tdx_capabilities structure and copy only the number of CPUID entries from the userspace structure. As is, KVM doesn't ensure kernel_tdvmcallinfo_1_{r11,r12} and user_tdvmcallinfo_1_r12 are zeroed, i.e. KVM will reflect whatever happens to be in the userspace structure back at userspace, and thus may report garbage to userspace. Zeroing the entire kernel structure also provides better semantics for the reserved field. E.g. if KVM extends kvm_tdx_capabilities to enumerate new information by repurposing bytes from the reserved field, userspace would be required to zero the new field in order to get useful information back (because older KVMs without support for the repurposed field would report garbage, a la the aforementioned tdvmcallinfo bugs). Fixes: 61bb28279623 ("KVM: TDX: Get system-wide info about TDX module on initialization") Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> Closes: https://lore.kernel.org/all/3ef581f1-1ff1-4b99-b216-b316f6415318@intel.com Tested-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://lore.kernel.org/r/20250714221928.1788095-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	KVM: x86: Use kvzalloc() to allocate VM struct	Sean Christopherson
	Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical allocation before falling back to __vmalloc(), to avoid the overhead of establishing the virtual mappings. For non-debug builds, The SVM and VMX (and TDX) structures are now just below 7000 bytes in the worst case scenario (see below), i.e. are order-1 allocations, and will likely remain that way for quite some time. Add compile-time assertions in vendor code to ensure the size of the structures, sans the memslot hash tables, are order-0 allocations, i.e. are less than 4KiB. There's nothing fundamentally wrong with a larger kvm_{svm,vmx,tdx} size, but given that the size of the structure (without the memslots hash tables) is below 2KiB after 18+ years of existence, more than doubling the size would be quite notable. Add sanity checks on the memslot hash table sizes, partly to ensure they aren't resized without accounting for the impact on VM structure size, and partly to document that the majority of the size of VM structures comes from the memslots. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250523001138.3182794-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: TDX: Move TDX hardware setup from main.c to tdx.c	Sean Christopherson
	Move TDX hardware setup to tdx.c, as the code is obviously TDX specific, co-locating the setup with tdx_bringup() makes it easier to see and document the success_disable_tdx "error" path, and configuring the TDX specific hooks in tdx.c reduces the number of globally visible TDX symbols. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250523001138.3182794-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: x86: Convert vcpu_run()'s immediate exit param into a generic bitmap	Sean Christopherson
	Convert kvm_x86_ops.vcpu_run()'s "force_immediate_exit" boolean parameter into an a generic bitmap so that similar "take action" information can be passed to vendor code without creating a pile of boolean parameters. This will allow dropping kvm_x86_ops.set_dr6() in favor of a new flag, and will also allow for adding similar functionality for re-loading debugctl in the active VMCS. Opportunistically massage the TDX WARN and comment to prepare for adding more run_flags, all of which are expected to be mutually exclusive with TDX, i.e. should be WARNed on. No functional change intended. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: TDX: Use kvm_arch_vcpu.host_debugctl to restore the host's DEBUGCTL	Sean Christopherson
	Use the kvm_arch_vcpu.host_debugctl snapshot to restore DEBUGCTL after running a TD vCPU. The final TDX series rebase was mishandled, likely due to commit fb71c7959356 ("KVM: x86: Snapshot the host's DEBUGCTL in common x86") deleting the same line of code from vmx.h, i.e. creating a semantic conflict of sorts, but no syntactic conflict. Using the version in kvm_vcpu_arch picks up the ulong => u64 fix (which isn't relevant to TDX) as well as the IRQ fix from commit 189ecdb3e112 ("KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQs"). Link: https://lore.kernel.org/all/20250307212053.2948340-10-pbonzini@redhat.com Cc: Adrian Hunter <adrian.hunter@intel.com> Fixes: 8af099037527 ("KVM: TDX: Save and restore IA32_DEBUGCTL") Reviewed-by: Adrian Hunter <adrian.hunter@intel.com> Link: https://lore.kernel.org/r/20250610232010.162191-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities	Paolo Bonzini
	Allow userspace to advertise TDG.VP.VMCALL subfunctions that the kernel also supports. For each output register of GetTdVmCallInfo's leaf 1, add two fields to KVM_TDX_CAPABILITIES: one for kernel-supported TDVMCALLs (userspace can set those blindly) and one for user-supported TDVMCALLs (userspace can set those if it knows how to handle them). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20	KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt	Paolo Bonzini
	Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20	KVM: TDX: Exit to userspace for GetTdVmCallInfo	Binbin Wu
	Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX, to allow userspace to provide information about the support of TDVMCALLs when r12 is 1 for the TDVMCALLs beyond the GHCI base API. GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>, <ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>, <Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and <Instruction.WRMSR>. They must be supported by VMM to support TDX guests. For GetTdVmCallInfo - When leaf (r12) to enumerate TDVMCALL functionality is set to 0, successful execution indicates all GHCI base TDVMCALLs listed above are supported. Update the KVM TDX document with the set of the GHCI base APIs. - When leaf (r12) to enumerate TDVMCALL functionality is set to 1, it indicates the TDX guest is querying the supported TDVMCALLs beyond the GHCI base TDVMCALLs. Exit to userspace to let userspace set the TDVMCALL sub-function bit(s) accordingly to the leaf outputs. KVM could set the TDVMCALL bit(s) supported by itself when the TDVMCALLs don't need support from userspace after returning from userspace and before entering guest. Currently, no such TDVMCALLs implemented, KVM just sets the values returned from userspace. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> [Adjust userspace API. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20	KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>	Binbin Wu
	Handle TDVMCALL for GetQuote to generate a TD-Quote. GetQuote is a doorbell-like interface used by TDX guests to request VMM to generate a TD-Quote signed by a service hosting TD-Quoting Enclave operating on the host. A TDX guest passes a TD Report (TDREPORT_STRUCT) in a shared-memory area as parameter. Host VMM can access it and queue the operation for a service hosting TD-Quoting enclave. When completed, the Quote is returned via the same shared-memory area. KVM only checks the GPA from the TDX guest has the shared-bit set and drops the shared-bit before exiting to userspace to avoid bleeding the shared-bit into KVM's exit ABI. KVM forwards the request to userspace VMM (e.g. QEMU) and userspace VMM queues the operation asynchronously. KVM sets the return code according to the 'ret' field set by userspace to notify the TDX guest whether the request has been queued successfully or not. When the request has been queued successfully, the TDX guest can poll the status field in the shared-memory area to check whether the Quote generation is completed or not. When completed, the generated Quote is returned via the same buffer. Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is required to handle the KVM exit reason as the initial support for TDX, by reentering KVM to ensure that the TDVMCALL is complete. While at it, add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> [Adjust userspace API. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20	KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs	Binbin Wu
	Add the new TDVMCALL status code TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED and return it for unimplemented TDVMCALL subfunctions. Returning TDVMCALL_STATUS_INVALID_OPERAND when a subfunction is not implemented is vague because TDX guests can't tell the error is due to the subfunction is not supported or an invalid input of the subfunction. New GHCI spec adds TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED to avoid the ambiguity. Use it instead of TDVMCALL_STATUS_INVALID_OPERAND. Before the change, for common guest implementations, when a TDX guest receives TDVMCALL_STATUS_INVALID_OPERAND, it has two cases: 1. Some operand is invalid. It could change the operand to another value retry. 2. The subfunction is not supported. For case 1, an invalid operand usually means the guest implementation bug. Since the TDX guest can't tell which case is, the best practice for handling TDVMCALL_STATUS_INVALID_OPERAND is stopping calling such leaf, treating the failure as fatal if the TDVMCALL is essential or ignoring it if the TDVMCALL is optional. With this change, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED could be sent to old TDX guest that do not know about it, but it is expected that the guest will make the same action as TDVMCALL_STATUS_INVALID_OPERAND. Currently, no known TDX guest checks TDVMCALL_STATUS_INVALID_OPERAND specifically; for example Linux just checks for success. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> [Return it for untrapped KVM_HC_MAP_GPA_RANGE. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-07	Merge branch 'kvm-tdx-initial' into HEAD	Paolo Bonzini
	This large commit contains the initial support for TDX in KVM. All x86 parts enable the host-side hypercalls that KVM uses to talk to the TDX module, a software component that runs in a special CPU mode called SEAM (Secure Arbitration Mode). The series is in turn split into multiple sub-series, each with a separate merge commit: - Initialization: basic setup for using the TDX module from KVM, plus ioctls to create TDX VMs and vCPUs. - MMU: in TDX, private and shared halves of the address space are mapped by different EPT roots, and the private half is managed by the TDX module. Using the support that was added to the generic MMU code in 6.14, add support for TDX's secure page tables to the Intel side of KVM. Generic KVM code takes care of maintaining a mirror of the secure page tables so that they can be queried efficiently, and ensuring that changes are applied to both the mirror and the secure EPT. - vCPU enter/exit: implement the callbacks that handle the entry of a TDX vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore of host state. - Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits" but are triggered through a different mechanism, similar to VMGEXIT for SEV-ES and SEV-SNP. - Interrupt handling: support for virtual interrupt injection as well as handling VM-Exits that are caused by vectored events. Exclusive to TDX are machine-check SMIs, which the kernel already knows how to handle through the kernel machine check handler (commit 7911f145de5f, "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode") - Loose ends: handling of the remaining exits from the TDX module, including EPT violation/misconfig and several TDVMCALL leaves that are handled in the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning an error or ignoring operations that are not supported by TDX guests Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: KVM: TDX: Always honor guest PAT on TDX enabled guests	Yan Zhao
	Always honor guest PAT in KVM-managed EPTs on TDX enabled guests by making self-snoop feature a hard dependency for TDX and making quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT not a valid quirk once TDX is enabled. The quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT only affects memory type of KVM-managed EPTs. For the TDX-module-managed private EPT, memory type is always forced to WB now. Honoring guest PAT in KVM-managed EPTs ensures KVM does not invoke kvm_zap_gfn_range() when attaching/detaching non-coherent DMA devices, which would cause mirrored EPTs for TDs to be zapped, leading to the TDX-module-managed private EPT being incorrectly zapped. As a new feature, TDX always comes with support for self-snoop, and does not have to worry about unmodifiable but buggy guests. So, simply ignore KVM_X86_QUIRK_IGNORE_GUEST_PAT on TDX guests just like kvm-amd.ko already does. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250224071039.31511-1-yan.y.zhao@intel.com> [Only apply to TDX guests. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Enable guest access to MTRR MSRs	Binbin Wu
	Allow TDX guests to access MTRR MSRs as what KVM does for normal VMs, i.e., KVM emulates accesses to MTRR MSRs, but doesn't virtualize guest MTRR memory types. TDX module exposes MTRR feature to TDX guests unconditionally. KVM needs to support MTRR MSRs accesses for TDX guests to match the architectural behavior. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-19-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall	Isaku Yamahata
	Implement TDG.VP.VMCALL<GetTdVmCallInfo> hypercall. If the input value is zero, return success code and zero in output registers. TDG.VP.VMCALL<GetTdVmCallInfo> hypercall is a subleaf of TDG.VP.VMCALL to enumerate which TDG.VP.VMCALL sub leaves are supported. This hypercall is for future enhancement of the Guest-Host-Communication Interface (GHCI) specification. The GHCI version of 344426-001US defines it to require input R12 to be zero and to return zero in output registers, R11, R12, R13, and R14 so that guest TD enumerates no enhancement. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-12-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Enable guest access to LMCE related MSRs	Isaku Yamahata
	Allow TDX guest to configure LMCE (Local Machine Check Event) by handling MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL. MCE and MCA are advertised via cpuid based on the TDX module spec. Guest kernel can access IA32_FEAT_CTL to check whether LMCE is opted-in by the platform or not. If LMCE is opted-in by the platform, guest kernel can access IA32_MCG_EXT_CTL to enable/disable LMCE. Handle MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL for TDX guests to avoid failure when a guest accesses them with TDG.VP.VMCALL<MSR> on #VE. E.g., Linux guest will treat the failure as a #GP(0). Userspace VMM may not opt-in LMCE by default, e.g., QEMU disables it by default, "-cpu lmce=on" is needed in QEMU command line to opt-in it. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: rework changelog] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-11-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall	Isaku Yamahata
	Morph PV RDMSR/WRMSR hypercall to EXIT_REASON_MSR_{READ,WRITE} and wire up KVM backend functions. For complete_emulated_msr() callback, instead of injecting #GP on error, implement tdx_complete_emulated_msr() to set return code on error. Also set return value on MSR read according to the values from kvm x86 registers. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250227012021.1778144-10-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Implement callbacks for MSR operations	Isaku Yamahata
	Add functions to implement MSR related callbacks, .set_msr(), .get_msr(), and .has_emulated_msr(), for preparation of handling hypercalls from TDX guest for PV RDMSR and WRMSR. Ignore KVM_REQ_MSR_FILTER_CHANGED for TDX. There are three classes of MSR virtualization for TDX. - Non-configurable: TDX module directly virtualizes it. VMM can't configure it, the value set by KVM_SET_MSRS is ignored. - Configurable: TDX module directly virtualizes it. VMM can configure it at VM creation time. The value set by KVM_SET_MSRS is used. - #VE case: TDX guest would issue TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> and VMM handles the MSR hypercall. The value set by KVM_SET_MSRS is used. For the MSRs belonging to the #VE case, the TDX module injects #VE to the TDX guest upon RDMSR or WRMSR. The exact list of such MSRs is defined in TDX Module ABI Spec. Upon #VE, the TDX guest may call TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>, which are defined in GHCI (Guest-Host Communication Interface) so that the host VMM (e.g. KVM) can virtualize the MSRs. TDX doesn't allow VMM to configure interception of MSR accesses. Ignore KVM_REQ_MSR_FILTER_CHANGED for TDX guest. If the userspace has set any MSR filters, it will be applied when handling TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> in a later patch. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250227012021.1778144-9-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle TDX PV HLT hypercall	Isaku Yamahata
	Handle TDX PV HLT hypercall and the interrupt status due to it. TDX guest status is protected, KVM can't get the interrupt status of TDX guest and it assumes interrupt is always allowed unless TDX guest calls TDVMCALL with HLT, which passes the interrupt blocked flag. If the guest halted with interrupt enabled, also query pending RVI by checking bit 0 of TD_VCPU_STATE_DETAILS_NON_ARCH field via a seamcall. Update vt_interrupt_allowed() for TDX based on interrupt blocked flag passed by HLT TDVMCALL. Do not wakeup TD vCPU if interrupt is blocked for VT-d PI. For NMIs, KVM cannot determine the NMI blocking status for TDX guests, so KVM always assumes NMIs are not blocked. In the unlikely scenario where a guest invokes the PV HLT hypercall within an NMI handler, this could result in a spurious wakeup. The guest should implement the PV HLT hypercall within a loop if it truly requires no interruptions, since NMI could be unblocked by an IRET due to an exception occurring before the PV HLT is executed in the NMI handler. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-7-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle TDX PV CPUID hypercall	Isaku Yamahata
	Handle TDX PV CPUID hypercall for the CPUIDs virtualized by VMM according to TDX Guest Host Communication Interface (GHCI). For TDX, most CPUID leaf/sub-leaf combinations are virtualized by the TDX module while some trigger #VE. On #VE, TDX guest can issue TDG.VP.VMCALL<INSTRUCTION.CPUID> (same value as EXIT_REASON_CPUID) to request VMM to emulate CPUID operation. Morph PV CPUID hypercall to EXIT_REASON_CPUID and wire up to the KVM backend function. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: rewrite changelog] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-6-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal	Yan Zhao
	Kick off all vCPUs and prevent tdh_vp_enter() from executing whenever tdh_mem_range_block()/tdh_mem_track()/tdh_mem_page_remove() encounters contention, since the page removal path does not expect error and is less sensitive to the performance penalty caused by kicking off vCPUs. Although KVM has protected SEPT zap-related SEAMCALLs with kvm->mmu_lock, KVM may still encounter TDX_OPERAND_BUSY due to the contention in the TDX module. - tdh_mem_track() may contend with tdh_vp_enter(). - tdh_mem_range_block()/tdh_mem_page_remove() may contend with tdh_vp_enter() and TDCALLs. Resources SHARED users EXCLUSIVE users ------------------------------------------------------------ TDCS epoch tdh_vp_enter tdh_mem_track ------------------------------------------------------------ SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation) tdh_mem_range_block ------------------------------------------------------------ SEPT entry tdh_mem_range_block (Host lock) tdh_mem_page_remove (Host lock) tdg_mem_page_accept (Guest lock) tdg_mem_page_attr_rd (Guest lock) tdg_mem_page_attr_wr (Guest lock) Use a TDX specific per-VM flag wait_for_sept_zap along with KVM_REQ_OUTSIDE_GUEST_MODE to kick off vCPUs and prevent them from entering TD, thereby avoiding the potential contention. Apply the kick-off and no vCPU entering only after each SEAMCALL busy error to minimize the window of no TD entry, as the contention due to 0-step mitigation or TDCALLs is expected to be rare. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250227012021.1778144-5-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY	Yan Zhao
	Retry locally in the TDX EPT violation handler for private memory to reduce the chances for tdh_mem_sept_add()/tdh_mem_page_aug() to contend with tdh_vp_enter(). TDX EPT violation installs private pages via tdh_mem_sept_add() and tdh_mem_page_aug(). The two may have contention with tdh_vp_enter() or TDCALLs. Resources SHARED users EXCLUSIVE users ------------------------------------------------------------ SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation) tdh_mem_page_aug ------------------------------------------------------------ SEPT entry tdh_mem_sept_add (Host lock) tdh_mem_page_aug (Host lock) tdg_mem_page_accept (Guest lock) tdg_mem_page_attr_rd (Guest lock) tdg_mem_page_attr_wr (Guest lock) Though the contention between tdh_mem_sept_add()/tdh_mem_page_aug() and TDCALLs may be removed in future TDX module, their contention with tdh_vp_enter() due to 0-step mitigation still persists. The TDX module may trigger 0-step mitigation in SEAMCALL TDH.VP.ENTER, which works as follows: 0. Each TDH.VP.ENTER records the guest RIP on TD entry. 1. When the TDX module encounters a VM exit with reason EPT_VIOLATION, it checks if the guest RIP is the same as last guest RIP on TD entry. -if yes, it means the EPT violation is caused by the same instruction that caused the last VM exit. Then, the TDX module increases the guest RIP no-progress count. When the count increases from 0 to the threshold (currently 6), the TDX module records the faulting GPA into a last_epf_gpa_list. -if no, it means the guest RIP has made progress. So, the TDX module resets the RIP no-progress count and the last_epf_gpa_list. 2. On the next TDH.VP.ENTER, the TDX module (after saving the guest RIP on TD entry) checks if the last_epf_gpa_list is empty. -if yes, TD entry continues without acquiring the lock on the SEPT tree. -if no, it triggers the 0-step mitigation by acquiring the exclusive lock on SEPT tree, walking the EPT tree to check if all page faults caused by the GPAs in the last_epf_gpa_list have been resolved before continuing TD entry. Since KVM TDP MMU usually re-enters guest whenever it exits to userspace (e.g. for KVM_EXIT_MEMORY_FAULT) or encounters a BUSY, it is possible for a tdh_vp_enter() to be called more than the threshold count before a page fault is addressed, triggering contention when tdh_vp_enter() attempts to acquire exclusive lock on SEPT tree. Retry locally in TDX EPT violation handler to reduce the count of invoking tdh_vp_enter(), hence reducing the possibility of its contention with tdh_mem_sept_add()/tdh_mem_page_aug(). However, the 0-step mitigation and the contention are still not eliminated due to KVM_EXIT_MEMORY_FAULT, signals/interrupts, and cases when one instruction faults more GFNs than the threshold count. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250227012021.1778144-4-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Detect unexpected SEPT violations due to pending SPTEs	Yan Zhao
	Detect SEPT violations that occur when an SEPT entry is in PENDING state while the TD is configured not to receive #VE on SEPT violations. A TD guest can be configured not to receive #VE by setting SEPT_VE_DISABLE to 1 in tdh_mng_init() or modifying pending_ve_disable to 1 in TDCS when flexible_pending_ve is permitted. In such cases, the TDX module will not inject #VE into the TD upon encountering an EPT violation caused by an SEPT entry in the PENDING state. Instead, TDX module will exit to VMM and set extended exit qualification type to PENDING_EPT_VIOLATION and exit qualification bit 6:3 to 0. Since #VE will not be injected to such TDs, they are not able to be notified to accept a GPA. TD accessing before accepting a private GPA is regarded as an error within the guest. Detect such guest error by inspecting the (extended) exit qualification bits and make such VM dead. Cc: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-3-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle EPT violation/misconfig exit	Isaku Yamahata
	For TDX, on EPT violation, call common __vmx_handle_ept_violation() to trigger x86 MMU code; on EPT misconfiguration, bug the VM since it shouldn't happen. EPT violation due to instruction fetch should never be triggered from shared memory in TDX guest. If such EPT violation occurs, treat it as broken hardware. EPT misconfiguration shouldn't happen on neither shared nor secure EPT for TDX guests. - TDX module guarantees no EPT misconfiguration on secure EPT. Per TDX module v1.5 spec section 9.4 "Secure EPT Induced TD Exits": "By design, since secure EPT is fully controlled by the TDX module, an EPT misconfiguration on a private GPA indicates a TDX module bug and is handled as a fatal error." - For shared EPT, the MMIO caching optimization, which is the only case where current KVM configures EPT entries to generate EPT misconfiguration, is implemented in a different way for TDX guests. KVM configures EPT entries to non-present value without suppressing #VE bit. It causes #VE in the TDX guest and the guest will call TDG.VP.VMCALL to request MMIO emulation. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> [binbin: rework changelog] Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-2-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle EXIT_REASON_OTHER_SMI	Isaku Yamahata
	Handle VM exit caused by "other SMI" for TDX, by returning back to userspace for Machine Check System Management Interrupt (MSMI) case or ignoring it and resume vCPU for non-MSMI case. For VMX, SMM transition can happen in both VMX non-root mode and VMX root mode. Unlike VMX, in SEAM root mode (TDX module), all interrupts are blocked. If an SMI occurs in SEAM non-root mode (TD guest), the SMI causes VM exit to TDX module, then SEAMRET to KVM. Once it exits to KVM, SMI is delivered and handled by kernel handler right away. An SMI can be "I/O SMI" or "other SMI". For TDX, there will be no I/O SMI because I/O instructions inside TDX guest trigger #VE and TDX guest needs to use TDVMCALL to request VMM to do I/O emulation. For "other SMI", there are two cases: - MSMI case. When BIOS eMCA MCE-SMI morphing is enabled, the #MC occurs in TDX guest will be delivered as an MSMI. It causes an EXIT_REASON_OTHER_SMI VM exit with MSMI (bit 0) set in the exit qualification. On VM exit, TDX module checks whether the "other SMI" is caused by an MSMI or not. If so, TDX module marks TD as fatal, preventing further TD entries, and then completes the TD exit flow to KVM with the TDH.VP.ENTER outputs indicating TDX_NON_RECOVERABLE_TD. After TD exit, the MSMI is delivered and eventually handled by the kernel machine check handler (7911f145de5f x86/mce: Implement recovery for errors in TDX/SEAM non-root mode), i.e., the memory page is marked as poisoned and it won't be freed to the free list when the TDX guest is terminated. Since the TDX guest is dead, follow other non-recoverable cases, exit to userspace. - For non-MSMI case, KVM doesn't need to do anything, just continue TDX vCPU execution. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014757.897978-17-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT	Isaku Yamahata
	Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT exits for TDX. NMI Handling: Just like the VMX case, NMI remains blocked after exiting from TDX guest for NMI-induced exits []. Handle NMI-induced exits for TDX guests in the same way as they are handled for VMX guests, i.e., handle NMI in tdx_vcpu_enter_exit() by calling the vmx_handle_nmi() helper. Interrupt and Exception Handling: Similar to the VMX case, external interrupts and exceptions (machine check is the only exception type KVM handles for TDX guests) are handled in the .handle_exit_irqoff() callback. For other exceptions, because TDX guest state is protected, exceptions in TDX guests can't be intercepted. TDX VMM isn't supposed to handle these exceptions. If unexpected exception occurs, exit to userspace with KVM_EXIT_EXCEPTION. For external interrupt, increase the statistics, same as the VMX case. []: Some old TDX modules have a bug which makes NMI unblocked after exiting from TDX guest for NMI-induced exits. This could potentially lead to nested NMIs: a new NMI arrives when KVM is manually calling the host NMI handler. This is an architectural violation, but it doesn't have real harm until FRED is enabled together with TDX (for non-FRED, the host NMI handler can handle nested NMIs). Given this is rare to happen and has no real harm, ignore this for the initial TDX support. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014757.897978-16-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Force APICv active for TDX guest	Isaku Yamahata
	Force APICv active for TDX guests in KVM because APICv is always enabled by TDX module. From the view of KVM, whether APICv state is active or not is decided by: 1. APIC is hw enabled 2. VM and vCPU have no inhibit reasons set. After TDX vCPU init, APIC is set to x2APIC mode. KVM_SET_{SREGS,SREGS2} are rejected due to has_protected_state for TDs and guest_state_protected for TDX vCPUs are set. Reject KVM_{GET,SET}_LAPIC from userspace since migration is not supported yet, so that userspace cannot disable APIC. For various APICv inhibit reasons: - APICV_INHIBIT_REASON_DISABLED is impossible after checking enable_apicv in tdx_bringup(). If !enable_apicv, TDX support will be disabled. - APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED is impossible since x2APIC is mandatory, KVM emulates APIC_ID as read-only for x2APIC mode. (Note: APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED could be set if the memory allocation fails for KVM apic_map.) - APICV_INHIBIT_REASON_HYPERV is impossible since TDX doesn't support HyperV guest yet. - APICV_INHIBIT_REASON_ABSENT is impossible since in-kernel LAPIC is checked in tdx_vcpu_create(). - APICV_INHIBIT_REASON_BLOCKIRQ is impossible since TDX doesn't support KVM_SET_GUEST_DEBUG. - APICV_INHIBIT_REASON_APIC_ID_MODIFIED is impossible since x2APIC is mandatory. - APICV_INHIBIT_REASON_APIC_BASE_MODIFIED is impossible since KVM rejects userspace to set APIC base. - The rest inhibit reasons are relevant only to AMD's AVIC, including APICV_INHIBIT_REASON_NESTED, APICV_INHIBIT_REASON_IRQWIN, APICV_INHIBIT_REASON_PIT_REINJ, APICV_INHIBIT_REASON_SEV, and APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED. (For APICV_INHIBIT_REASON_PIT_REINJ, similar to AVIC, KVM can't intercept EOI for TDX guests neither, but KVM enforces KVM_IRQCHIP_SPLIT for TDX guests, which eliminates the in-kernel PIT.) Implement vt_refresh_apicv_exec_ctrl() to call KVM_BUG_ON() if APICv is disabled for TDX guests. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-12-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Enforce KVM_IRQCHIP_SPLIT for TDX guests	Binbin Wu
	Enforce KVM_IRQCHIP_SPLIT for TDX guests to disallow in-kernel I/O APIC while in-kernel local APIC is needed. APICv is always enabled by TDX module and TDX Module doesn't allow the hypervisor to modify the EOI-bitmap, i.e. all EOIs are accelerated and never trigger exits. Level-triggered interrupts and other things depending on EOI VM-Exit can't be faithfully emulated in KVM. Also, the lazy check of pending APIC EOI for RTC edge-triggered interrupts, which was introduced as a workaround when EOI cannot be intercepted, doesn't work for TDX either because kvm_apic_pending_eoi() checks vIRR and vISR, but both values are invisible in KVM. If the guest induces generation of a level-triggered interrupt, the VMM is left with the choice of dropping the interrupt, sending it as-is, or converting it to an edge-triggered interrupt. Ditto for KVM. All of those options will make the guest unhappy. There's no architectural behavior KVM can provide that's better than sending the interrupt and hoping for the best. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-11-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Always block INIT/SIPI	Isaku Yamahata
	Always block INIT and SIPI events for the TDX guest because the TDX module doesn't provide API for VMM to inject INIT IPI or SIPI. TDX defines its own vCPU creation and initialization sequence including multiple seamcalls. Also, it's only allowed during TD build time. Given that TDX guest is para-virtualized to boot BSP/APs, normally there shouldn't be any INIT/SIPI event for TDX guest. If any, three options to handle them: 1. Always block INIT/SIPI request. 2. (Silently) ignore INIT/SIPI request during delivery. 3. Return error to guest TDs somehow. Choose option 1 for simplicity. Since INIT and SIPI are always blocked, INIT handling and the OP vcpu_deliver_sipi_vector() won't be called, no need to add new interface or helper function for INIT/SIPI delivery. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-10-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Implement methods to inject NMI	Isaku Yamahata
	Inject NMI to TDX guest by setting the PEND_NMI TDVPS field to 1, i.e. make the NMI pending in the TDX module. If there is a further pending NMI in KVM, collapse it to the one pending in the TDX module. VMM can request the TDX module to inject a NMI into a TDX vCPU by setting the PEND_NMI TDVPS field to 1. Following that, VMM can call TDH.VP.ENTER to run the vCPU and the TDX module will attempt to inject the NMI as soon as possible. KVM has the following 3 cases to inject two NMIs when handling simultaneous NMIs and they need to be injected in a back-to-back way. Otherwise, OS kernel may fire a warning about the unknown NMI [1]: K1. One NMI is being handled in the guest and one NMI pending in KVM. KVM requests NMI window exit to inject the pending NMI. K2. Two NMIs are pending in KVM. KVM injects the first NMI and requests NMI window exit to inject the second NMI. K3. A previous NMI needs to be rejected and one NMI pending in KVM. KVM first requests force immediate exit followed by a VM entry to complete the NMI rejection. Then, during the force immediate exit, KVM requests NMI window exit to inject the pending NMI. For TDX, PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI in the TDX module. Also, the vCPU state is protected, KVM doesn't know the NMI blocking states of TDX vCPU, KVM has to assume NMI is always unmasked and allowed. When KVM sees PEND_NMI is 1 after a TD exit, it means the previous NMI needs to be re-injected. Based on KVM's NMI handling flow, there are following 6 cases: In NMI handler TDX module KVM T1. No PEND_NMI=0 1 pending NMI T2. No PEND_NMI=0 2 pending NMIs T3. No PEND_NMI=1 1 pending NMI T4. Yes PEND_NMI=0 1 pending NMI T5. Yes PEND_NMI=0 2 pending NMIs T6. Yes PEND_NMI=1 1 pending NMI K1 is mapped to T4. K2 is mapped to T2 or T5. K3 is mapped to T3 or T6. Note: KVM doesn't know whether NMI is blocked by a NMI or not, case T5 and T6 can happen. When handling pending NMI in KVM for TDX guest, what KVM can do is to add a pending NMI in TDX module when PEND_NMI is 0. T1 and T4 can be handled by this way. However, TDX doesn't allow KVM to request NMI window exit directly, if PEND_NMI is already set and there is still pending NMI in KVM, the only way KVM could try is to request a force immediate exit. But for case T5 and T6, force immediate exit will result in infinite loop because force immediate exit makes it no progress in the NMI handler, so that the pending NMI in the TDX module can never be injected. Considering on X86 bare metal, multiple NMIs could collapse into one NMI, e.g. when NMI is blocked by SMI. It's OS's responsibility to poll all NMI sources in the NMI handler to avoid missing handling of some NMI events. Based on that, for the above 3 cases (K1-K3), only case K1 must inject the second NMI because the guest NMI handler may have already polled some of the NMI sources, which could include the source of the pending NMI, the pending NMI must be injected to avoid the lost of NMI. For case K2 and K3, the guest OS will poll all NMI sources (including the sources caused by the second NMI and further NMI collapsed) when the delivery of the first NMI, KVM doesn't have the necessity to inject the second NMI. To handle the NMI injection properly for TDX, there are two options: - Option 1: Modify the KVM's NMI handling common code, to collapse the second pending NMI for K2 and K3. - Option 2: Do it in TDX specific way. When the previous NMI is still pending in the TDX module, i.e. it has not been delivered to TDX guest yet, collapse the pending NMI in KVM into the previous one. This patch goes with option 2 because it is simple and doesn't impact other VM types. Option 1 may need more discussions. This is the first need to access vCPU scope metadata in the "management" class. Make needed accessors available. [1] https://lore.kernel.org/all/1317409584-23662-5-git-send-email-dzickus@redhat.com/ Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014757.897978-8-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Wait lapic expire when timer IRQ was injected	Isaku Yamahata
	Call kvm_wait_lapic_expire() when POSTED_INTR_ON is set and the vector for LVTT is set in PIR before TD entry. KVM always assumes a timer IRQ was injected if APIC state is protected. For TDX guest, APIC state is protected and KVM injects timer IRQ via posted interrupt. To avoid unnecessary wait calls, only call kvm_wait_lapic_expire() when a timer IRQ was injected, i.e., POSTED_INTR_ON is set and the vector for LVTT is set in PIR. Add a helper to test PIR. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-7-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Implement non-NMI interrupt injection	Isaku Yamahata
	Implement non-NMI interrupt injection for TDX via posted interrupt. As CPU state is protected and APICv is enabled for the TDX guest, TDX supports non-NMI interrupt injection only by posted interrupt. Posted interrupt descriptors (PIDs) are allocated in shared memory, KVM can update them directly. If target vCPU is in non-root mode, send posted interrupt notification to the vCPU and hardware will sync PIR to vIRR atomically. Otherwise, kick it to pick up the interrupt from PID. To post pending interrupts in the PID, KVM can generate a self-IPI with notification vector prior to TD entry. Since the guest status of TD vCPU is protected, assume interrupt is always allowed. Ignore the code path for event injection mechanism or LAPIC emulation for TDX. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014757.897978-5-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Disable PI wakeup for IPIv	Isaku Yamahata
	Disable PI wakeup for IPI virtualization (IPIv) case for TDX. When a vCPU is being scheduled out, notification vector is switched and pi_wakeup_handler() is enabled when the vCPU has interrupt enabled and posted interrupt is used to wake up the vCPU. For VMX, a blocked vCPU can be the target of posted interrupts when using IPIv or VT-d PI. TDX doesn't support IPIv, disable PI wakeup for IPIv. Also, since the guest status of TD vCPU is protected, assume interrupt is always enabled for TD. (PV HLT hypercall is not support yet, TDX guest tells VMM whether HLT is called with interrupt disabled or not.) Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: split into new patch] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-3-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Add support for find pending IRQ in a protected local APIC	Sean Christopherson
	Add flag and hook to KVM's local APIC management to support determining whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual APIC page is owned by the TDX module and cannot be accessed by KVM. As a result, registers that are virtualized by the CPU, e.g. PPR, cannot be read or written by KVM. To deliver interrupts for TDX guests, KVM must send an IRQ to the CPU on the posted interrupt notification vector. And to determine if TDX vCPU has a pending interrupt, KVM must check if there is an outstanding notification. Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is protected to short-circuit the various other flows that try to pull an IRQ out of the vAPIC, the only valid operation is querying _if_ an IRQ is pending, KVM can't do anything based on _which_ IRQ is pending. Intentionally omit sanity checks from other flows, e.g. PPR update, so as not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM and userspace will never reach those flows for TDX guests, but reaching them is not fatal if something does go awry. For the TD exits not due to HLT TDCALL, skip checking RVI pending in tdx_protected_apic_has_interrupt(). Except for the guest being stupid (e.g., non-HLT TDCALL in an interrupt shadow), it's not even possible to have an interrupt in RVI that is fully unmasked. There is no any CPU flows that modify RVI in the middle of instruction execution. I.e. if RVI is non-zero, then either the interrupt has been pending since before the TD exit, or the instruction caused the TD exit is in an STI/SS shadow. KVM doesn't care about STI/SS shadows outside of the HALTED case. And if the interrupt was pending before TD exit, then it _must_ be blocked, otherwise the interrupt would have been serviced at the instruction boundary. For the HLT TDCALL case, it will be handled in a future patch when HLT TDCALL is supported. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-2-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle TDX PV MMIO hypercall	Sean Christopherson
	Handle TDX PV MMIO hypercall when TDX guest calls TDVMCALL with the leaf #VE.RequestMMIO (same value as EXIT_REASON_EPT_VIOLATION) according to TDX Guest Host Communication Interface (GHCI) spec. For TDX guests, VMM is not allowed to access vCPU registers and the private memory, and the code instructions must be fetched from the private memory. So MMIO emulation implemented for non-TDX VMs is not possible for TDX guests. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. The #VE handling is supposed to emulate the MMIO instruction inside the guest and convert it into a TDVMCALL with the leaf #VE.RequestMMIO, which equals to EXIT_REASON_EPT_VIOLATION. The requested MMIO address must be in shared GPA space. The shared bit is stripped after check because the existing code for MMIO emulation is not aware of the shared bit. The MMIO GPA shouldn't have a valid memslot, also the attribute of the GPA should be shared. KVM could do the checks before exiting to userspace, however, even if KVM does the check, there still will be race conditions between the check in KVM and the emulation of MMIO access in userspace due to a memslot hotplug, or a memory attribute conversion. If userspace doesn't check the attribute of the GPA and the attribute happens to be private, it will not pose a security risk or cause an MCE, but it can lead to another issue. E.g., in QEMU, treating a GPA with private attribute as shared when it falls within RAM's range can result in extra memory consumption during the emulation to the access to the HVA of the GPA. There are two options: 1) Do the check both in KVM and userspace. 2) Do the check only in QEMU. This patch chooses option 2, i.e. KVM omits the memslot and attribute checks, and expects userspace to do the checks. Similar to normal MMIO emulation, try to handle the MMIO in kernel first, if kernel can't support it, forward the request to userspace. Export needed symbols used for MMIO handling. Fragments handling is not needed for TDX PV MMIO because GPA is provided, if a MMIO access crosses page boundary, it should be continuous in GPA. Also, the size is limited to 1, 2, 4, 8 bytes. No further split needed. Allow cross page access because no extra handling needed after checking both start and end GPA are shared GPAs. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014225.897298-10-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle TDX PV port I/O hypercall	Isaku Yamahata
	Emulate port I/O requested by TDX guest via TDVMCALL with leaf Instruction.IO (same value as EXIT_REASON_IO_INSTRUCTION) according to TDX Guest Host Communication Interface (GHCI). All port I/O instructions inside the TDX guest trigger the #VE exception. On #VE triggered by I/O instructions, TDX guest can call TDVMCALL with leaf Instruction.IO to request VMM to emulate I/O instructions. Similar to normal port I/O emulation, try to handle the port I/O in kernel first, if kernel can't support it, forward the request to userspace. Note string I/O operations are not supported in TDX. Guest should unroll them before calling the TDVMCALL. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014225.897298-9-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError>	Binbin Wu
	Convert TDG.VP.VMCALL<ReportFatalError> to KVM_EXIT_SYSTEM_EVENT with a new type KVM_SYSTEM_EVENT_TDX_FATAL and forward it to userspace for handling. TD guest can use TDG.VP.VMCALL<ReportFatalError> to report the fatal error it has experienced. This hypercall is special because TD guest is requesting a termination with the error information, KVM needs to forward the hypercall to userspace anyway, KVM doesn't do parsing or conversion, it just dumps the 16 general-purpose registers to userspace and let userspace decide what to do. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014225.897298-8-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>	Binbin Wu
	Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE and forward it to userspace for handling. MapGPA is used by TDX guest to request to map a GPA range as private or shared memory. It needs to exit to userspace for handling. KVM has already implemented a similar hypercall KVM_HC_MAP_GPA_RANGE, which will exit to userspace with exit reason KVM_EXIT_HYPERCALL. Do sanity checks, convert TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE and forward the request to userspace. To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA range is split into 2MB chunks and check interrupt pending between chunks. This allows for timely injection of interrupts and prevents issues with guest lockup detection. TDX guest should retry the operation for the GPA starting at the address specified in R11 when the TDVMCALL return TDVMCALL_RETRY as status code. Note userspace needs to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE bit set for TD VM. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014225.897298-7-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL	Isaku Yamahata
	Handle KVM hypercall for TDX according to TDX Guest-Host Communication Interface (GHCI) specification. The TDX GHCI specification defines the ABI for the guest TD to issue hypercalls. When R10 is non-zero, it indicates the TDG.VP.VMCALL is vendor-specific. KVM uses R10 as KVM hypercall number and R11-R14 as 4 arguments, while the error code is returned in R10. Morph the TDG.VP.VMCALL with KVM hypercall to EXIT_REASON_VMCALL and marshall r10~r14 from vp_enter_args to the appropriate x86 registers for KVM hypercall handling. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014225.897298-6-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)	Isaku Yamahata
	Add a place holder and related helper functions for preparation of TDG.VP.VMCALL handling. The TDX module specification defines TDG.VP.VMCALL API (TDVMCALL for short) for the guest TD to call hypercall to VMM. When the guest TD issues a TDVMCALL, the guest TD exits to VMM with a new exit reason. The arguments from the guest TD and returned values from the VMM are passed in the guest registers. The guest RCX register indicates which registers are used. Define helper functions to access those registers. A new VMX exit reason TDCALL is added to indicate the exit is due to TDVMCALL from the guest TD. Define the TDCALL exit reason and add a place holder to handle such exit. Some leafs of TDCALL will be morphed to another VMX exit reason instead of EXIT_REASON_TDCALL, add a helper tdcall_to_vmx_exit_reason() as a place holder to do the conversion. Suggested-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Message-ID: <20250222014225.897298-5-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Add a place holder to handle TDX VM exit	Isaku Yamahata
	Introduce the wiring for handling TDX VM exits by implementing the callbacks .get_exit_info(), .get_entry_info(), and .handle_exit(). Additionally, add error handling during the TDX VM exit flow, and add a place holder to handle various exit reasons. Store VMX exit reason and exit qualification in struct vcpu_vt for TDX, so that TDX/VMX can use the same helpers to get exit reason and exit qualification. Store extended exit qualification and exit GPA info in struct vcpu_tdx because they are used by TDX code only. Contention Handling: The TDH.VP.ENTER operation may contend with TDH.MEM.* operations due to secure EPT or TD EPOCH. If the contention occurs, the return value will have TDX_OPERAND_BUSY set, prompting the vCPU to attempt re-entry into the guest with EXIT_FASTPATH_EXIT_HANDLED, not EXIT_FASTPATH_REENTER_GUEST, so that the interrupts pending during IN_GUEST_MODE can be delivered for sure. Otherwise, the requester of KVM_REQ_OUTSIDE_GUEST_MODE may be blocked endlessly. Error Handling: - TDX_SW_ERROR: This includes #UD caused by SEAMCALL instruction if the CPU isn't in VMX operation, #GP caused by SEAMCALL instruction when TDX isn't enabled by the BIOS, and TDX_SEAMCALL_VMFAILINVALID when SEAM firmware is not loaded or disabled. - TDX_ERROR: This indicates some check failed in the TDX module, preventing the vCPU from running. - Failed VM Entry: Exit to userspace with KVM_EXIT_FAIL_ENTRY. Handle it separately before handling TDX_NON_RECOVERABLE because when off-TD debug is not enabled, TDX_NON_RECOVERABLE is set. - TDX_NON_RECOVERABLE: Set by the TDX module when the error is non-recoverable, indicating that the TDX guest is dead or the vCPU is disabled. A special case is triple fault, which also sets TDX_NON_RECOVERABLE but exits to userspace with KVM_EXIT_SHUTDOWN, aligning with the VMX case. - Any unhandled VM exit reason will also return to userspace with KVM_EXIT_INTERNAL_ERROR. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Message-ID: <20250222014225.897298-4-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior	Isaku Yamahata
	Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest DRs. TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter, and resets DRs to architectural INIT state on TD exit. Use the new flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to save/restore guest DRs. KVM still needs to restore host DRs after TD exit if there are active breakpoints in the host, which is covered by the existing code. MOV-DR exiting is always cleared for TDX guests, so the handler for DR access is never called, and KVM_DEBUGREG_WONT_EXIT is never set. Add a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH are set. Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT(). Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: rework changelog] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20241210004946.3718496-2-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250129095902.16391-13-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Save and restore IA32_DEBUGCTL	Adrian Hunter
	Save the IA32_DEBUGCTL MSR before entering a TDX VCPU and restore it afterwards. The TDX Module preserves bits 1, 12, and 14, so if no other bits are set, no restore is done. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250129095902.16391-12-adrian.hunter@intel.com> Reviewed-by: Xiayao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Disable support for TSX and WAITPKG	Adrian Hunter
	Support for restoring IA32_TSX_CTRL MSR and IA32_UMWAIT_CONTROL MSR is not yet implemented, so disable support for TSX and WAITPKG for now. Clear the associated CPUID bits returned by KVM_TDX_CAPABILITIES, and return an error if those bits are set in KVM_TDX_INIT_VM. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250129095902.16391-11-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: restore user ret MSRs	Isaku Yamahata
	Several MSRs are clobbered on TD exit that are not used by Linux while in ring 0. Ensure the cached value of the MSR is updated on vcpu_put, and the MSRs themselves before returning to ring 3. Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250129095902.16391-10-adrian.hunter@intel.com> Reviewed-by: Xiayao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: restore host xsave state when exit from the guest TD	Isaku Yamahata
	On exiting from the guest TD, xsave state is clobbered; restore it. Do not use kvm_load_host_xsave_state(), as it relies on vcpu->arch to find out whether other KVM_RUN code has loaded guest state into XCR0/PKRU/XSS or not. In the case of TDX, the exit values are known independent of the guest CR0 and CR4, and in fact the latter are not available. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250129095902.16391-8-adrian.hunter@intel.com> [Rewrite to not use kvm_load_host_xsave_state. - Paolo] Reviewed-by: Xiayao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: vcpu_run: save/restore host state(host kernel gs)	Isaku Yamahata
	On entering/exiting TDX vcpu, preserved or clobbered CPU state is different from the VMX case. Add TDX hooks to save/restore host/guest CPU state. Save/restore kernel GS base MSR. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250129095902.16391-7-adrian.hunter@intel.com> Reviewed-by: Xiayao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>