linux/linux-stable.git - Linux kernel stable tree

Age	Commit message	Author
2025-03-14	KVM: TDX: Handle TDX PV HLT hypercall	Isaku Yamahata
	Handle TDX PV HLT hypercall and the interrupt status due to it. TDX guest status is protected, KVM can't get the interrupt status of TDX guest and it assumes interrupt is always allowed unless TDX guest calls TDVMCALL with HLT, which passes the interrupt blocked flag. If the guest halted with interrupt enabled, also query pending RVI by checking bit 0 of TD_VCPU_STATE_DETAILS_NON_ARCH field via a seamcall. Update vt_interrupt_allowed() for TDX based on interrupt blocked flag passed by HLT TDVMCALL. Do not wakeup TD vCPU if interrupt is blocked for VT-d PI. For NMIs, KVM cannot determine the NMI blocking status for TDX guests, so KVM always assumes NMIs are not blocked. In the unlikely scenario where a guest invokes the PV HLT hypercall within an NMI handler, this could result in a spurious wakeup. The guest should implement the PV HLT hypercall within a loop if it truly requires no interruptions, since NMI could be unblocked by an IRET due to an exception occurring before the PV HLT is executed in the NMI handler. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-7-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
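The interrupt-allowed rule described above can be sketched as a small standalone C model. This is a minimal sketch, not the KVM implementation: the struct, its field names, and the helper names are hypothetical, and the SEAMCALL that reads TD_VCPU_STATE_DETAILS_NON_ARCH is modelled as a plain field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU state; names are illustrative only. */
struct td_vcpu {
	bool halted_by_hlt;     /* last exit was a PV HLT TDVMCALL */
	bool hlt_intr_blocked;  /* "interrupt blocked" flag passed by HLT */
	uint64_t state_details; /* stands in for the value read via SEAMCALL */
};

/* Bit 0 of the state-details field reports a pending RVI. */
#define STATE_DETAILS_RVI_PENDING (1ULL << 0)

/* Interrupts are assumed allowed unless the guest halted with them blocked. */
static bool td_interrupt_allowed(const struct td_vcpu *v)
{
	if (!v->halted_by_hlt)
		return true;
	return !v->hlt_intr_blocked;
}

/* Only a guest that halted with interrupts enabled has its RVI queried. */
static bool td_rvi_pending(const struct td_vcpu *v)
{
	if (!v->halted_by_hlt || v->hlt_intr_blocked)
		return false;
	return (v->state_details & STATE_DETAILS_RVI_PENDING) != 0;
}
```

NMIs are deliberately left out of the model, since the text states that KVM always assumes they are not blocked.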
2025-03-14	KVM: TDX: Handle TDX PV CPUID hypercall	Isaku Yamahata
	Handle TDX PV CPUID hypercall for the CPUIDs virtualized by VMM according to TDX Guest Host Communication Interface (GHCI). For TDX, most CPUID leaf/sub-leaf combinations are virtualized by the TDX module while some trigger #VE. On #VE, TDX guest can issue TDG.VP.VMCALL<INSTRUCTION.CPUID> (same value as EXIT_REASON_CPUID) to request VMM to emulate CPUID operation. Morph PV CPUID hypercall to EXIT_REASON_CPUID and wire up to the KVM backend function. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: rewrite changelog] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-6-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Kick off vCPUs when SEAMCALL is busy during TD page removal	Yan Zhao
	Kick off all vCPUs and prevent tdh_vp_enter() from executing whenever tdh_mem_range_block()/tdh_mem_track()/tdh_mem_page_remove() encounters contention, since the page removal path does not expect error and is less sensitive to the performance penalty caused by kicking off vCPUs. Although KVM has protected SEPT zap-related SEAMCALLs with kvm->mmu_lock, KVM may still encounter TDX_OPERAND_BUSY due to the contention in the TDX module.
	- tdh_mem_track() may contend with tdh_vp_enter().
	- tdh_mem_range_block()/tdh_mem_page_remove() may contend with tdh_vp_enter() and TDCALLs.
	Resources     SHARED users          EXCLUSIVE users
	------------------------------------------------------------
	TDCS epoch    tdh_vp_enter          tdh_mem_track
	------------------------------------------------------------
	SEPT tree     tdh_mem_page_remove   tdh_vp_enter (0-step mitigation)
	                                    tdh_mem_range_block
	------------------------------------------------------------
	SEPT entry                          tdh_mem_range_block (Host lock)
	                                    tdh_mem_page_remove (Host lock)
	                                    tdg_mem_page_accept (Guest lock)
	                                    tdg_mem_page_attr_rd (Guest lock)
	                                    tdg_mem_page_attr_wr (Guest lock)
	Use a TDX specific per-VM flag wait_for_sept_zap along with KVM_REQ_OUTSIDE_GUEST_MODE to kick off vCPUs and prevent them from entering TD, thereby avoiding the potential contention. Apply the kick-off and no vCPU entering only after each SEAMCALL busy error to minimize the window of no TD entry, as the contention due to 0-step mitigation or TDCALLs is expected to be rare.
	Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250227012021.1778144-5-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
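The "kick only after a busy error" policy can be illustrated with a standalone C model. This is a sketch under stated assumptions: OPERAND_BUSY is a stand-in value rather than the real TDX_OPERAND_BUSY encoding, the kick counter models KVM_REQ_OUTSIDE_GUEST_MODE, and fake_seamcall() is a hypothetical stub. Only the wait_for_sept_zap flag name comes from the text.

```c
#include <assert.h>
#include <stdbool.h>

#define OPERAND_BUSY 1 /* stand-in, not the real TDX_OPERAND_BUSY value */

struct td_vm {
	bool wait_for_sept_zap; /* per-VM flag named in the description */
	unsigned int kicks;     /* counts simulated vCPU kick-offs */
};

typedef int (*seamcall_fn)(void *ctx);

/* Try once; on contention, block TD entry, kick vCPUs, and retry. */
static int zap_with_kick(struct td_vm *vm, seamcall_fn fn, void *ctx)
{
	int err = fn(ctx);

	if (err != OPERAND_BUSY)
		return err;

	vm->wait_for_sept_zap = true; /* vCPUs must not re-enter the TD */
	vm->kicks++;                  /* models KVM_REQ_OUTSIDE_GUEST_MODE */
	err = fn(ctx);
	vm->wait_for_sept_zap = false;
	return err;
}

/* A vCPU checks the flag before attempting TD entry. */
static bool vcpu_may_enter(const struct td_vm *vm)
{
	return !vm->wait_for_sept_zap;
}

/* Hypothetical stub: reports busy until the counter reaches zero. */
static int fake_seamcall(void *ctx)
{
	int *remaining_busy = ctx;

	if (*remaining_busy > 0) {
		(*remaining_busy)--;
		return OPERAND_BUSY;
	}
	return 0;
}
```

Because the flag is raised only on the retry path, the common uncontended case never stops TD entry, which matches the stated goal of minimizing the no-entry window.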
2025-03-14	KVM: TDX: Retry locally in TDX EPT violation handler on RET_PF_RETRY	Yan Zhao
	Retry locally in the TDX EPT violation handler for private memory to reduce the chances for tdh_mem_sept_add()/tdh_mem_page_aug() to contend with tdh_vp_enter(). TDX EPT violation installs private pages via tdh_mem_sept_add() and tdh_mem_page_aug(). The two may have contention with tdh_vp_enter() or TDCALLs.
	Resources     SHARED users          EXCLUSIVE users
	------------------------------------------------------------
	SEPT tree     tdh_mem_sept_add      tdh_vp_enter(0-step mitigation)
	              tdh_mem_page_aug
	------------------------------------------------------------
	SEPT entry                          tdh_mem_sept_add (Host lock)
	                                    tdh_mem_page_aug (Host lock)
	                                    tdg_mem_page_accept (Guest lock)
	                                    tdg_mem_page_attr_rd (Guest lock)
	                                    tdg_mem_page_attr_wr (Guest lock)
	Though the contention between tdh_mem_sept_add()/tdh_mem_page_aug() and TDCALLs may be removed in future TDX module, their contention with tdh_vp_enter() due to 0-step mitigation still persists. The TDX module may trigger 0-step mitigation in SEAMCALL TDH.VP.ENTER, which works as follows:
	0. Each TDH.VP.ENTER records the guest RIP on TD entry.
	1. When the TDX module encounters a VM exit with reason EPT_VIOLATION, it checks if the guest RIP is the same as last guest RIP on TD entry.
	   -if yes, it means the EPT violation is caused by the same instruction that caused the last VM exit. Then, the TDX module increases the guest RIP no-progress count. When the count increases from 0 to the threshold (currently 6), the TDX module records the faulting GPA into a last_epf_gpa_list.
	   -if no, it means the guest RIP has made progress. So, the TDX module resets the RIP no-progress count and the last_epf_gpa_list.
	2. On the next TDH.VP.ENTER, the TDX module (after saving the guest RIP on TD entry) checks if the last_epf_gpa_list is empty.
	   -if yes, TD entry continues without acquiring the lock on the SEPT tree.
	   -if no, it triggers the 0-step mitigation by acquiring the exclusive lock on SEPT tree, walking the EPT tree to check if all page faults caused by the GPAs in the last_epf_gpa_list have been resolved before continuing TD entry.
	Since KVM TDP MMU usually re-enters guest whenever it exits to userspace (e.g. for KVM_EXIT_MEMORY_FAULT) or encounters a BUSY, it is possible for a tdh_vp_enter() to be called more than the threshold count before a page fault is addressed, triggering contention when tdh_vp_enter() attempts to acquire exclusive lock on SEPT tree. Retry locally in TDX EPT violation handler to reduce the count of invoking tdh_vp_enter(), hence reducing the possibility of its contention with tdh_mem_sept_add()/tdh_mem_page_aug(). However, the 0-step mitigation and the contention are still not eliminated due to KVM_EXIT_MEMORY_FAULT, signals/interrupts, and cases when one instruction faults more GFNs than the threshold count.
	Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250227012021.1778144-4-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
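The numbered steps of the 0-step mitigation can be modelled in a few lines of C. This is an illustrative model of the behaviour described in the text, not TDX module code: the struct and function names are hypothetical, and last_epf_gpa_list is reduced to a single "non-empty" flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ZERO_STEP_THRESHOLD 6 /* "currently 6" per the description */

struct td_module_model {
	uint64_t entry_rip;       /* guest RIP recorded at the last TD entry */
	unsigned int no_progress; /* RIP no-progress count */
	bool gpa_list_nonempty;   /* stands in for last_epf_gpa_list */
};

/*
 * Steps 0 and 2: record the RIP, then report whether this entry has to
 * take the exclusive SEPT tree lock (true means mitigation triggers).
 */
static bool model_vp_enter(struct td_module_model *m, uint64_t rip)
{
	m->entry_rip = rip;
	return m->gpa_list_nonempty;
}

/* Step 1: an EPT violation exit taken at exit_rip. */
static void model_ept_violation_exit(struct td_module_model *m, uint64_t exit_rip)
{
	if (exit_rip == m->entry_rip) {
		/* Same instruction faulted again: no progress. */
		if (++m->no_progress >= ZERO_STEP_THRESHOLD)
			m->gpa_list_nonempty = true;
	} else {
		/* The guest advanced: reset the count and the list. */
		m->no_progress = 0;
		m->gpa_list_nonempty = false;
	}
}
```

In this model, six consecutive entries that fault at the same RIP make the seventh entry take the exclusive lock. Retrying locally in the fault handler reduces how many such entries occur before the fault is resolved.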
2025-03-14	KVM: TDX: Detect unexpected SEPT violations due to pending SPTEs	Yan Zhao
	Detect SEPT violations that occur when an SEPT entry is in PENDING state while the TD is configured not to receive #VE on SEPT violations. A TD guest can be configured not to receive #VE by setting SEPT_VE_DISABLE to 1 in tdh_mng_init() or modifying pending_ve_disable to 1 in TDCS when flexible_pending_ve is permitted. In such cases, the TDX module will not inject #VE into the TD upon encountering an EPT violation caused by an SEPT entry in the PENDING state. Instead, TDX module will exit to VMM and set extended exit qualification type to PENDING_EPT_VIOLATION and exit qualification bit 6:3 to 0. Since #VE will not be injected to such TDs, they are not able to be notified to accept a GPA. TD accessing before accepting a private GPA is regarded as an error within the guest. Detect such guest error by inspecting the (extended) exit qualification bits and make such VM dead. Cc: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-3-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
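The detection condition can be expressed as a predicate over the two exit qualification values. This is a minimal sketch: the macro names and the numeric encoding of the PENDING_EPT_VIOLATION type are assumptions made for illustration, and any additional checks the real handler performs (for example, that the GPA is private) are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encodings, for illustration only. */
#define EXT_EXIT_QUAL_TYPE_MASK             0xfULL
#define EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOL 6ULL
#define EXIT_QUAL_BITS_6_3                  (0xfULL << 3)

/*
 * True when the exit reports a PENDING-state SEPT violation with exit
 * qualification bits 6:3 all zero, i.e. the guest touched a private GPA
 * it had not accepted and could not be notified through #VE.
 */
static bool is_unexpected_pending_sept_violation(uint64_t exit_qual,
						 uint64_t ext_exit_qual)
{
	if ((ext_exit_qual & EXT_EXIT_QUAL_TYPE_MASK) !=
	    EXT_EXIT_QUAL_TYPE_PENDING_EPT_VIOL)
		return false;
	return (exit_qual & EXIT_QUAL_BITS_6_3) == 0;
}
```

A caller would treat a true result as a guest error and mark the VM dead, as the commit message describes.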
2025-03-14	KVM: TDX: Handle EPT violation/misconfig exit	Isaku Yamahata
	For TDX, on EPT violation, call common __vmx_handle_ept_violation() to trigger x86 MMU code; on EPT misconfiguration, bug the VM since it shouldn't happen. EPT violation due to instruction fetch should never be triggered from shared memory in TDX guest. If such EPT violation occurs, treat it as broken hardware. EPT misconfiguration shouldn't happen on either shared or secure EPT for TDX guests. - TDX module guarantees no EPT misconfiguration on secure EPT. Per TDX module v1.5 spec section 9.4 "Secure EPT Induced TD Exits": "By design, since secure EPT is fully controlled by the TDX module, an EPT misconfiguration on a private GPA indicates a TDX module bug and is handled as a fatal error." - For shared EPT, the MMIO caching optimization, which is the only case where current KVM configures EPT entries to generate EPT misconfiguration, is implemented in a different way for TDX guests. KVM configures EPT entries to non-present value without suppressing #VE bit. It causes #VE in the TDX guest and the guest will call TDG.VP.VMCALL to request MMIO emulation. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> [binbin: rework changelog] Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-2-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle EXIT_REASON_OTHER_SMI	Isaku Yamahata
	Handle VM exit caused by "other SMI" for TDX, by returning back to userspace for Machine Check System Management Interrupt (MSMI) case or ignoring it and resume vCPU for non-MSMI case. For VMX, SMM transition can happen in both VMX non-root mode and VMX root mode. Unlike VMX, in SEAM root mode (TDX module), all interrupts are blocked. If an SMI occurs in SEAM non-root mode (TD guest), the SMI causes VM exit to TDX module, then SEAMRET to KVM. Once it exits to KVM, SMI is delivered and handled by kernel handler right away. An SMI can be "I/O SMI" or "other SMI". For TDX, there will be no I/O SMI because I/O instructions inside TDX guest trigger #VE and TDX guest needs to use TDVMCALL to request VMM to do I/O emulation. For "other SMI", there are two cases: - MSMI case. When BIOS eMCA MCE-SMI morphing is enabled, the #MC occurs in TDX guest will be delivered as an MSMI. It causes an EXIT_REASON_OTHER_SMI VM exit with MSMI (bit 0) set in the exit qualification. On VM exit, TDX module checks whether the "other SMI" is caused by an MSMI or not. If so, TDX module marks TD as fatal, preventing further TD entries, and then completes the TD exit flow to KVM with the TDH.VP.ENTER outputs indicating TDX_NON_RECOVERABLE_TD. After TD exit, the MSMI is delivered and eventually handled by the kernel machine check handler (7911f145de5f x86/mce: Implement recovery for errors in TDX/SEAM non-root mode), i.e., the memory page is marked as poisoned and it won't be freed to the free list when the TDX guest is terminated. Since the TDX guest is dead, follow other non-recoverable cases, exit to userspace. - For non-MSMI case, KVM doesn't need to do anything, just continue TDX vCPU execution. 
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014757.897978-17-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT	Isaku Yamahata
	Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT exits for TDX. NMI Handling: Just like the VMX case, NMI remains blocked after exiting from TDX guest for NMI-induced exits []. Handle NMI-induced exits for TDX guests in the same way as they are handled for VMX guests, i.e., handle NMI in tdx_vcpu_enter_exit() by calling the vmx_handle_nmi() helper. Interrupt and Exception Handling: Similar to the VMX case, external interrupts and exceptions (machine check is the only exception type KVM handles for TDX guests) are handled in the .handle_exit_irqoff() callback. For other exceptions, because TDX guest state is protected, exceptions in TDX guests can't be intercepted. TDX VMM isn't supposed to handle these exceptions. If unexpected exception occurs, exit to userspace with KVM_EXIT_EXCEPTION. For external interrupt, increase the statistics, same as the VMX case. []: Some old TDX modules have a bug which makes NMI unblocked after exiting from TDX guest for NMI-induced exits. This could potentially lead to nested NMIs: a new NMI arrives when KVM is manually calling the host NMI handler. This is an architectural violation, but it doesn't have real harm until FRED is enabled together with TDX (for non-FRED, the host NMI handler can handle nested NMIs). Given this is rare to happen and has no real harm, ignore this for the initial TDX support. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014757.897978-16-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: VMX: Add a helper for NMI handling	Sean Christopherson
	Add a helper to handle NMI exits. TDX handles the NMI exit the same as VMX case. Add a helper to share the code with TDX, expose the helper in common.h. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-15-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: VMX: Move emulation_required to struct vcpu_vt	Binbin Wu
	Move emulation_required from struct vcpu_vmx to struct vcpu_vt so that vmx_handle_exit_irqoff() can be reused by TDX code. No functional change intended. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-14-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Add methods to ignore virtual apic related operation	Isaku Yamahata
	TDX protects TDX guest APIC state from VMM. Implement access methods of TDX guest vAPIC state to ignore them or return zero. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-13-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Force APICv active for TDX guest	Isaku Yamahata
	Force APICv active for TDX guests in KVM because APICv is always enabled by TDX module. From the view of KVM, whether APICv state is active or not is decided by: 1. APIC is hw enabled 2. VM and vCPU have no inhibit reasons set. After TDX vCPU init, APIC is set to x2APIC mode. KVM_SET_{SREGS,SREGS2} are rejected due to has_protected_state for TDs and guest_state_protected for TDX vCPUs are set. Reject KVM_{GET,SET}_LAPIC from userspace since migration is not supported yet, so that userspace cannot disable APIC. For various APICv inhibit reasons: - APICV_INHIBIT_REASON_DISABLED is impossible after checking enable_apicv in tdx_bringup(). If !enable_apicv, TDX support will be disabled. - APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED is impossible since x2APIC is mandatory, KVM emulates APIC_ID as read-only for x2APIC mode. (Note: APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED could be set if the memory allocation fails for KVM apic_map.) - APICV_INHIBIT_REASON_HYPERV is impossible since TDX doesn't support HyperV guest yet. - APICV_INHIBIT_REASON_ABSENT is impossible since in-kernel LAPIC is checked in tdx_vcpu_create(). - APICV_INHIBIT_REASON_BLOCKIRQ is impossible since TDX doesn't support KVM_SET_GUEST_DEBUG. - APICV_INHIBIT_REASON_APIC_ID_MODIFIED is impossible since x2APIC is mandatory. - APICV_INHIBIT_REASON_APIC_BASE_MODIFIED is impossible since KVM rejects userspace to set APIC base. - The rest inhibit reasons are relevant only to AMD's AVIC, including APICV_INHIBIT_REASON_NESTED, APICV_INHIBIT_REASON_IRQWIN, APICV_INHIBIT_REASON_PIT_REINJ, APICV_INHIBIT_REASON_SEV, and APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED. (For APICV_INHIBIT_REASON_PIT_REINJ, similar to AVIC, KVM can't intercept EOI for TDX guests neither, but KVM enforces KVM_IRQCHIP_SPLIT for TDX guests, which eliminates the in-kernel PIT.) Implement vt_refresh_apicv_exec_ctrl() to call KVM_BUG_ON() if APICv is disabled for TDX guests. 
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-12-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Enforce KVM_IRQCHIP_SPLIT for TDX guests	Binbin Wu
	Enforce KVM_IRQCHIP_SPLIT for TDX guests to disallow in-kernel I/O APIC while in-kernel local APIC is needed. APICv is always enabled by TDX module and TDX Module doesn't allow the hypervisor to modify the EOI-bitmap, i.e. all EOIs are accelerated and never trigger exits. Level-triggered interrupts and other things depending on EOI VM-Exit can't be faithfully emulated in KVM. Also, the lazy check of pending APIC EOI for RTC edge-triggered interrupts, which was introduced as a workaround when EOI cannot be intercepted, doesn't work for TDX either because kvm_apic_pending_eoi() checks vIRR and vISR, but both values are invisible in KVM. If the guest induces generation of a level-triggered interrupt, the VMM is left with the choice of dropping the interrupt, sending it as-is, or converting it to an edge-triggered interrupt. Ditto for KVM. All of those options will make the guest unhappy. There's no architectural behavior KVM can provide that's better than sending the interrupt and hoping for the best. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-11-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Always block INIT/SIPI	Isaku Yamahata
	Always block INIT and SIPI events for the TDX guest because the TDX module doesn't provide API for VMM to inject INIT IPI or SIPI. TDX defines its own vCPU creation and initialization sequence including multiple seamcalls. Also, it's only allowed during TD build time. Given that TDX guest is para-virtualized to boot BSP/APs, normally there shouldn't be any INIT/SIPI event for TDX guest. If any, three options to handle them: 1. Always block INIT/SIPI request. 2. (Silently) ignore INIT/SIPI request during delivery. 3. Return error to guest TDs somehow. Choose option 1 for simplicity. Since INIT and SIPI are always blocked, INIT handling and the OP vcpu_deliver_sipi_vector() won't be called, no need to add new interface or helper function for INIT/SIPI delivery. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-10-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle SMI request as !CONFIG_KVM_SMM	Isaku Yamahata
	Handle SMI request as what KVM does for CONFIG_KVM_SMM=n, i.e. return -ENOTTY, and add KVM_BUG_ON() to SMI related OPs for TD. TDX doesn't support system-management mode (SMM) and system-management interrupt (SMI) in guest TDs. Because guest state (vCPU state, memory state) is protected, it must go through the TDX module APIs to change guest state. However, the TDX module doesn't provide a way for VMM to inject SMI into guest TD or a way for VMM to switch guest vCPU mode into SMM. MSR_IA32_SMBASE will not be emulated for TDX guest, -ENOTTY will be returned when SMI is requested. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-9-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Implement methods to inject NMI	Isaku Yamahata
	Inject NMI to TDX guest by setting the PEND_NMI TDVPS field to 1, i.e. make the NMI pending in the TDX module. If there is a further pending NMI in KVM, collapse it to the one pending in the TDX module. VMM can request the TDX module to inject a NMI into a TDX vCPU by setting the PEND_NMI TDVPS field to 1. Following that, VMM can call TDH.VP.ENTER to run the vCPU and the TDX module will attempt to inject the NMI as soon as possible. KVM has the following 3 cases to inject two NMIs when handling simultaneous NMIs and they need to be injected in a back-to-back way. Otherwise, OS kernel may fire a warning about the unknown NMI [1]:
	K1. One NMI is being handled in the guest and one NMI pending in KVM. KVM requests NMI window exit to inject the pending NMI.
	K2. Two NMIs are pending in KVM. KVM injects the first NMI and requests NMI window exit to inject the second NMI.
	K3. A previous NMI needs to be rejected and one NMI pending in KVM. KVM first requests force immediate exit followed by a VM entry to complete the NMI rejection. Then, during the force immediate exit, KVM requests NMI window exit to inject the pending NMI.
	For TDX, PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI in the TDX module. Also, the vCPU state is protected, KVM doesn't know the NMI blocking states of TDX vCPU, KVM has to assume NMI is always unmasked and allowed. When KVM sees PEND_NMI is 1 after a TD exit, it means the previous NMI needs to be re-injected. Based on KVM's NMI handling flow, there are following 6 cases:
	     In NMI handler    TDX module    KVM
	T1.  No                PEND_NMI=0    1 pending NMI
	T2.  No                PEND_NMI=0    2 pending NMIs
	T3.  No                PEND_NMI=1    1 pending NMI
	T4.  Yes               PEND_NMI=0    1 pending NMI
	T5.  Yes               PEND_NMI=0    2 pending NMIs
	T6.  Yes               PEND_NMI=1    1 pending NMI
	K1 is mapped to T4. K2 is mapped to T2 or T5. K3 is mapped to T3 or T6. Note: KVM doesn't know whether NMI is blocked by a NMI or not, case T5 and T6 can happen.
When handling pending NMI in KVM for TDX guest, what KVM can do is to add a pending NMI in TDX module when PEND_NMI is 0. T1 and T4 can be handled by this way. However, TDX doesn't allow KVM to request NMI window exit directly, if PEND_NMI is already set and there is still pending NMI in KVM, the only way KVM could try is to request a force immediate exit. But for case T5 and T6, force immediate exit will result in infinite loop because force immediate exit makes it no progress in the NMI handler, so that the pending NMI in the TDX module can never be injected. Considering on X86 bare metal, multiple NMIs could collapse into one NMI, e.g. when NMI is blocked by SMI. It's OS's responsibility to poll all NMI sources in the NMI handler to avoid missing handling of some NMI events. Based on that, for the above 3 cases (K1-K3), only case K1 must inject the second NMI because the guest NMI handler may have already polled some of the NMI sources, which could include the source of the pending NMI, the pending NMI must be injected to avoid the lost of NMI. For case K2 and K3, the guest OS will poll all NMI sources (including the sources caused by the second NMI and further NMI collapsed) when the delivery of the first NMI, KVM doesn't have the necessity to inject the second NMI. To handle the NMI injection properly for TDX, there are two options: - Option 1: Modify the KVM's NMI handling common code, to collapse the second pending NMI for K2 and K3. - Option 2: Do it in TDX specific way. When the previous NMI is still pending in the TDX module, i.e. it has not been delivered to TDX guest yet, collapse the pending NMI in KVM into the previous one. This patch goes with option 2 because it is simple and doesn't impact other VM types. Option 1 may need more discussions. This is the first need to access vCPU scope metadata in the "management" class. Make needed accessors available. 
[1] https://lore.kernel.org/all/1317409584-23662-5-git-send-email-dzickus@redhat.com/ Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014757.897978-8-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
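Option 2 (collapsing in a TDX-specific way) can be illustrated with a toy C model. This is a sketch under stated assumptions: the struct and helper are hypothetical, the 1-bit PEND_NMI field is modelled as a bool, and the write to the TDVPS field is reduced to an assignment.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the NMI state split between the module and KVM. */
struct td_vcpu_nmi {
	bool pend_nmi;                /* 1-bit PEND_NMI field in the TDX module */
	unsigned int kvm_nmi_pending; /* NMIs still queued on the KVM side */
};

/*
 * Make one NMI pending in the module. Since PEND_NMI holds a single
 * bit, any further NMI queued in KVM is collapsed into that one. The
 * guest is expected to poll all NMI sources when it handles the NMI.
 */
static void td_inject_nmi(struct td_vcpu_nmi *v)
{
	if (v->kvm_nmi_pending == 0)
		return;

	v->pend_nmi = true;     /* setting an already-set bit is the collapse */
	v->kvm_nmi_pending = 0; /* nothing is left to request a window for */
}
```

The model never requests an NMI window or a forced immediate exit, which is how it sidesteps the infinite loop described for cases T5 and T6.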
2025-03-14	KVM: TDX: Wait lapic expire when timer IRQ was injected	Isaku Yamahata
	Call kvm_wait_lapic_expire() when POSTED_INTR_ON is set and the vector for LVTT is set in PIR before TD entry. KVM always assumes a timer IRQ was injected if APIC state is protected. For TDX guest, APIC state is protected and KVM injects timer IRQ via posted interrupt. To avoid unnecessary wait calls, only call kvm_wait_lapic_expire() when a timer IRQ was injected, i.e., POSTED_INTR_ON is set and the vector for LVTT is set in PIR. Add a helper to test PIR. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-7-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
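The condition for calling kvm_wait_lapic_expire() can be sketched as a bitmap test. This is a standalone model, not the helper the patch adds: the descriptor struct and function names are hypothetical, and PIR is modelled as four 64-bit words covering 256 vectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PIR_WORDS 4 /* 256 vectors / 64 bits per word */

/* Hypothetical, non-atomic model of a posted-interrupt descriptor. */
struct pi_desc_model {
	uint64_t pir[PIR_WORDS]; /* posted-interrupt requests, one bit per vector */
	bool on;                 /* outstanding notification (POSTED_INTR_ON) */
};

/* Test whether a given vector is posted in PIR. */
static bool pir_test_vector(const struct pi_desc_model *pi, uint8_t vector)
{
	return (pi->pir[vector / 64] >> (vector % 64)) & 1;
}

/* Wait only if a notification is outstanding and the LVTT vector is posted. */
static bool should_wait_lapic_expire(const struct pi_desc_model *pi,
				     uint8_t lvtt_vector)
{
	return pi->on && pir_test_vector(pi, lvtt_vector);
}
```

Checking both conditions is what avoids the unnecessary wait calls that would result from assuming a timer IRQ was injected on every entry.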
2025-03-14	KVM: TDX: Implement non-NMI interrupt injection	Isaku Yamahata
	Implement non-NMI interrupt injection for TDX via posted interrupt. As CPU state is protected and APICv is enabled for the TDX guest, TDX supports non-NMI interrupt injection only by posted interrupt. Posted interrupt descriptors (PIDs) are allocated in shared memory, KVM can update them directly. If target vCPU is in non-root mode, send posted interrupt notification to the vCPU and hardware will sync PIR to vIRR atomically. Otherwise, kick it to pick up the interrupt from PID. To post pending interrupts in the PID, KVM can generate a self-IPI with notification vector prior to TD entry. Since the guest status of TD vCPU is protected, assume interrupt is always allowed. Ignore the code path for event injection mechanism or LAPIC emulation for TDX. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014757.897978-5-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
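The delivery decision can be sketched in standalone C. This is a simplified model: the names are hypothetical, and the real code updates the descriptor atomically, whereas plain assignments are used here for clarity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, non-atomic model of a posted-interrupt descriptor (PID). */
struct pid_model {
	uint64_t pir[4]; /* one bit per vector, 256 vectors total */
	bool on;         /* outstanding-notification bit */
};

enum pi_action {
	PI_SEND_NOTIFICATION, /* target is in non-root mode: hardware syncs PIR to vIRR */
	PI_KICK_VCPU,         /* otherwise: kick the vCPU to pick up the interrupt */
};

/* Post the vector in the shared-memory PID and choose how to notify. */
static enum pi_action post_interrupt(struct pid_model *pid, uint8_t vector,
				     bool vcpu_in_guest_mode)
{
	pid->pir[vector / 64] |= 1ULL << (vector % 64);
	pid->on = true;
	return vcpu_in_guest_mode ? PI_SEND_NOTIFICATION : PI_KICK_VCPU;
}
```

Because the PID lives in shared memory, KVM can update it directly without going through the TDX module.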
2025-03-14	KVM: VMX: Move posted interrupt delivery code to common header	Isaku Yamahata
	Move posted interrupt delivery code to common header so that TDX can leverage it. No functional change intended. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: split into new patch] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Message-ID: <20250222014757.897978-4-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Disable PI wakeup for IPIv	Isaku Yamahata
	Disable PI wakeup for IPI virtualization (IPIv) case for TDX. When a vCPU is being scheduled out, notification vector is switched and pi_wakeup_handler() is enabled when the vCPU has interrupt enabled and posted interrupt is used to wake up the vCPU. For VMX, a blocked vCPU can be the target of posted interrupts when using IPIv or VT-d PI. TDX doesn't support IPIv, disable PI wakeup for IPIv. Also, since the guest status of TD vCPU is protected, assume interrupt is always enabled for TD. (The PV HLT hypercall, through which a TDX guest tells the VMM whether HLT is called with interrupts disabled, is not supported yet.) Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: split into new patch] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-3-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Add support for find pending IRQ in a protected local APIC	Sean Christopherson
	Add flag and hook to KVM's local APIC management to support determining whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual APIC page is owned by the TDX module and cannot be accessed by KVM. As a result, registers that are virtualized by the CPU, e.g. PPR, cannot be read or written by KVM. To deliver interrupts for TDX guests, KVM must send an IRQ to the CPU on the posted interrupt notification vector. And to determine if TDX vCPU has a pending interrupt, KVM must check if there is an outstanding notification. Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is protected to short-circuit the various other flows that try to pull an IRQ out of the vAPIC, the only valid operation is querying _if_ an IRQ is pending, KVM can't do anything based on _which_ IRQ is pending. Intentionally omit sanity checks from other flows, e.g. PPR update, so as not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM and userspace will never reach those flows for TDX guests, but reaching them is not fatal if something does go awry. For the TD exits not due to HLT TDCALL, skip checking RVI pending in tdx_protected_apic_has_interrupt(). Except for the guest being stupid (e.g., non-HLT TDCALL in an interrupt shadow), it's not even possible to have an interrupt in RVI that is fully unmasked. There is no any CPU flows that modify RVI in the middle of instruction execution. I.e. if RVI is non-zero, then either the interrupt has been pending since before the TD exit, or the instruction caused the TD exit is in an STI/SS shadow. KVM doesn't care about STI/SS shadows outside of the HALTED case. And if the interrupt was pending before TD exit, then it _must_ be blocked, otherwise the interrupt would have been serviced at the instruction boundary. For the HLT TDCALL case, it will be handled in a future patch when HLT TDCALL is supported. 
Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014757.897978-2-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle TDX PV MMIO hypercall	Sean Christopherson
	Handle TDX PV MMIO hypercall when TDX guest calls TDVMCALL with the leaf #VE.RequestMMIO (same value as EXIT_REASON_EPT_VIOLATION) according to TDX Guest Host Communication Interface (GHCI) spec. For TDX guests, VMM is not allowed to access vCPU registers and the private memory, and the code instructions must be fetched from the private memory. So MMIO emulation implemented for non-TDX VMs is not possible for TDX guests. In TDX the MMIO regions are instead configured by VMM to trigger a #VE exception in the guest. The #VE handling is supposed to emulate the MMIO instruction inside the guest and convert it into a TDVMCALL with the leaf #VE.RequestMMIO, which equals to EXIT_REASON_EPT_VIOLATION. The requested MMIO address must be in shared GPA space. The shared bit is stripped after check because the existing code for MMIO emulation is not aware of the shared bit. The MMIO GPA shouldn't have a valid memslot, also the attribute of the GPA should be shared. KVM could do the checks before exiting to userspace, however, even if KVM does the check, there still will be race conditions between the check in KVM and the emulation of MMIO access in userspace due to a memslot hotplug, or a memory attribute conversion. If userspace doesn't check the attribute of the GPA and the attribute happens to be private, it will not pose a security risk or cause an MCE, but it can lead to another issue. E.g., in QEMU, treating a GPA with private attribute as shared when it falls within RAM's range can result in extra memory consumption during the emulation to the access to the HVA of the GPA. There are two options: 1) Do the check both in KVM and userspace. 2) Do the check only in QEMU. This patch chooses option 2, i.e. KVM omits the memslot and attribute checks, and expects userspace to do the checks. Similar to normal MMIO emulation, try to handle the MMIO in kernel first, if kernel can't support it, forward the request to userspace. Export needed symbols used for MMIO handling. 
Fragments handling is not needed for TDX PV MMIO because GPA is provided, if a MMIO access crosses page boundary, it should be continuous in GPA. Also, the size is limited to 1, 2, 4, 8 bytes. No further split needed. Allow cross page access because no extra handling needed after checking both start and end GPA are shared GPAs. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014225.897298-10-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
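The checks described above (size limited to 1, 2, 4, or 8 bytes, both start and end GPA shared, shared bit stripped afterwards) can be sketched roughly as follows. This is a standalone illustration, not the kernel code: the helper names are hypothetical, and the shared bit position is assumed to be bit 47, whereas the real position depends on the TD's GPAW.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: shared bit at GPA bit 47; the real position depends on GPAW. */
#define SHARED_GPA_BIT (UINT64_C(1) << 47)

/* TDX PV MMIO accesses are limited to 1, 2, 4 or 8 bytes. */
static bool mmio_size_valid(unsigned int size)
{
	return size == 1 || size == 2 || size == 4 || size == 8;
}

/*
 * Hypothetical helper: validate a PV MMIO request and hand back the GPA
 * with the shared bit stripped. Returns false if the request is rejected.
 */
static bool mmio_request_to_plain_gpa(uint64_t gpa, unsigned int size,
				      uint64_t *plain_gpa)
{
	assert(plain_gpa != NULL);

	if (!mmio_size_valid(size))
		return false;

	/* A cross-page access is fine as long as both ends are shared. */
	uint64_t last = gpa + size - 1;
	if (!(gpa & SHARED_GPA_BIT) || !(last & SHARED_GPA_BIT))
		return false;

	/* Existing MMIO emulation is not aware of the shared bit. */
	*plain_gpa = gpa & ~SHARED_GPA_BIT;
	return true;
}
```

Note that the sketch deliberately performs no memslot or memory-attribute check, matching option 2 in the commit message, where those checks are left to userspace.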
2025-03-14	KVM: TDX: Handle TDX PV port I/O hypercall	Isaku Yamahata
	Emulate port I/O requested by TDX guest via TDVMCALL with leaf Instruction.IO (same value as EXIT_REASON_IO_INSTRUCTION) according to TDX Guest Host Communication Interface (GHCI). All port I/O instructions inside the TDX guest trigger the #VE exception. On #VE triggered by I/O instructions, TDX guest can call TDVMCALL with leaf Instruction.IO to request VMM to emulate I/O instructions. Similar to normal port I/O emulation, try to handle the port I/O in kernel first, if kernel can't support it, forward the request to userspace. Note string I/O operations are not supported in TDX. Guest should unroll them before calling the TDVMCALL. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250222014225.897298-9-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError>	Binbin Wu
	Convert TDG.VP.VMCALL<ReportFatalError> to KVM_EXIT_SYSTEM_EVENT with a new type KVM_SYSTEM_EVENT_TDX_FATAL and forward it to userspace for handling. TD guest can use TDG.VP.VMCALL<ReportFatalError> to report the fatal error it has experienced. This hypercall is special because TD guest is requesting a termination with the error information, KVM needs to forward the hypercall to userspace anyway, KVM doesn't do parsing or conversion, it just dumps the 16 general-purpose registers to userspace and let userspace decide what to do. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014225.897298-8-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle TDG.VP.VMCALL<MapGPA>	Binbin Wu
	Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE and forward it to userspace for handling. MapGPA is used by TDX guest to request to map a GPA range as private or shared memory. It needs to exit to userspace for handling. KVM has already implemented a similar hypercall KVM_HC_MAP_GPA_RANGE, which will exit to userspace with exit reason KVM_EXIT_HYPERCALL. Do sanity checks, convert TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE and forward the request to userspace. To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA range is split into 2MB chunks and check interrupt pending between chunks. This allows for timely injection of interrupts and prevents issues with guest lockup detection. TDX guest should retry the operation for the GPA starting at the address specified in R11 when the TDVMCALL return TDVMCALL_RETRY as status code. Note userspace needs to enable KVM_CAP_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE bit set for TD VM. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014225.897298-7-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
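The chunking behaviour described above might look roughly like the sketch below. It is a simplified userspace model, not the KVM implementation: `map_gpa_chunks()`, the callback type, and the countdown helper are hypothetical names, and the actual exit to userspace for each chunk is reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAP_GPA_CHUNK (UINT64_C(2) << 20)	/* 2MB per chunk, per the commit */

/* Hypothetical hook standing in for "is an interrupt pending?". */
typedef bool (*irq_pending_fn)(void *ctx);

/* Demo hook: reports an interrupt once the countdown in *ctx reaches zero. */
static bool pending_after_countdown(void *ctx)
{
	int *remaining = ctx;

	return --(*remaining) <= 0;
}

/*
 * Process [gpa, gpa + size) in 2MB chunks. Returns the first GPA that was
 * not processed; *retry is set when the loop stopped early, which models
 * returning TDVMCALL_RETRY with the resume address in R11.
 */
static uint64_t map_gpa_chunks(uint64_t gpa, uint64_t size,
			       irq_pending_fn pending, void *ctx, bool *retry)
{
	uint64_t end = gpa + size;

	assert(pending && retry);
	*retry = false;

	while (gpa < end) {
		uint64_t len = end - gpa;

		if (len > MAP_GPA_CHUNK)
			len = MAP_GPA_CHUNK;

		/* Real code would exit to userspace with KVM_HC_MAP_GPA_RANGE here. */
		gpa += len;

		if (gpa < end && pending(ctx)) {
			*retry = true;
			break;
		}
	}
	return gpa;
}
```

A guest following this protocol would simply reissue the hypercall starting at the returned address until no retry is requested.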
2025-03-14	KVM: TDX: Handle KVM hypercall with TDG.VP.VMCALL	Isaku Yamahata
	Handle KVM hypercall for TDX according to TDX Guest-Host Communication Interface (GHCI) specification. The TDX GHCI specification defines the ABI for the guest TD to issue hypercalls. When R10 is non-zero, it indicates the TDG.VP.VMCALL is vendor-specific. KVM uses R10 as KVM hypercall number and R11-R14 as 4 arguments, while the error code is returned in R10. Morph the TDG.VP.VMCALL with KVM hypercall to EXIT_REASON_VMCALL and marshall r10~r14 from vp_enter_args to the appropriate x86 registers for KVM hypercall handling. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250222014225.897298-6-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
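As a rough illustration of the marshalling described above, the sketch below copies R10 to R14 into the registers used by the regular KVM hypercall ABI (number in RAX, arguments in RBX, RCX, RDX, RSI) and writes the result back to R10. The struct and function names are hypothetical; the real code operates on vp_enter_args and the vCPU register cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the TDVMCALL registers relevant to a KVM hypercall. */
struct tdvmcall_args {
	uint64_t r10, r11, r12, r13, r14;
};

/* Registers of the regular KVM hypercall ABI on x86. */
struct kvm_hypercall_regs {
	uint64_t rax, rbx, rcx, rdx, rsi;
};

/* Returns false when R10 is zero, i.e. the call is not vendor-specific. */
static bool marshal_kvm_hypercall(const struct tdvmcall_args *in,
				  struct kvm_hypercall_regs *out)
{
	assert(in && out);

	if (in->r10 == 0)
		return false;

	out->rax = in->r10;	/* hypercall number */
	out->rbx = in->r11;	/* argument 0 */
	out->rcx = in->r12;	/* argument 1 */
	out->rdx = in->r13;	/* argument 2 */
	out->rsi = in->r14;	/* argument 3 */
	return true;
}

/* The hypercall's return/error code goes back to the guest in R10. */
static void complete_kvm_hypercall(struct tdvmcall_args *args, uint64_t ret)
{
	assert(args);
	args->r10 = ret;
}
```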
2025-03-14	KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)	Isaku Yamahata
	Add a place holder and related helper functions for preparation of TDG.VP.VMCALL handling. The TDX module specification defines TDG.VP.VMCALL API (TDVMCALL for short) for the guest TD to call hypercall to VMM. When the guest TD issues a TDVMCALL, the guest TD exits to VMM with a new exit reason. The arguments from the guest TD and returned values from the VMM are passed in the guest registers. The guest RCX register indicates which registers are used. Define helper functions to access those registers. A new VMX exit reason TDCALL is added to indicate the exit is due to TDVMCALL from the guest TD. Define the TDCALL exit reason and add a place holder to handle such exit. Some leafs of TDCALL will be morphed to another VMX exit reason instead of EXIT_REASON_TDCALL, add a helper tdcall_to_vmx_exit_reason() as a place holder to do the conversion. Suggested-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Message-ID: <20250222014225.897298-5-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
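The idea behind tdcall_to_vmx_exit_reason() can be sketched as below. This is a standalone approximation, not the kernel helper: the struct is hypothetical, it assumes the GHCI convention that R10 selects standard versus vendor-specific calls and R11 carries the leaf, and it only lists leaves that neighbouring commits describe as sharing their value with a VMX exit reason. The exit reason numbers are the VMX basic exit reasons.

```c
#include <assert.h>
#include <stdint.h>

/* VMX basic exit reasons; TDCALL is the exit reason added for TDX. */
#define EXIT_REASON_CPUID		10
#define EXIT_REASON_HLT			12
#define EXIT_REASON_VMCALL		18
#define EXIT_REASON_IO_INSTRUCTION	30
#define EXIT_REASON_TDCALL		77

/* Hypothetical view of the two registers that select the handler. */
struct tdvmcall_sel {
	uint64_t r10;	/* non-zero: vendor-specific (KVM) hypercall */
	uint64_t r11;	/* GHCI leaf when r10 == 0 (assumption) */
};

static uint32_t tdcall_to_exit_reason_sketch(const struct tdvmcall_sel *sel)
{
	assert(sel);

	/* Vendor-specific calls are treated as ordinary KVM hypercalls. */
	if (sel->r10 != 0)
		return EXIT_REASON_VMCALL;

	switch (sel->r11) {
	case EXIT_REASON_CPUID:
	case EXIT_REASON_HLT:
	case EXIT_REASON_IO_INSTRUCTION:
		/* Leaf value equals the VMX exit reason: reuse VMX handling. */
		return (uint32_t)sel->r11;
	default:
		/* Everything else stays a TDCALL exit. */
		return EXIT_REASON_TDCALL;
	}
}
```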
2025-03-14	KVM: TDX: Add a place holder to handle TDX VM exit	Isaku Yamahata
	Introduce the wiring for handling TDX VM exits by implementing the callbacks .get_exit_info(), .get_entry_info(), and .handle_exit(). Additionally, add error handling during the TDX VM exit flow, and add a place holder to handle various exit reasons. Store VMX exit reason and exit qualification in struct vcpu_vt for TDX, so that TDX/VMX can use the same helpers to get exit reason and exit qualification. Store extended exit qualification and exit GPA info in struct vcpu_tdx because they are used by TDX code only. Contention Handling: The TDH.VP.ENTER operation may contend with TDH.MEM.* operations due to secure EPT or TD EPOCH. If the contention occurs, the return value will have TDX_OPERAND_BUSY set, prompting the vCPU to attempt re-entry into the guest with EXIT_FASTPATH_EXIT_HANDLED, not EXIT_FASTPATH_REENTER_GUEST, so that the interrupts pending during IN_GUEST_MODE can be delivered for sure. Otherwise, the requester of KVM_REQ_OUTSIDE_GUEST_MODE may be blocked endlessly. Error Handling: - TDX_SW_ERROR: This includes #UD caused by SEAMCALL instruction if the CPU isn't in VMX operation, #GP caused by SEAMCALL instruction when TDX isn't enabled by the BIOS, and TDX_SEAMCALL_VMFAILINVALID when SEAM firmware is not loaded or disabled. - TDX_ERROR: This indicates some check failed in the TDX module, preventing the vCPU from running. - Failed VM Entry: Exit to userspace with KVM_EXIT_FAIL_ENTRY. Handle it separately before handling TDX_NON_RECOVERABLE because when off-TD debug is not enabled, TDX_NON_RECOVERABLE is set. - TDX_NON_RECOVERABLE: Set by the TDX module when the error is non-recoverable, indicating that the TDX guest is dead or the vCPU is disabled. A special case is triple fault, which also sets TDX_NON_RECOVERABLE but exits to userspace with KVM_EXIT_SHUTDOWN, aligning with the VMX case. - Any unhandled VM exit reason will also return to userspace with KVM_EXIT_INTERNAL_ERROR. 
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Message-ID: <20250222014225.897298-4-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior	Isaku Yamahata
	Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest DRs. TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter, and resets DRs to architectural INIT state on TD exit. Use the new flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to save/restore guest DRs. KVM still needs to restore host DRs after TD exit if there are active breakpoints in the host, which is covered by the existing code. MOV-DR exiting is always cleared for TDX guests, so the handler for DR access is never called, and KVM_DEBUGREG_WONT_EXIT is never set. Add a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH are set. Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT(). Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: rework changelog] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20241210004946.3718496-2-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250129095902.16391-13-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Save and restore IA32_DEBUGCTL	Adrian Hunter
	Save the IA32_DEBUGCTL MSR before entering a TDX VCPU and restore it afterwards. The TDX Module preserves bits 1, 12, and 14, so if no other bits are set, no restore is done. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250129095902.16391-12-adrian.hunter@intel.com> Reviewed-by: Xiayao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
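The restore condition described above reduces to a mask test. A minimal sketch, with a hypothetical function name and the preserved bit positions (1, 12, 14) taken from the commit message:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits of IA32_DEBUGCTL that the TDX module preserves across TD entry/exit. */
#define DEBUGCTL_PRESERVED \
	((UINT64_C(1) << 1) | (UINT64_C(1) << 12) | (UINT64_C(1) << 14))

/*
 * Hypothetical helper: the saved host value only has to be written back
 * when it has bits set outside the preserved set.
 */
static bool debugctl_needs_restore(uint64_t host_debugctl)
{
	return (host_debugctl & ~DEBUGCTL_PRESERVED) != 0;
}
```

For example, a host value with only bits 1 and 12 set needs no restore, while a value with bit 0 set does.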
2025-03-14	KVM: TDX: Disable support for TSX and WAITPKG	Adrian Hunter
	Support for restoring IA32_TSX_CTRL MSR and IA32_UMWAIT_CONTROL MSR is not yet implemented, so disable support for TSX and WAITPKG for now. Clear the associated CPUID bits returned by KVM_TDX_CAPABILITIES, and return an error if those bits are set in KVM_TDX_INIT_VM. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250129095902.16391-11-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: restore user ret MSRs	Isaku Yamahata
	Several MSRs are clobbered on TD exit that are not used by Linux while in ring 0. Ensure the cached value of the MSR is updated on vcpu_put, and the MSRs themselves before returning to ring 3. Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250129095902.16391-10-adrian.hunter@intel.com> Reviewed-by: Xiayao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: restore host xsave state when exit from the guest TD	Isaku Yamahata
	On exiting from the guest TD, xsave state is clobbered; restore it. Do not use kvm_load_host_xsave_state(), as it relies on vcpu->arch to find out whether other KVM_RUN code has loaded guest state into XCR0/PKRU/XSS or not. In the case of TDX, the exit values are known independent of the guest CR0 and CR4, and in fact the latter are not available. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250129095902.16391-8-adrian.hunter@intel.com> [Rewrite to not use kvm_load_host_xsave_state. - Paolo] Reviewed-by: Xiayao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: vcpu_run: save/restore host state(host kernel gs)	Isaku Yamahata
	On entering/exiting TDX vcpu, preserved or clobbered CPU state is different from the VMX case. Add TDX hooks to save/restore host/guest CPU state. Save/restore kernel GS base MSR. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250129095902.16391-7-adrian.hunter@intel.com> Reviewed-by: Xiayao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Implement TDX vcpu enter/exit path	Isaku Yamahata
	Implement callbacks to enter/exit a TDX VCPU by calling tdh_vp_enter(). Ensure the TDX VCPU is in a correct state to run. Do not pass arguments from/to vcpu->arch.regs[] unconditionally. Instead, marshall state to/from the appropriate x86 registers only when needed, i.e., to handle some TDVMCALL sub-leaves following KVM's ABI to leverage the existing code. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250129095902.16391-6-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: VMX: Move common fields of struct vcpu_{vmx,tdx} to a struct	Binbin Wu
	Move common fields of struct vcpu_vmx and struct vcpu_tdx to struct vcpu_vt, to share the code between VMX/TDX as much as possible and to make TDX exit handling more VMX like. No functional change intended. [Adrian: move code that depends on struct vcpu_vmx back to vmx.h] Suggested-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/Z1suNzg2Or743a7e@google.com Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20250129095902.16391-5-adrian.hunter@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Handle SEPT zap error due to page add error in premap	Yan Zhao
	Move the handling of SEPT zap errors caused by unsuccessful execution of tdh_mem_page_add() in KVM_TDX_INIT_MEM_REGION from tdx_sept_drop_private_spte() to tdx_sept_zap_private_spte(). Introduce a new helper function tdx_is_sept_zap_err_due_to_premap() to detect this specific error. During the IOCTL KVM_TDX_INIT_MEM_REGION, KVM premaps leaf SPTEs in the mirror page table before the corresponding entry in the private page table is successfully installed by tdh_mem_page_add(). If an error occurs during the invocation of tdh_mem_page_add(), a mismatch between the mirror and private page tables results in SEAMCALLs for SEPT zap returning the error code TDX_EPT_ENTRY_STATE_INCORRECT. The error TDX_EPT_WALK_FAILED is not possible because, during KVM_TDX_INIT_MEM_REGION, KVM only premaps leaf SPTEs after successfully mapping non-leaf SPTEs. Unlike leaf SPTEs, there is no mismatch in non-leaf PTEs between the mirror and private page tables. Therefore, during zap, SEAMCALLs should find an empty leaf entry in the private EPT, leading to the error TDX_EPT_ENTRY_STATE_INCORRECT instead of TDX_EPT_WALK_FAILED. Since tdh_mem_range_block() is always invoked before tdh_mem_page_remove(), move the handling of SEPT zap errors from tdx_sept_drop_private_spte() to tdx_sept_zap_private_spte(). In tdx_sept_zap_private_spte(), return 0 for errors due to premap to skip executing other SEAMCALLs for zap, which are unnecessary. Return 1 to indicate no other errors, allowing the execution of other zap SEAMCALLs to continue. The failure of tdh_mem_page_add() is uncommon and has not been observed in real workloads. Currently, this failure is only hypothetically triggered by skipping the real SEAMCALL and faking the add error in the SEAMCALL wrapper. Additionally, without this fix, there will be no host crashes or other severe issues. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250217085642.19696-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Skip updating CPU dirty logging request for TDs	Paolo Bonzini
	Wrap vmx_update_cpu_dirty_logging so as to ignore requests to update CPU dirty logging for TDs, as basic TDX does not support the PML feature. Invoking vmx_update_cpu_dirty_logging() for TDs would cause an incorrect access to a kvm_vmx struct for a TDX VM, so block that before it happens. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: x86: Make cpu_dirty_log_size a per-VM value	Yan Zhao
	Make cpu_dirty_log_size (CPU's dirty log buffer size) a per-VM value and set the per-VM cpu_dirty_log_size only for normal VMs when PML is enabled. Do not set it for TDs. Until now, cpu_dirty_log_size was a system-wide value that was used for all VMs and was set to the PML buffer size when PML was enabled in VMX. However, PML is not currently supported for TDs, though PML remains available for normal VMs as long as the feature is supported by hardware and enabled in VMX. Making cpu_dirty_log_size a per-VM value allows it to be the PML buffer size for normal VMs and 0 for TDs. This allows functions like kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to determine if PML is supported, in order to kick off vCPUs or request them to update CPU dirty logging status (turn on/off PML in VMCS). This fixes an issue first reported in [1], where QEMU attaches an emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES still works if the corresponding memslot has no flag KVM_MEM_GUEST_MEMFD. KVM then invokes kvm_mmu_update_cpu_dirty_logging() and from there vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx struct for a TDX VM. Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com> Reported-by: Pedro Principeza <pedro.principeza@canonical.com> Reported-by: Farrah Chen <farrah.chen@intel.com> Closes: https://github.com/canonical/tdx/issues/202 Link: https://github.com/canonical/tdx/issues/202 [1] Suggested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
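The per-VM value can be pictured with the small model below. It is only a sketch: the types and function names are hypothetical, and the buffer size of 512 entries is an assumption about the VMX PML buffer rather than something stated in the commit.

```c
#include <assert.h>
#include <stdbool.h>

#define PML_ENTRIES 512	/* assumption: entries in one VMX PML buffer */

enum vm_type { VM_NORMAL, VM_TD };

/* Hypothetical, heavily reduced stand-in for the per-VM arch state. */
struct vm {
	enum vm_type type;
	unsigned int cpu_dirty_log_size;	/* 0 means "no CPU dirty log" */
};

/* Set the per-VM size: PML buffer size for normal VMs, 0 for TDs. */
static void vm_init_dirty_log_size(struct vm *vm, bool pml_enabled)
{
	assert(vm);
	vm->cpu_dirty_log_size =
		(vm->type == VM_NORMAL && pml_enabled) ? PML_ENTRIES : 0;
}

/* Callers consult the per-VM value before touching any PML state. */
static bool vm_uses_cpu_dirty_log(const struct vm *vm)
{
	assert(vm);
	return vm->cpu_dirty_log_size != 0;
}
```

With a check like this in the common path, a request to update CPU dirty logging is never forwarded to VMX-specific code for a TD.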
2025-03-14	KVM: TDX: Handle vCPU dissociation	Isaku Yamahata
	Handle vCPU dissociations by invoking SEAMCALL TDH.VP.FLUSH which flushes the address translation caches and cached TD VMCS of a TD vCPU in its associated pCPU. In TDX, a vCPU can only be associated with one pCPU at a time, which is done by invoking SEAMCALL TDH.VP.ENTER. For a successful association, the vCPU must be dissociated from its previously associated pCPU. To facilitate vCPU dissociation, introduce a per-pCPU list associated_tdvcpus. Add a vCPU into this list when it's loaded into a new pCPU (i.e. when a vCPU is loaded for the first time or migrated to a new pCPU). vCPU dissociations can happen under the conditions below: - When the op hardware_disable is called. This op is called when virtualization is disabled on a given pCPU, e.g. when hot-unplugging a pCPU or on machine shutdown/suspend. In this case, dissociate all vCPUs from the pCPU by iterating its per-pCPU list associated_tdvcpus. - On vCPU migration to a new pCPU. Before adding a vCPU into the associated_tdvcpus list of the new pCPU, dissociation from its old pCPU is required, which is performed by issuing an IPI and executing SEAMCALL TDH.VP.FLUSH on the old pCPU. On a successful dissociation, the vCPU will be removed from the associated_tdvcpus list of its previously associated pCPU. - When tdx_mmu_release_hkid() is called. TDX mandates that all vCPUs must be dissociated prior to the release of an hkid. Therefore, dissociation of all vCPUs is a must before executing the SEAMCALL TDH.MNG.VPFLUSHDONE and subsequently freeing the hkid. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241112073858.22312-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
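The bookkeeping around the per-pCPU list can be modelled as in the sketch below. It is a single-threaded toy, not the kernel code: the struct, the fixed pCPU count, and the function names are hypothetical, and the IPI plus TDH.VP.FLUSH step is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PCPUS 4	/* hypothetical number of physical CPUs */

struct td_vcpu {
	int cpu;		/* associated pCPU, or -1 if none */
	struct td_vcpu *next;	/* link in that pCPU's list */
};

/* One list head per pCPU, modelling associated_tdvcpus. */
static struct td_vcpu *associated_tdvcpus[NR_PCPUS];

/* Unlink the vCPU from its pCPU's list; stands in for TDH.VP.FLUSH. */
static void vcpu_dissociate(struct td_vcpu *v)
{
	if (v->cpu < 0)
		return;

	struct td_vcpu **pp = &associated_tdvcpus[v->cpu];

	while (*pp && *pp != v)
		pp = &(*pp)->next;
	assert(*pp == v);

	/* Real code: IPI to the old pCPU, which executes TDH.VP.FLUSH. */
	*pp = v->next;
	v->next = NULL;
	v->cpu = -1;
}

/* Loading on a new pCPU requires dissociating from the old one first. */
static void vcpu_load(struct td_vcpu *v, int cpu)
{
	assert(cpu >= 0 && cpu < NR_PCPUS);
	if (v->cpu == cpu)
		return;

	vcpu_dissociate(v);
	v->next = associated_tdvcpus[cpu];
	associated_tdvcpus[cpu] = v;
	v->cpu = cpu;
}

/* Disabling virtualization on a pCPU dissociates every vCPU on its list. */
static void pcpu_disable(int cpu)
{
	assert(cpu >= 0 && cpu < NR_PCPUS);
	while (associated_tdvcpus[cpu])
		vcpu_dissociate(associated_tdvcpus[cpu]);
}
```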
2025-03-14	KVM: TDX: Finalize VM initialization	Isaku Yamahata
	Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand, KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization. Documentation for the API is added in another patch: "Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)" For the purpose of attestation, a measurement must be made of the TDX VM initial state. This is referred to as TD Measurement Finalization, and uses SEAMCALL TDH.MR.FINALIZE, after which: 1. The VMM adding TD private pages with arbitrary content is no longer allowed 2. The TDX VM is runnable Co-developed-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Message-ID: <20240904030751.117579-21-rick.p.edgecombe@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Add an ioctl to create initial guest memory	Isaku Yamahata
	Add a new ioctl for the user space VMM to initialize guest memory with the specified memory contents. Because TDX protects the guest's memory, the creation of the initial guest memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead of directly copying the memory contents into the guest's memory in the case of the default VM type. Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped KVM_MEMORY_ENCRYPT_OP. Check if the GFN is already pre-allocated, assign the guest page in Secure-EPT, copy the initial memory contents into the guest memory, and encrypt the guest memory. Optionally, extend the memory measurement of the TDX guest. The ioctl uses the vCPU file descriptor because of the TDX module's requirement that the memory is added to the S-EPT (via TDH.MEM.SEPT.ADD) prior to initialization (TDH.MEM.PAGE.ADD). Accessing the MMU in turn requires a vCPU file descriptor, just like for KVM_PRE_FAULT_MEMORY. In fact, the post-populate callback is able to reuse the same logic used by KVM_PRE_FAULT_MEMORY, so that userspace can do everything with a single ioctl. Note that this is the only way to invoke TDH.MEM.SEPT.ADD before the TD is finalized, as userspace cannot use KVM_PRE_FAULT_MEMORY at that point. This ensures that there cannot be pages in the S-EPT awaiting TDH.MEM.PAGE.ADD, which would be treated incorrectly as spurious by tdp_mmu_map_handle_target_level() (KVM would see the SPTE as PRESENT, but the corresponding S-EPT entry will be !PRESENT). 
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> --- - KVM_BUG_ON() for kvm_tdx->nr_premapped (Paolo) - Use tdx_operand_busy() - Merge first patch in SEPT SEAMCALL retry series in to this base (Paolo) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Implement hook to get max mapping level of private pages	Isaku Yamahata
	Implement hook private_max_mapping_level for TDX to let TDP MMU core get max mapping level of private pages. The value is hard coded to 4K for no huge page support for now. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241112073816.22256-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table	Isaku Yamahata
	Implement hooks in TDX to propagate changes of mirror page table to private EPT, including changes for page table page adding/removing, guest page adding/removing. TDX invokes corresponding SEAMCALLs in the hooks. - Hook link_external_spt propagates adding page table page into private EPT. - Hook set_external_spte tdx_sept_set_private_spte() in this patch only handles adding of guest private page when TD is finalized. Later patches will handle the case of adding guest private pages before TD finalization. - Hook free_external_spt It is invoked when page table page is removed in mirror page table, which currently must occur at TD tear down phase, after hkid is freed. - Hook remove_external_spte It is invoked when guest private page is removed in mirror page table, which can occur when TD is active, e.g. during shared <-> private conversion and slot move/deletion. This hook is ensured to be triggered before hkid is freed, because gmem fd is released along with all private leaf mappings zapped before freeing hkid at VM destroy. TDX invokes below SEAMCALLs sequentially: 1) TDH.MEM.RANGE.BLOCK (remove RWX bits from a private EPT entry), 2) TDH.MEM.TRACK (increases TD epoch) 3) TDH.MEM.PAGE.REMOVE (remove the private EPT entry and untrack the guest page). TDH.MEM.PAGE.REMOVE can't succeed without TDH.MEM.RANGE.BLOCK and TDH.MEM.TRACK being called successfully. SEAMCALL TDH.MEM.TRACK is called in function tdx_track() to enforce that TLB tracking will be performed by TDX module for private EPT. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> --- - Remove TDX_ERROR_SEPT_BUSY and Add tdx_operand_busy() helper (Binbin) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
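The ordering constraint on page removal (BLOCK, then TRACK, then REMOVE) can be expressed as a small state machine. The sketch below is illustrative only: the types and functions are hypothetical stand-ins for the SEAMCALL wrappers, and it tracks the state per page even though TDH.MEM.TRACK really advances a TD-wide epoch.

```c
#include <assert.h>

/* Hypothetical per-page state used only to model the required ordering. */
enum sept_state { SEPT_MAPPED, SEPT_BLOCKED, SEPT_TRACKED, SEPT_REMOVED };

struct private_page {
	enum sept_state state;
};

/* Stand-in for TDH.MEM.RANGE.BLOCK: drop RWX from the private EPT entry. */
static int sept_block(struct private_page *p)
{
	if (p->state != SEPT_MAPPED)
		return -1;
	p->state = SEPT_BLOCKED;
	return 0;
}

/* Stand-in for TDH.MEM.TRACK: only meaningful after a successful block. */
static int sept_track(struct private_page *p)
{
	if (p->state != SEPT_BLOCKED)
		return -1;
	p->state = SEPT_TRACKED;
	return 0;
}

/* Stand-in for TDH.MEM.PAGE.REMOVE: fails unless block and track succeeded. */
static int sept_remove(struct private_page *p)
{
	if (p->state != SEPT_TRACKED)
		return -1;
	p->state = SEPT_REMOVED;
	return 0;
}

/* Issue the three steps in order and stop at the first failure. */
static int drop_private_page(struct private_page *p)
{
	int ret;

	assert(p);
	ret = sept_block(p);
	if (!ret)
		ret = sept_track(p);
	if (!ret)
		ret = sept_remove(p);
	return ret;
}
```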
2025-03-14	KVM: TDX: Handle TLB tracking for TDX	Isaku Yamahata
	Handle TLB tracking for TDX by introducing function tdx_track() for private memory TLB tracking and implementing flush_tlb* hooks to flush TLBs for shared memory. Introduce function tdx_track() to do TLB tracking on private memory, which basically does two things: calling TDH.MEM.TRACK to increase TD epoch and kicking off all vCPUs. The private EPT will then be flushed when each vCPU re-enters the TD. This function is unused temporarily in this patch and will be called on a page-by-page basis on removal of private guest page in a later patch. In earlier revisions, tdx_track() relied on an atomic counter to coordinate the synchronization between the actions of kicking off vCPUs, incrementing the TD epoch, and the vCPUs waiting for the incremented TD epoch after being kicked off. However, the core MMU only actually needs to call tdx_track() while already under a write mmu_lock. So this synchronization can be made to be unneeded. vCPUs are kicked off only after the successful execution of TDH.MEM.TRACK, eliminating the need for vCPUs to wait for TDH.MEM.TRACK completion after being kicked off. tdx_track() is therefore able to send requests KVM_REQ_OUTSIDE_GUEST_MODE rather than KVM_REQ_TLB_FLUSH. Hooks for flush_remote_tlb and flush_remote_tlbs_range are not necessary for TDX, as tdx_track() will handle TLB tracking of private memory on page-by-page basis when private guest pages are removed. There is no need to invoke tdx_track() again in kvm_flush_remote_tlbs() even after changes to the mirrored page table. For hooks flush_tlb_current and flush_tlb_all, which are invoked during kvm_mmu_load() and vcpu load for normal VMs, let the VMM flush all EPTs in the two hooks for simplicity, since TDX does not depend on the two hooks to notify TDX module to flush private EPT in those cases. 
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241112073753.22228-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Set per-VM shadow_mmio_value to 0	Isaku Yamahata
	Set per-VM shadow_mmio_value to 0 for TDX. With enable_mmio_caching on, KVM installs MMIO SPTEs for TDs. To correctly configure MMIO SPTEs, TDX requires the per-VM shadow_mmio_value to be set to 0. This is necessary to override the default value of the suppress VE bit in the SPTE, which is 1, and to ensure value 0 in RWX bits. For MMIO SPTE, the spte value changes as follows: 1. initial value (suppress VE bit is set) 2. Guest issues MMIO and triggers EPT violation 3. KVM updates SPTE value to MMIO value (suppress VE bit is cleared) 4. Guest MMIO resumes. It triggers VE exception in guest TD 5. Guest VE handler issues TDG.VP.VMCALL<MMIO> 6. KVM handles MMIO 7. Guest VE handler resumes its execution after MMIO instruction Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241112073743.22214-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Require TDP MMU, mmio caching and EPT A/D bits for TDX	Isaku Yamahata
	Disable TDX support when TDP MMU or mmio caching or EPT A/D bits aren't supported. As the TDP MMU is becoming more mainstream than the legacy MMU, the legacy MMU support for TDX isn't implemented. TDX requires KVM mmio caching. Without mmio caching, KVM will go to MMIO emulation without installing SPTEs for MMIOs. However, TDX guest is protected and KVM would meet errors when trying to emulate MMIOs for TDX guest during instruction decoding. So, TDX guest relies on SPTEs being installed for MMIOs, which are with no RWX bits and with VE suppress bit unset, to inject VE to TDX guest. The TDX guest would then issue TDVMCALL in the VE handler to perform instruction decoding and have host do MMIO emulation. TDX also relies on EPT A/D bits as EPT A/D bits have been supported in all CPUs since Haswell. Relying on it can avoid RWX bits being masked out in the mirror page table for prefaulted entries. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> --- Requested by Sean at [1]. [1] https://lore.kernel.org/kvm/Zva4aORxE9ljlMNe@google.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14	KVM: TDX: Set gfn_direct_bits to shared bit	Isaku Yamahata
	Make the direct root handle memslot GFNs at an alias with the TDX shared bit set. For TDX shared memory, the memslot GFNs need to be mapped at an alias with the shared bit set. These shared mappings will be mapped on the KVM MMU's "direct" root. The direct root has its mappings shifted by applying "gfn_direct_bits" as a mask. The concept of "GPAW" (guest physical address width) determines the location of the shared bit. So set gfn_direct_bits based on this, to map shared memory at the proper GPA. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241112073613.22100-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
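How the GPAW might translate into a GFN mask can be sketched as follows. This is a standalone approximation under stated assumptions: the shared bit is taken to be the top bit of the guest physical address width (GPA bit 47 for GPAW 48, bit 51 for GPAW 52), pages are 4KB, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12	/* 4KB pages: GFN = GPA >> 12 */

/* Assumption: the shared bit is GPA bit (GPAW - 1), expressed here as a GFN bit. */
static uint64_t shared_gfn_bit(unsigned int gpaw)
{
	assert(gpaw == 48 || gpaw == 52);
	return (UINT64_C(1) << (gpaw - 1)) >> PAGE_SHIFT_4K;
}

/* Map a memslot GFN to its alias on the direct root, with the shared bit set. */
static uint64_t gfn_to_direct_alias(uint64_t gfn, unsigned int gpaw)
{
	return gfn | shared_gfn_bit(gpaw);
}

/* Strip the alias bit again to recover the memslot GFN. */
static uint64_t direct_alias_to_gfn(uint64_t alias, unsigned int gpaw)
{
	return alias & ~shared_gfn_bit(gpaw);
}
```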
2025-03-14	KVM: TDX: Add load_mmu_pgd method for TDX	Sean Christopherson
	TDX uses two EPT pointers, one for the private half of the GPA space and one for the shared half. The private half uses the normal EPT_POINTER VMCS field, which is managed in a special way by the TDX module. For TDX, KVM is not allowed to operate on it directly. The shared half uses a new SHARED_EPT_POINTER field and will be managed by the conventional MMU management operations that operate directly on the EPT root. This means that for TDX the .load_mmu_pgd() operation needs to use the SHARED_EPT_POINTER field instead of the normal one. Add a new wrapper in x86 ops for load_mmu_pgd() that directs the write to either the existing VMX implementation or a TDX one. tdx_load_mmu_pgd() is much simpler than vmx_load_mmu_pgd() because, in the TDX mode of operation, EPT is always used and KVM does not need to be involved in virtualizing CR3 behavior. So tdx_load_mmu_pgd() can simply write to SHARED_EPT_POINTER. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241112073601.22084-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
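	The wrapper-and-dispatch idea can be sketched with a toy model in which the two VMCS fields are plain array slots. Everything here is illustrative: `struct vcpu`, the `is_td` flag, the `fields` array, and the wrapper name `vt_load_mmu_pgd()` are assumptions for the example, and the real VMX path does considerably more than a single write.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the two fields named in the commit message. */
enum field { EPT_POINTER, SHARED_EPT_POINTER, NR_FIELDS };

/* Hypothetical vCPU model; the kernel's structures look nothing like this. */
struct vcpu {
    bool is_td;                 /* true when the vCPU belongs to a TDX guest */
    uint64_t fields[NR_FIELDS]; /* toy per-vCPU control fields */
};

static void vmx_load_mmu_pgd(struct vcpu *v, uint64_t root_hpa)
{
    /* The real VMX path also handles guest CR3 virtualization; omitted here. */
    v->fields[EPT_POINTER] = root_hpa;
}

static void tdx_load_mmu_pgd(struct vcpu *v, uint64_t root_hpa)
{
    /* For TDX only the shared half is KVM's to manage. */
    v->fields[SHARED_EPT_POINTER] = root_hpa;
}

/* Wrapper: pick the TDX or VMX implementation based on the vCPU type. */
static void vt_load_mmu_pgd(struct vcpu *v, uint64_t root_hpa)
{
    if (v->is_td)
        tdx_load_mmu_pgd(v, root_hpa);
    else
        vmx_load_mmu_pgd(v, root_hpa);
}
```

	In this model, a TDX vCPU never has its EPT_POINTER slot touched by the wrapper, which mirrors the rule that KVM may not operate on the private half directly.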
2025-03-14	KVM: TDX: Add accessors VMX VMCS helpers	Isaku Yamahata
	TDX defines SEAMCALL APIs to access TDX control structures corresponding to the VMX VMCS. Introduce helper accessors to hide its SEAMCALL ABI details. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241112073551.22070-1-yan.y.zhao@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>