summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-06-24KVM: nSVM: Add a comment to document why nNPT uses vmcb01, not vCPU stateSean Christopherson
Add a comment in the nested NPT initialization flow to call out that it intentionally uses vmcb01 instead current vCPU state to get the effective hCR4 and hEFER for L1's NPT context. Note, despite nSVM's efforts to handle the case where vCPU state doesn't reflect L1 state, the MMU may still do the wrong thing due to pulling state from the vCPU instead of the passed in CR0/CR4/EFER values. This will be addressed in future commits. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86: Fix sizes used to pass around CR0, CR4, and EFERSean Christopherson
When configuring KVM's MMU, pass CR0 and CR4 as unsigned longs, and EFER as a u64 in various flows (mostly MMU). Passing the params as u32s is functionally ok since all of the affected registers reserve bits 63:32 to zero (enforced by KVM), but it's technically wrong. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Rename unsync helper and update related commentsSean Christopherson
Rename mmu_need_write_protect() to mmu_try_to_unsync_pages() and update a variety of related, stale comments. Add several new comments to call out subtle details, e.g. that upper-level shadow pages are write-tracked, and that can_unsync is false iff KVM is in the process of synchronizing pages. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Drop the intermediate "transient" __kvm_sync_page()Sean Christopherson
Nove the kvm_unlink_unsync_page() call out of kvm_sync_page() and into it's sole caller, and fold __kvm_sync_page() into kvm_sync_page() since the latter becomes a pure pass-through. There really should be no reason for code to do a complete sync of a shadow page outside of the full kvm_mmu_sync_roots(), e.g. the one use case that creeped in turned out to be flawed and counter-productive. Drop the stale comment about @sp->gfn needing to be write-protected, as it directly contradicts the kvm_mmu_get_page() usage. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: comment on kvm_mmu_get_page's syncing of pagesSean Christopherson
Explain the usage of sync_page() in kvm_mmu_get_page(), which is subtle in how and why it differs from mmu_sync_children(). Signed-off-by: Sean Christopherson <seanjc@google.com> [Split out of a different patch by Sean. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: WARN and zap SP when sync'ing if MMU role mismatchesSean Christopherson
When synchronizing a shadow page, WARN and zap the page if its mmu role isn't compatible with the current MMU context, where "compatible" is an exact match sans the bits that have no meaning in the overall MMU context or will be explicitly overwritten during the sync. Many of the helpers used by sync_page() are specific to the current context, updating a SMM vs. non-SMM shadow page would use the wrong memslots, updating L1 vs. L2 PTEs might work but would be extremely bizaree, and so on and so forth. Drop the guard with respect to 8-byte vs. 4-byte PTEs in __kvm_sync_page(), it was made useless when kvm_mmu_get_page() stopped trying to sync shadow pages irrespective of the current MMU context. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Use MMU role to check for matching guest page sizesSean Christopherson
Originally, __kvm_sync_page used to check the cr4_pae bit in the role to avoid zapping 4-byte kvm_mmu_pages when guest page size are 8-byte or the other way round. However, in commit 47c42e6b4192 ("KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'", 2019-03-28) it was observed that this did not work for nested EPT, where the page table size would be 8 bytes even if CR4.PAE=0. (Note that the check still has to be done for nested *NPT*, so it is not possible to use tdp_enabled or similar). Therefore, a hack was introduced to identify nested EPT shadow pages and unconditionally call __kvm_sync_page() on them. However, it is possible to do without the hack to identify nested EPT shadow pages: if EPT is active, there will be no shadow pages in non-EPT format, and all of them will have gpte_is_8_bytes set to true; we can just check the MMU role directly, and the test will always be true. Even for non-EPT shadow MMUs, this test should really always be true now that __kvm_sync_page() is called if and only if the role is an exact match (kvm_mmu_get_page()) or is part of the current MMU context (kvm_mmu_sync_roots()). A future commit will convert the likely-pointless check into a meaningful WARN to enforce that the mmu_roles of the current context and the shadow page are compatible. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Unconditionally zap unsync SPs when creating >4k SP at GFNSean Christopherson
When creating a new upper-level shadow page, zap unsync shadow pages at the same target gfn instead of attempting to sync the pages. This fixes a bug where an unsync shadow page could be sync'd with an incompatible context, e.g. wrong smm, is_guest, etc... flags. In practice, the bug is relatively benign as sync_page() is all but guaranteed to fail its check that the guest's desired gfn (for the to-be-sync'd page) matches the current gfn associated with the shadow page. I.e. kvm_sync_page() would end up zapping the page anyways. Alternatively, __kvm_sync_page() could be modified to explicitly verify the mmu_role of the unsync shadow page is compatible with the current MMU context. But, except for this specific case, __kvm_sync_page() is called iff the page is compatible, e.g. the transient sync in kvm_mmu_get_page() requires an exact role match, and the call from kvm_sync_mmu_roots() is only synchronizing shadow pages from the current MMU (which better be compatible or KVM has problems). And as described above, attempting to sync shadow pages when creating an upper-level shadow page is unlikely to succeed, e.g. zero successful syncs were observed when running Linux guests despite over a million attempts. Fixes: 9f1a122f970d ("KVM: MMU: allow more page become unsync at getting sp time") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-10-seanjc@google.com> [Remove WARN_ON after __kvm_sync_page. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24Revert "KVM: MMU: record maximum physical address width in ↵Sean Christopherson
kvm_mmu_extended_role" Drop MAXPHYADDR from mmu_role now that all MMUs have their role invalidated after a CPUID update. Invalidating the role forces all MMUs to re-evaluate the guest's MAXPHYADDR, and the guest's MAXPHYADDR can only be changed only through a CPUID update. This reverts commit de3ccd26fafc707b09792d9b633c8b5b48865315. Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is brokenSean Christopherson
Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest instability. Initialize last_vmentry_cpu to -1 and use it to detect if the vCPU has been run at least once when its CPUID model is changed. KVM does not correctly handle changes to paging related settings in the guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc... KVM could theoretically zap all shadow pages, but actually making that happen is a mess due to lock inversion (vcpu->mutex is held). And even then, updating paging settings on the fly would only work if all vCPUs are stopped, updated in concert with identical settings, then restarted. To support running vCPUs with different vCPU models (that affect paging), KVM would need to track all relevant information in kvm_mmu_page_role. Note, that's the _page_ role, not the full mmu_role. Updating mmu_role isn't sufficient as a vCPU can reuse a shadow page translation that was created by a vCPU with different settings and thus completely skip the reserved bit checks (that are tied to CPUID). Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as it would require doubling gfn_track from a u16 to a u32, i.e. would increase KVM's memory footprint by 2 bytes for every 4kb of guest memory. E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT would all need to be tracked. In practice, there is no remotely sane use case for changing any paging related CPUID entries on the fly, so just sweep it under the rug (after yelling at userspace). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86: Force all MMUs to reinitialize if guest CPUID is modifiedSean Christopherson
Invalidate all MMUs' roles after a CPUID update to force reinitizliation of the MMU context/helpers. Despite the efforts of commit de3ccd26fafc ("KVM: MMU: record maximum physical address width in kvm_mmu_extended_role"), there are still a handful of CPUID-based properties that affect MMU behavior but are not incorporated into mmu_role. E.g. 1gb hugepage support, AMD vs. Intel handling of bit 8, and SEV's C-Bit location all factor into the guest's reserved PTE bits. The obvious alternative would be to add all such properties to mmu_role, but doing so provides no benefit over simply forcing a reinitialization on every CPUID update, as setting guest CPUID is a rare operation. Note, reinitializing all MMUs after a CPUID update does not fix all of KVM's woes. Specifically, kvm_mmu_page_role doesn't track the CPUID properties, which means that a vCPU can reuse shadow pages that should not exist for the new vCPU model, e.g. that map GPAs that are now illegal (due to MAXPHYADDR changes) or that set bits that are now reserved (PAGE_SIZE for 1gb pages), etc... Tracking the relevant CPUID properties in kvm_mmu_page_role would address the majority of problems, but fully tracking that much state in the shadow page role comes with an unpalatable cost as it would require a non-trivial increase in KVM's memory footprint. The GBPAGES case is even worse, as neither Intel nor AMD provides a way to disable 1gb hugepage support in the hardware page walker, i.e. it's a virtualization hole that can't be closed when using TDP. In other words, resetting the MMU after a CPUID update is largely a superficial fix. But, it will allow reverting the tracking of MAXPHYADDR in the mmu_role, and that case in particular needs to mostly work because KVM's shadow_root_level depends on guest MAXPHYADDR when 5-level paging is supported. For cases where KVM botches guest behavior, the damage is limited to that guest. But for the shadow_root_level, a misconfigured MMU can cause KVM to incorrectly access memory, e.g. due to walking off the end of its shadow page tables. Fixes: 7dcd57552008 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed") Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack"Sean Christopherson
Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested virtualization. When KVM (L0) is using TDP, CR4.LA57 is not reflected in mmu_role.base.level because that tracks the shadow root level, i.e. TDP level. Normally, this is not an issue because LA57 can't be toggled while long mode is active, i.e. the guest has to first disable paging, then toggle LA57, then re-enable paging, thus ensuring an MMU reinitialization. But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57 without having to bounce through an unpaged section. L1 can also load a new CR3 on exit, i.e. it doesn't even need to play crazy paging games, a single entry PML5 is sufficient. Such shenanigans are only problematic if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode. Note, in the L2 case with nested TDP, even though L1 can switch between L2s with different LA57 settings, thus bypassing the paging requirement, in that case KVM's nested_mmu will track LA57 in base.level. This reverts commit 8053f924cad30bf9f9a24e02b6c8ddfabf5202ea. Fixes: 8053f924cad3 ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Use MMU's role to detect CR4.SMEP value in nested NPT walkSean Christopherson
Use the MMU's role to get its effective SMEP value when injecting a fault into the guest. When walking L1's (nested) NPT while L2 is active, vCPU state will reflect L2, whereas NPT uses the host's (L1 in this case) CR0, CR4, EFER, etc... If L1 and L2 have different settings for SMEP and L1 does not have EFER.NX=1, this can result in an incorrect PFEC.FETCH when injecting #NPF. Fixes: e57d4a356ad3 ("KVM: Add instruction fetch checking when walking guest page table") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86: Properly reset MMU context at vCPU RESET/INITSean Christopherson
Reset the MMU context at vCPU INIT (and RESET for good measure) if CR0.PG was set prior to INIT. Simply re-initializing the current MMU is not sufficient as the current root HPA may not be usable in the new context. E.g. if TDP is disabled and INIT arrives while the vCPU is in long mode, KVM will fail to switch to the 32-bit pae_root and bomb on the next VM-Enter due to running with a 64-bit CR3 in 32-bit mode. This bug was papered over in both VMX and SVM, but still managed to rear its head in the MMU role on VMX. Because EFER.LMA=1 requires CR0.PG=1, kvm_calc_shadow_mmu_root_page_role() checks for EFER.LMA without first checking CR0.PG. VMX's RESET/INIT flow writes CR0 before EFER, and so an INIT with the vCPU in 64-bit mode will cause the hack-a-fix to generate the wrong MMU role. In VMX, the INIT issue is specific to running without unrestricted guest since unrestricted guest is available if and only if EPT is enabled. Commit 8668a3c468ed ("KVM: VMX: Reset mmu context when entering real mode") resolved the issue by forcing a reset when entering emulated real mode. In SVM, commit ebae871a509d ("kvm: svm: reset mmu on VCPU reset") forced a MMU reset on every INIT to workaround the flaw in common x86. Note, at the time the bug was fixed, the SVM problem was exacerbated by a complete lack of a CR4 update. The vendor resets will be reverted in future patches, primarily to aid bisection in case there are non-INIT flows that rely on the existing VMX logic. Because CR0.PG is unconditionally cleared on INIT, and because CR0.WP and all CR4/EFER paging bits are ignored if CR0.PG=0, simply checking that CR0.PG was '1' prior to INIT/RESET is sufficient to detect a required MMU context reset. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Treat NX as used (not reserved) for all !TDP shadow MMUsSean Christopherson
Mark NX as being used for all non-nested shadow MMUs, as KVM will set the NX bit for huge SPTEs if the iTLB mutli-hit mitigation is enabled. Checking the mitigation itself is not sufficient as it can be toggled on at any time and KVM doesn't reset MMU contexts when that happens. KVM could reset the contexts, but that would require purging all SPTEs in all MMUs, for no real benefit. And, KVM already forces EFER.NX=1 when TDP is disabled (for WP=0, SMEP=1, NX=0), so technically NX is never reserved for shadow MMUs. Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Remove broken WARN that fires on 32-bit KVM w/ nested EPTSean Christopherson
Remove a misguided WARN that attempts to detect the scenario where using a special A/D tracking flag will set reserved bits on a non-MMIO spte. The WARN triggers false positives when using EPT with 32-bit KVM because of the !64-bit clause, which is just flat out wrong. The whole A/D tracking goo is specific to EPT, and one of the big selling points of EPT is that EPT is decoupled from the host's native paging mode. Drop the WARN instead of trying to salvage the check. Keeping a check specific to A/D tracking bits would essentially regurgitate the same code that led to KVM needed the tracking bits in the first place. A better approach would be to add a generic WARN on reserved bits being set, which would naturally cover the A/D tracking bits, work for all flavors of paging, and be self-documenting to some extent. Fixes: 8a406c89532c ("KVM: x86/mmu: Rename and document A/D scheme for TDP SPTEs") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: debugfs: Reuse binary stats descriptorsJing Zhang
To remove code duplication, use the binary stats descriptors in the implementation of the debugfs interface for statistics. This unifies the definition of statistics for the binary and debugfs interfaces. Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-8-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Add selftest for KVM statistics data binary interfaceJing Zhang
Add selftest to check KVM stats descriptors validity. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-7-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: stats: Add documentation for binary statistics interfaceJing Zhang
This new API provides a file descriptor for every VM and VCPU to read KVM statistics data in binary format. It is meant to provide a lightweight, flexible, scalable and efficient lock-free solution for user space telemetry applications to pull the statistics data periodically for large scale systems. The pulling frequency could be as high as a few times per second. The statistics descriptors are defined by KVM in kernel and can be by userspace to discover VM/VCPU statistics during the one-time setup stage. The statistics data itself could be read out by userspace telemetry periodically without any extra parsing or setup effort. There are a few existed interface protocols and definitions, but no one can fulfil all the requirements this interface implemented as below: 1. During high frequency periodic stats reading, there should be no extra efforts except the stats data read itself. 2. Support stats annotation, like type (cumulative, instantaneous, peak, histogram, etc) and unit (counter, time, size, cycles, etc). 3. The stats data reading should be free of lock/synchronization. We don't care about the consistency between all the stats data. All stats data can not be read out at exactly the same time. We really care about the change or trend of the stats data. The lock-free solution is not just for efficiency and scalability, also for the stats data accuracy and usability. For example, in the situation that all the stats data readings are protected by a global lock, if one VCPU died somehow with that lock held, then all stats data reading would be blocked, then we have no way from stats data that which VCPU has died. 4. The stats data reading workload can be handed over to other unprivileged process. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-6-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: stats: Support binary stats retrieval for a VCPUJing Zhang
Add a VCPU ioctl to get a statistics file descriptor by which a read functionality is provided for userspace to read out VCPU stats header, descriptors and data. Define VCPU statistics descriptors and header for all architectures. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-5-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: stats: Support binary stats retrieval for a VMJing Zhang
Add a VM ioctl to get a statistics file descriptor by which a read functionality is provided for userspace to read out VM stats header, descriptors and data. Define VM statistics descriptors and header for all architectures. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-4-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: stats: Add fd-based API to read binary stats dataJing Zhang
This commit defines the API for userspace and prepare the common functionalities to support per VM/VCPU binary stats data readings. The KVM stats now is only accessible by debugfs, which has some shortcomings this change series are supposed to fix: 1. The current debugfs stats solution in KVM could be disabled when kernel Lockdown mode is enabled, which is a potential rick for production. 2. The current debugfs stats solution in KVM is organized as "one stats per file", it is good for debugging, but not efficient for production. 3. The stats read/clear in current debugfs solution in KVM are protected by the global kvm_lock. Besides that, there are some other benefits with this change: 1. All KVM VM/VCPU stats can be read out in a bulk by one copy to userspace. 2. A schema is used to describe KVM statistics. From userspace's perspective, the KVM statistics are self-describing. 3. With the fd-based solution, a separate telemetry would be able to read KVM stats in a less privileged environment. 4. After the initial setup by reading in stats descriptors, a telemetry only needs to read the stats data itself, no more parsing or setup is needed. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-3-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: stats: Separate generic stats from architecture specific onesJing Zhang
Generic KVM stats are those collected in architecture independent code or those supported by all architectures; put all generic statistics in a separate structure. This ensures that they are defined the same way in the statistics API which is being added, removing duplication among different architectures in the declaration of the descriptors. No functional change intended. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-2-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Don't WARN on a NULL shadow page in TDP MMU checkSean Christopherson
Treat a NULL shadow page in the "is a TDP MMU" check as valid, non-TDP root. KVM uses a "direct" PAE paging MMU when TDP is disabled and the guest is running with paging disabled. In that case, root_hpa points at the pae_root page (of which only 32 bytes are used), not a standard shadow page, and the WARN fires (a lot). Fixes: 0b873fd7fb53 ("KVM: x86/mmu: Remove redundant is_tdp_mmu_enabled check") Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622072454.3449146-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: sefltests: Add x86-64 test to verify MMU reacts to CPUID updatesSean Christopherson
Add an x86-only test to verify that x86's MMU reacts to CPUID updates that impact the MMU. KVM has had multiple bugs where it fails to reconfigure the MMU after the guest's vCPU model changes. Sadly, this test is effectively limited to shadow paging because the hardware page walk handler doesn't support software disabling of GBPAGES support, and KVM doesn't manually walk the GVA->GPA on faults for performance reasons (doing so would large defeat the benefits of TDP). Don't require !TDP for the tests as there is still value in running the tests with TDP, even though the tests will fail (barring KVM hacks). E.g. KVM should not completely explode if MAXPHYADDR results in KVM using 4-level vs. 5-level paging for the guest. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Add hugepage support for x86-64Sean Christopherson
Add x86-64 hugepage support in the form of a x86-only variant of virt_pg_map() that takes an explicit page size. To keep things simple, follow the existing logic for 4k pages and disallow creating a hugepage if the upper-level entry is present, even if the desired pfn matches. Opportunistically fix a double "beyond beyond" reported by checkpatch. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Genericize upper level page table entry structSean Christopherson
In preparation for adding hugepage support, replace "pageMapL4Entry", "pageDirectoryPointerEntry", and "pageDirectoryEntry" with a common "pageUpperEntry", and add a helper to create an upper level entry. All upper level entries have the same layout, using unique structs provides minimal value and requires a non-trivial amount of code duplication. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Add PTE helper for x86-64 in preparation for hugepagesSean Christopherson
Add a helper to retrieve a PTE pointer given a PFN, address, and level in preparation for adding hugepage support. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Rename x86's page table "address" to "pfn"Sean Christopherson
Rename the "address" field to "pfn" in x86's page table structs to match reality. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Add wrapper to allocate page table pageSean Christopherson
Add a helper to allocate a page for use in constructing the guest's page tables. All architectures have identical address and memslot requirements (which appear to be arbitrary anyways). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Unconditionally allocate EPT tables in memslot 0Sean Christopherson
Drop the EPTP memslot param from all EPT helpers and shove the hardcoded '0' down to the vm_phy_page_alloc() calls. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Unconditionally use memslot '0' for page table allocationsSean Christopherson
Drop the memslot param from virt_pg_map() and virt_map() and shove the hardcoded '0' down to the vm_phy_page_alloc() calls. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Unconditionally use memslot 0 for vaddr allocationsSean Christopherson
Drop the memslot param(s) from vm_vaddr_alloc() now that all callers directly specific '0' as the memslot. Drop the memslot param from virt_pgd_alloc() as well since vm_vaddr_alloc() is its only user. I.e. shove the hardcoded '0' down to the vm_phy_pages_alloc() calls. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Use "standard" min virtual address for CPUID test allocSean Christopherson
Use KVM_UTIL_MIN_ADDR as the minimum for x86-64's CPUID array. The system page size was likely used as the minimum because _something_ had to be provided. Increasing the min from 0x1000 to 0x2000 should have no meaningful impact on the test, and will allow changing vm_vaddr_alloc() to use KVM_UTIL_MIN_VADDR as the default. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Use alloc page helper for xAPIC IPI testSean Christopherson
Use the common page allocation helper for the xAPIC IPI test, effectively raising the minimum virtual address from 0x1000 to 0x2000. Presumably the test won't explode if it can't get a page at address 0x1000... Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Use alloc_page helper for x86-64's GDT/IDT/TSS allocationsSean Christopherson
Switch to the vm_vaddr_alloc_page() helper for x86-64's "kernel" allocations now that the helper uses the same min virtual address as the open coded versions. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Lower the min virtual address for misc page allocationsSean Christopherson
Reduce the minimum virtual address of page allocations from 0x10000 to KVM_UTIL_MIN_VADDR (0x2000). Both values appear to be completely arbitrary, and reducing the min to KVM_UTIL_MIN_VADDR will allow for additional consolidation of code. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Add helpers to allocate N pages of virtual memorySean Christopherson
Add wrappers to allocate 1 and N pages of memory using de facto standard values as the defaults for minimum virtual address, data memslot, and page table memslot. Convert all compatible users. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Use "standard" min virtual address for Hyper-V pagesSean Christopherson
Use the de facto standard minimum virtual address for Hyper-V's hcall params page. It's the allocator's job to not double-allocate memory, i.e. there's no reason to force different regions for the params vs. hcall page. This will allow adding a page allocation helper with a "standard" minimum address. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Unconditionally use memslot 0 for x86's GDT/TSS setupSean Christopherson
Refactor x86's GDT/TSS allocations to for memslot '0' at its vm_addr_alloc() call sites instead of passing in '0' from on high. This is a step toward using a common helper for allocating pages. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Unconditionally use memslot 0 when loading elf binarySean Christopherson
Use memslot '0' for all vm_vaddr_alloc() calls when loading the test binary. This is the first step toward adding a helper to handle page allocations with a default value for the target memslot. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Zero out the correct page in the Hyper-V features testSean Christopherson
Fix an apparent copy-paste goof in hyperv_features where hcall_page (which is two pages, so technically just the first page) gets zeroed twice, and hcall_params gets zeroed none times. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: Remove errant asm/barrier.h include to fix arm64 buildSean Christopherson
Drop an unnecessary include of asm/barrier.h from dirty_log_test.c to allow the test to build on arm64. arm64, s390, and x86 all build cleanly without the include (PPC and MIPS aren't supported in KVM's selftests). arm64's barrier.h includes linux/kasan-checks.h, which is not copied into tools/. In file included from ../../../../tools/include/asm/barrier.h:8, from dirty_log_test.c:19: .../arm64/include/asm/barrier.h:12:10: fatal error: linux/kasan-checks.h: No such file or directory 12 | #include <linux/kasan-checks.h> | ^~~~~~~~~~~~~~~~~~~~~~ compilation terminated. Fixes: 84292e565951 ("KVM: selftests: Add dirty ring buffer test") Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622200529.3650424-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: nVMX: Handle split-lock #AC exceptions that happen in L2Sean Christopherson
Mark #ACs that won't be reinjected to the guest as wanted by L0 so that KVM handles split-lock #AC from L2 instead of forwarding the exception to L1. Split-lock #AC isn't yet virtualized, i.e. L1 will treat it like a regular #AC and do the wrong thing, e.g. reinject it into L2. Fixes: e6f8b6c12f03 ("KVM: VMX: Extend VMXs #AC interceptor to handle split lock #AC in guest") Cc: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622172244.3561540-1-seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Fix uninitialized boolean variable flushColin Ian King
In the case where kvm_memslots_have_rmaps(kvm) is false the boolean variable flush is not set and is uninitialized. If is_tdp_mmu_enabled(kvm) is true then the call to kvm_tdp_mmu_zap_collapsible_sptes passes the uninitialized value of flush into the call. Fix this by initializing flush to false. Addresses-Coverity: ("Uninitialized scalar variable") Fixes: e2209710ccc5 ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622150912.23429-1-colin.king@canonical.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: selftests: fix triple fault if ept=0 in dirty_log_testHou Wenlong
Commit 22f232d134e1 ("KVM: selftests: x86: Set supported CPUIDs on default VM") moved vcpu_set_cpuid into vm_create_with_vcpus, but dirty_log_test doesn't use it to create vm. So vcpu's CPUIDs is not set, the guest's pa_bits in kvm would be smaller than the value queried by userspace. However, the dirty track memory slot is in the highest GPA, the reserved bits in gpte would be set with wrong pa_bits. For shadow paging, page fault would fail in permission_fault and be injected into guest. Since guest doesn't have idt, it finally leads to vm_exit for triple fault. Move vcpu_set_cpuid into vm_vcpu_add_default to set supported CPUIDs on default vcpu, since almost all tests need it. Fixes: 22f232d134e1 ("KVM: selftests: x86: Set supported CPUIDs on default VM") Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com> Message-Id: <411ea2173f89abce56fc1fca5af913ed9c5a89c9.1624351343.git.houwenlong93@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86: Print CPU of last attempted VM-entry when dumping VMCS/VMCBJim Mattson
Failed VM-entry is often due to a faulty core. To help identify bad cores, print the id of the last logical processor that attempted VM-entry whenever dumping a VMCS or VMCB. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210621221648.1833148-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-23Merge branch 'topic/ppc-kvm' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux into HEAD - Support for the H_RPT_INVALIDATE hypercall - Conversion of Book3S entry/exit to C - Bug fixes
2021-06-23KVM: PPC: Book3S HV: Workaround high stack usage with clangNathan Chancellor
LLVM does not emit optimal byteswap assembly, which results in high stack usage in kvmhv_enter_nested_guest() due to the inlining of byteswap_pt_regs(). With LLVM 12.0.0: arch/powerpc/kvm/book3s_hv_nested.c:289:6: error: stack frame size of 2512 bytes in function 'kvmhv_enter_nested_guest' [-Werror,-Wframe-larger-than=] long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu) ^ 1 error generated. While this gets fixed in LLVM, mark byteswap_pt_regs() as noinline_for_stack so that it does not get inlined and break the build due to -Werror by default in arch/powerpc/. Not inlining saves approximately 800 bytes with LLVM 12.0.0: arch/powerpc/kvm/book3s_hv_nested.c:290:6: warning: stack frame size of 1728 bytes in function 'kvmhv_enter_nested_guest' [-Wframe-larger-than=] long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu) ^ 1 warning generated. Cc: stable@vger.kernel.org # v4.20+ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://github.com/ClangBuiltLinux/linux/issues/1292 Link: https://bugs.llvm.org/show_bug.cgi?id=49610 Link: https://lore.kernel.org/r/202104031853.vDT0Qjqj-lkp@intel.com/ Link: https://gist.github.com/ba710e3703bf45043a31e2806c843ffd Link: https://lore.kernel.org/r/20210621182440.990242-1-nathan@kernel.org
2021-06-22KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVMBharata B Rao
In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall H_RPT_INVALIDATE if available. The availability of this hcall is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions DT property. Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210621085003.904767-7-bharata@linux.ibm.com