|
With the introduction of stage-2 huge mappings in the pKVM hypervisor,
guest-page CMOs are needed at PMD_SIZE granularity. The fixmap only
supports PAGE_SIZE, and iterating over a huge page is time consuming
(mostly due to the TLBI in hyp_fixmap_unmap()), which is a problem for
EL2 latency.
Introduce a shared PMD_SIZE fixmap (hyp_fixblock_map/hyp_fixblock_unmap)
to improve guest page CMOs when stage-2 huge mappings are installed.
On a Pixel6, the iterative solution resulted in a latency of ~700us,
while the PMD_SIZE fixmap reduces it to ~100us.
Because of the horrendous private range allocation that would be
necessary, this is disabled on 64KiB page systems.
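As an illustration only, a minimal sketch of how a PMD_SIZE CMO could use
the new interface (the fixblock helper names are from this patch, but the
exact signatures and the wrapper are assumptions):
  /* Hypothetical wrapper; signatures assumed, not taken from the patch. */
  static void guest_clean_dcache_block(phys_addr_t phys)
  {
          void *va = hyp_fixblock_map(phys);         /* one PMD_SIZE mapping */

          __clean_dcache_guest_page(va, PMD_SIZE);   /* single CMO pass */
          hyp_fixblock_unmap();                      /* one TLBI instead of one per page */
  }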
Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-11-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull KVM/arm64 updates from Will Deacon:
"New features:
- Support for non-protected guests in protected mode, achieving near
feature parity with the non-protected mode
- Support for the EL2 timers as part of the ongoing NV support
- Allow control of hardware tracing for nVHE/hVHE
Improvements, fixes and cleanups:
- Massive cleanup of the debug infrastructure, making it a bit less
awkward and definitely easier to maintain. This should pave the way
for further optimisations
- Complete rewrite of pKVM's fixed-feature infrastructure, aligning
it with the rest of KVM and making the code easier to follow
- Large simplification of pKVM's memory protection infrastructure
- Better handling of RES0/RES1 fields for memory-backed system
registers
- Add a workaround for Qualcomm's Snapdragon X CPUs, which suffer
from a pretty nasty timer bug
- Small collection of cleanups and low-impact fixes"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (87 commits)
arm64/sysreg: Get rid of TRFCR_ELx SysregFields
KVM: arm64: nv: Fix doc header layout for timers
KVM: arm64: nv: Apply RESx settings to sysreg reset values
KVM: arm64: nv: Always evaluate HCR_EL2 using sanitising accessors
KVM: arm64: Fix selftests after sysreg field name update
coresight: Pass guest TRFCR value to KVM
KVM: arm64: Support trace filtering for guests
KVM: arm64: coresight: Give TRBE enabled state to KVM
coresight: trbe: Remove redundant disable call
arm64/sysreg/tools: Move TRFCR definitions to sysreg
tools: arm64: Update sysreg.h header files
KVM: arm64: Drop pkvm_mem_transition for host/hyp donations
KVM: arm64: Drop pkvm_mem_transition for host/hyp sharing
KVM: arm64: Drop pkvm_mem_transition for FF-A
KVM: arm64: Explicitly handle BRBE traps as UNDEFINED
KVM: arm64: vgic: Use str_enabled_disabled() in vgic_v3_probe()
arm64: kvm: Introduce nvhe stack size constants
KVM: arm64: Fix nVHE stacktrace VA bits mask
KVM: arm64: Fix FEAT_MTE in pKVM
Documentation: Update the behaviour of "kvm-arm.mode"
...
|
|
kvm_pgtable_stage2_relax_perms currently assumes that it is being called
from a 'shared' walker, which will not be true once called from pKVM. To
allow for the re-use of that function, make the walk flags one of its
parameters.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-7-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
kvm_pgtable_stage2_mkyoung currently assumes that it is being called
from a 'shared' walker, which will not be true once called from pKVM.
To allow for the re-use of that function, make the walk flags one of
its parameters.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-6-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The pKVM stage2 mapping code relies on an invalid physical address to
signal to the internal API that only the annotations of descriptors
should be updated, and these are stored in the high bits of invalid
descriptors covering memory that has been donated to protected guests,
and is therefore unmapped from the host stage-2 page tables.
Given that these invalid PAs are never stored into the descriptors, it
is better to rely on an explicit flag, to clarify the API and to avoid
confusion regarding whether or not the output address of a descriptor
can ever be invalid to begin with (which is not the case with LPA2).
That removes a dependency on the logic that reasons about the maximum PA
range, which differs on LPA2 capable CPUs based on whether LPA2 is
enabled or not, and will be further clarified in subsequent patches.
Cc: Quentin Perret <qperret@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241212081841.2168124-12-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Don't mark pages/folios as accessed in the primary MMU when making a SPTE
young in KVM's secondary MMU, as doing so relies on
kvm_pfn_to_refcounted_page(), and generally speaking is unnecessary and
wasteful. KVM participates in page aging via mmu_notifiers, so there's no
need to push "accessed" updates to the primary MMU.
Dropping use of kvm_set_pfn_accessed() also paves the way for removing
kvm_pfn_to_refcounted_page() and all its users.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-84-seanjc@google.com>
|
|
* kvm-arm64/s2-ptdump:
: .
: Stage-2 page table dumper, reusing the main ptdump infrastructure,
: courtesy of Sebastian Ene. From the cover letter:
:
: "This series extends the ptdump support to allow dumping the guest
: stage-2 pagetables. When CONFIG_PTDUMP_STAGE2_DEBUGFS is enabled, ptdump
: registers the new following files under debugfs:
: - /sys/debug/kvm/<guest_id>/stage2_page_tables
: - /sys/debug/kvm/<guest_id>/stage2_levels
: - /sys/debug/kvm/<guest_id>/ipa_range
:
: This allows userspace tools (eg. cat) to dump the stage-2 pagetables by
: reading the 'stage2_page_tables' file.
: [...]"
: .
KVM: arm64: Register ptdump with debugfs on guest creation
arm64: ptdump: Don't override the level when operating on the stage-2 tables
arm64: ptdump: Use the ptdump description from a local context
arm64: ptdump: Expose the attribute parsing functionality
KVM: arm64: Move pagetable definitions to common header
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
In preparation for using the stage-2 definitions in ptdump, move some of
these macros to the common header.
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Link: https://lore.kernel.org/r/20240909124721.1672199-2-sebastianene@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
This DSB guarantees page table updates have been made visible to the
hardware table walker. Moving the DSB from stage2_split_walker() to
after the walk is finished in kvm_pgtable_stage2_split() results in a
roughly 70% reduction in Clear Dirty Log Time in
dirty_log_perf_test (modified to use eager page splitting) when using
huge pages. This gain holds steady through a range of vcpus
used (tested 1-64) and memory used (tested 1-64GB).
This is safe to do because nothing else is using the page tables while
they are still being mapped and this is how other page table walkers
already function. None of them have a data barrier in the walker
itself because relative ordering of table PTEs to table contents comes
from the release semantics of stage2_make_pte().
Signed-off-by: Colton Lewis <coltonlewis@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240808174243.2836363-1-coltonlewis@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/pkvm-6.10: (25 commits)
: .
: At last, a bunch of pKVM patches, courtesy of Fuad Tabba.
: From the cover letter:
:
: "This series is a bit of a bombay-mix of patches we've been
: carrying. There's no one overarching theme, but they do improve
: the code by fixing existing bugs in pKVM, refactoring code to
: make it more readable and easier to re-use for pKVM, or adding
: functionality to the existing pKVM code upstream."
: .
KVM: arm64: Force injection of a data abort on NISV MMIO exit
KVM: arm64: Restrict supported capabilities for protected VMs
KVM: arm64: Refactor setting the return value in kvm_vm_ioctl_enable_cap()
KVM: arm64: Document the KVM/arm64-specific calls in hypercalls.rst
KVM: arm64: Rename firmware pseudo-register documentation file
KVM: arm64: Reformat/beautify PTP hypercall documentation
KVM: arm64: Clarify rationale for ZCR_EL1 value restored on guest exit
KVM: arm64: Introduce and use predicates that check for protected VMs
KVM: arm64: Add is_pkvm_initialized() helper
KVM: arm64: Simplify vgic-v3 hypercalls
KVM: arm64: Move setting the page as dirty out of the critical section
KVM: arm64: Change kvm_handle_mmio_return() return polarity
KVM: arm64: Fix comment for __pkvm_vcpu_init_traps()
KVM: arm64: Prevent kmemleak from accessing .hyp.data
KVM: arm64: Do not map the host fpsimd state to hyp in pKVM
KVM: arm64: Rename __tlb_switch_to_{guest,host}() in VHE
KVM: arm64: Support TLB invalidation in guest context
KVM: arm64: Avoid BBM when changing only s/w bits in Stage-2 PTE
KVM: arm64: Check for PTE validity when checking for executable/cacheable
KVM: arm64: Avoid BUG-ing from the host abort path
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Break-before-make (BBM) can be expensive, as transitioning via an
invalid mapping (i.e. the "break" step) requires the completion of TLB
invalidation and can also cause other agents to fault concurrently on
the invalid mapping.
Since BBM is not required when changing only the software bits of a PTE,
avoid the sequence in this case and just update the PTE directly.
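A rough sketch of the idea (the mask name and helper are illustrative,
not the patch's exact code):
  /* If old and new differ only in software bits, skip break-before-make. */
  static bool stage2_pte_only_sw_bits_differ(kvm_pte_t old, kvm_pte_t new)
  {
          return !((old ^ new) & ~KVM_PTE_LEAF_ATTR_HI_SW);  /* mask assumed */
  }

          if (stage2_pte_only_sw_bits_differ(ctx->old, new)) {
                  WRITE_ONCE(*ctx->ptep, new);  /* no "break" step, no TLBI */
                  return 0;
          }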
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-9-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Don't just assume that the PTE is valid when checking whether it
describes an executable or cacheable mapping.
This makes sure that we don't issue CMOs for invalid mappings.
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240423150538.2103045-8-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
When zapping a table entry in stage2_try_break_pte(), we issue range
TLB invalidation for the region that was mapped by the table. However,
we neglect to align the base address down to the granule size and so
if we ended up reaching the table entry via a misaligned address then
we will accidentally skip invalidation for some prefix of the affected
address range.
Align 'ctx->addr' down to the granule size when performing TLB
invalidation for an unmapped table in stage2_try_break_pte().
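The shape of the fix, roughly (sketch, not the literal diff):
          /* Cover the whole range the table mapped, not just from ctx->addr. */
          kvm_tlb_flush_vmid_range(mmu,
                                   ALIGN_DOWN(ctx->addr, kvm_granule_size(ctx->level)),
                                   kvm_granule_size(ctx->level));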
Cc: Raghavendra Rao Ananta <rananta@google.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Fixes: defc8cc7abf0 ("KVM: arm64: Invalidate the table entries upon a range")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240327124853.11206-5-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The TLBI level hints are for leaf entries only, so take care not to pass
them incorrectly after clearing a table entry.
Cc: Gavin Shan <gshan@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Quentin Perret <qperret@google.com>
Fixes: 82bb02445de5 ("KVM: arm64: Implement kvm_pgtable_hyp_unmap() at EL2")
Fixes: 6d9d2115c480 ("KVM: arm64: Add support for stage-2 map()/unmap() in generic page-table")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240327124853.11206-3-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Commit 7657ea920c54 ("KVM: arm64: Use TLBI range-based instructions for
unmap") introduced deferred TLB invalidation for the stage-2 page-table
so that range-based invalidation can be used for the accumulated
addresses. This works fine if the structure of the page-tables remains
unchanged, but if entire tables are zapped and subsequently freed then
we transiently leave the hardware page-table walker with a reference
to freed memory thanks to the translation walk caches. For example,
stage2_unmap_walker() will free page-table pages:
    if (childp)
            mm_ops->put_page(childp);
and issue the TLB invalidation later in kvm_pgtable_stage2_unmap():
    if (stage2_unmap_defer_tlb_flush(pgt))
            /* Perform the deferred TLB invalidations */
            kvm_tlb_flush_vmid_range(pgt->mmu, addr, size);
For now, take the conservative approach and invalidate the TLB eagerly
when we clear a table entry. Note, however, that the existing level
hint passed to __kvm_tlb_flush_vmid_ipa() is incorrect and will be
fixed in a subsequent patch.
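The conservative change amounts to roughly the following in the unmap
walker (sketch only; the level hint issue mentioned above is addressed
separately):
          if (kvm_pte_table(ctx->old, ctx->level))
                  /* Zapped a table: invalidate eagerly, don't defer. */
                  kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, ctx->addr, ctx->level);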
Cc: Raghavendra Rao Ananta <rananta@google.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240327124853.11206-2-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.9
- Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
- Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
- Conversion of KVM's representation of LPIs to an xarray, utilized to
address some of the serialization on the LPI injection path
- Support for _architectural_ VHE-only systems, advertised through the
absence of FEAT_E2H0 in the CPU's ID register
- Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
|
|
Currently, KVM for ARM64 maps at stage 2 memory that is considered device
(i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting
overrides (as per the ARM architecture [1]) any device MMIO mapping
present at stage 1, resulting in a set-up whereby a guest operating
system cannot determine device MMIO mapping memory attributes on its
own, because they are always overridden by the KVM stage 2 default.
This set-up does not allow guest operating systems to select device
memory attributes independently from KVM stage-2 mappings
(refer to [1], "Combining stage 1 and stage 2 memory type attributes"),
which turns out to be an issue in that guest operating systems
(e.g. Linux) may request to map device MMIO regions with memory
attributes, such as the Normal-NC memory type, that guarantee better
performance (e.g. the Gathering attribute, which for some devices can
generate larger PCIe memory write TLPs) and allow specific operations
(e.g. unaligned transactions).
The default device stage 2 mapping was chosen in KVM for ARM64 since
it was considered safer (i.e. it would not allow guests to trigger
uncontained failures ultimately crashing the machine), but such failures
turn out to be asynchronous (SError) anyway, defeating the purpose.
Failure containability is a property of the platform and is independent
of the memory type used for MMIO device memory mappings.
Actually, DEVICE_nGnRE memory type is even more problematic than
Normal-NC memory type in terms of fault containability in that e.g.
aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally,
synchronous (i.e. that would imply that the processor should issue at
most 1 load transaction at a time - it cannot pipeline them - otherwise
the synchronous abort semantics would break the no-speculation attribute
attached to DEVICE_XXX memory).
This means that regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures, which in turn relies on platform capabilities
and the device type being assigned (i.e. PCIe AER/DPC error containment
and RAS architecture [3]); the default KVM device stage 2 memory
attributes therefore play no role in making device assignment safer
for a given platform (provided the platform design adheres to the design
guidelines outlined in [3]) and can be relaxed.
For all these reasons, relax the KVM stage 2 device memory attributes
from DEVICE_nGnRE to Normal-NC.
Normal-NC was chosen over a different Normal memory type default
at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping.
Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
trigger any issue on guest device reclaim use cases either (i.e. device
MMIO unmap followed by a device reset) at least for PCIe devices, in that
in PCIe a device reset is architected and carried out through PCI config
space transactions that are naturally ordered with respect to MMIO
transactions according to the PCI ordering rules.
Having Normal-NC S2 default puts guests in control (thanks to
stage1+stage2 combined memory attributes rules [1]) of device MMIO
regions memory mappings, according to the rules described in [1]
and summarized here ([(S1) - stage1], [(S2) - stage 2]):
S1 | S2 | Result
NORMAL-WB | NORMAL-NC | NORMAL-NC
NORMAL-WT | NORMAL-NC | NORMAL-NC
NORMAL-NC | NORMAL-NC | NORMAL-NC
DEVICE<attr> | NORMAL-NC | DEVICE<attr>
It is worth noting that currently, to map device MMIO space to user
space in a device pass-through use case, the VFIO framework applies memory
attributes derived from pgprot_noncached() settings applied to VMAs, which
results in Device-nGnRnE memory attributes for the stage-1 VMM mappings.
This means that a userspace mapping for device MMIO space carried
out with the current VFIO framework and a guest OS mapping for the same
MMIO space may result in a mismatched alias as described in [2].
Defaulting KVM device stage-2 mappings to Normal-NC attributes does not
change anything in this respect, in that the mismatched aliases would
only affect (refer to [2] for a detailed explanation) the ordering between
the streams of transactions resulting from the userspace and GuestOS
mappings (i.e. it does not cause loss of properties for either stream of
transactions on its own), which is harmless given that the userspace
and GuestOS accesses to the device are carried out through independent
transaction streams.
A Normal-NC flag is not present today. So add a new kvm_pgtable_prot
(KVM_PGTABLE_PROT_NORMAL_NC) flag for it, along with its
corresponding PTE value 0x5 (0b101) determined from [1].
Lastly, adapt the stage2 PTE property setter function
(stage2_set_prot_attr) to handle the NormalNC attribute.
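Sketched out, the additions look roughly like this (the flag name comes
from this patch; the bit position and surrounding code are illustrative):
          KVM_PGTABLE_PROT_NORMAL_NC = BIT(4),       /* new prot flag; bit assumed */

          /* in stage2_set_prot_attr(), roughly: */
          case KVM_PGTABLE_PROT_NORMAL_NC:
                  attr = KVM_S2_MEMATTR(pgt, NORMAL_NC);  /* MemAttr == 0b101 */
                  break;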
The entire discussion leading to this patch series may be followed through
the following links.
Link: https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com
Link: https://lore.kernel.org/r/20231205033015.10044-1-ankita@nvidia.com
[1] section D8.5.5 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
[2] section B2.8 - DDI0487J_a_a-profile_architecture_reference_manual.pdf
[3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20240224150546.368-2-ankita@nvidia.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
kvm_pgtable_stage2_free_unlinked() does the final put_page() on the
root page of the sub-tree before returning, so remove the additional
put_page() invocations in the callers.
Cc: Ricardo Koller <ricarkol@google.com>
Fixes: f6a27d6dc51b2 ("KVM: arm64: Drop last page ref in kvm_pgtable_stage2_free_removed()")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240212193052.27765-1-will@kernel.org
|
|
FEAT_LPA2 increases the maximum levels of translation from 4 to 5 for
the 4KB page case, when IA is >48 bits. While we can still use 4 levels
for stage2 translation in this case (due to stage2 allowing concatenated
page tables for first level lookup), the same kvm_pgtable library is
used for the hyp stage1 page tables and stage1 does not support
concatenation.
Therefore, modify the library to support up to 5 levels. Previous
patches already laid the groundwork for this by refactoring code to work
in terms of KVM_PGTABLE_FIRST_LEVEL and KVM_PGTABLE_LAST_LEVEL. So we
just need to change these macros.
The hardware sometimes encodes the new level differently from the
others: One such place is when reading the level from the FSC field in
the ESR_EL2 register. We never expect to see the lowest level (-1) here
since the stage 2 page tables always use concatenated tables for first
level lookup and therefore only use 4 levels of lookup. So we get away
with just adding a comment to explain why we are not being careful about
decoding level -1.
For stage2 VTCR_EL2.SL2 is introduced to encode the new start level.
However, since we always use concatenated page tables for first level
look up at stage2 (and therefore we will never need the new extra level)
we never touch this new field.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231127111737.1897081-10-ryan.roberts@arm.com
|
|
With the introduction of FEAT_LPA2, the Arm ARM adds a new level of
translation, level -1, so levels can now be in the range [-1;3]. 3 is
always the last level and the first level is determined based on the
number of VA bits in use.
Convert level variables to use a signed type in preparation for
supporting this new level -1.
Since the last level is always anchored at 3, and the first level varies
to suit the number of VA/IPA bits, take the opportunity to replace
KVM_PGTABLE_MAX_LEVELS with the 2 macros KVM_PGTABLE_FIRST_LEVEL and
KVM_PGTABLE_LAST_LEVEL. This removes the assumption from the code that
levels run from 0 to KVM_PGTABLE_MAX_LEVELS - 1, which will soon no
longer be true.
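A sketch of the resulting shape (the first level shown here is pre-LPA2;
it becomes -1 only once the extra level is enabled in a later patch, and
level variables move to a signed type such as s8):
  #define KVM_PGTABLE_FIRST_LEVEL         0       /* -1 once LPA2 lands */
  #define KVM_PGTABLE_LAST_LEVEL          3

          s8 start_level;         /* signed, so level -1 stays representable */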
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231127111737.1897081-9-ryan.roberts@arm.com
|
|
Implement a simple policy whereby if the HW supports FEAT_LPA2 for the
page size we are using, always use LPA2-style page-tables for stage 2
and hyp stage 1 (assuming an nvhe hyp), regardless of the VMM-requested
IPA size or HW-implemented PA size. When in use we can now support up to
52-bit IPA and PA sizes.
We use the previously created cpu feature to track whether LPA2 is
supported for deciding whether to use the LPA2 or classic pte format.
Note that FEAT_LPA2 brings support for bigger block mappings (512GB with
4KB, 64GB with 16KB). We explicitly don't enable these in the library
because stage2_apply_range() works on batch sizes of the largest used
block mapping, and increasing the size of the batch would lead to soft
lockups. See commit 5994bc9e05c2 ("KVM: arm64: Limit
stage2_apply_range() batch size to largest block").
With the addition of LPA2 support in the hypervisor, the PA size
supported by the HW must be capped with a runtime decision, rather than
simply using a compile-time decision based on PA_BITS. For example, on a
system that advertises 52-bit PA but does not support FEAT_LPA2, a 4KB
or 16KB kernel compiled with LPA2 support must still limit the PA size
to 48 bits.
Therefore, move the insertion of the PS field into TCR_EL2 out of
__kvm_hyp_init assembly code and instead do it in cpu_prepare_hyp_mode()
where the rest of TCR_EL2 is prepared. This allows us to figure out PS
with kvm_get_parange(), which has the appropriate logic to ensure the
above requirement. (and the PS field of VTCR_EL2 is already populated
this way).
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231127111737.1897081-8-ryan.roberts@arm.com
|
|
Pull kvm updates from Paolo Bonzini:
"ARM:
- Generalized infrastructure for 'writable' ID registers, effectively
allowing userspace to opt-out of certain vCPU features for its
guest
- Optimization for vSGI injection, opportunistically compressing
MPIDR to vCPU mapping into a table
- Improvements to KVM's PMU emulation, allowing userspace to select
the number of PMCs available to a VM
- Guest support for memory operation instructions (FEAT_MOPS)
- Cleanups to handling feature flags in KVM_ARM_VCPU_INIT, squashing
bugs and getting rid of useless code
- Changes to the way the SMCCC filter is constructed, avoiding wasted
memory allocations when not in use
- Load the stage-2 MMU context at vcpu_load() for VHE systems,
reducing the overhead of errata mitigations
- Miscellaneous kernel and selftest fixes
LoongArch:
- New architecture for kvm.
The hardware uses the same model as x86, s390 and RISC-V, where
guest/host mode is orthogonal to supervisor/user mode. The
virtualization extensions are very similar to MIPS, therefore the
code also has some similarities but it's been cleaned up to avoid
some of the historical bogosities that are found in arch/mips. The
kernel emulates MMU, timer and CSR accesses, while interrupt
controllers are only emulated in userspace, at least for now.
RISC-V:
- Support for the Smstateen and Zicond extensions
- Support for virtualizing senvcfg
- Support for virtualized SBI debug console (DBCN)
S390:
- Nested page table management can be monitored through tracepoints
and statistics
x86:
- Fix incorrect handling of VMX posted interrupt descriptor in
KVM_SET_LAPIC, which could result in a dropped timer IRQ
- Avoid WARN on systems with Intel IPI virtualization
- Add CONFIG_KVM_MAX_NR_VCPUS, to allow supporting up to 4096 vCPUs
without forcing more common use cases to eat the extra memory
overhead.
- Add virtualization support for AMD SRSO mitigation (IBPB_BRTYPE and
SBPB, aka Selective Branch Predictor Barrier).
- Fix a bug where restoring a vCPU snapshot that was taken within 1
second of creating the original vCPU would cause KVM to try to
synchronize the vCPU's TSC and thus clobber the correct TSC being
set by userspace.
- Compute guest wall clock using a single TSC read to avoid
generating an inaccurate time, e.g. if the vCPU is preempted
between multiple TSC reads.
- "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which
complain about a "Firmware Bug" if the bit isn't set for select
F/M/S combos. Likewise "virtualize" (ignore) MSR_AMD64_TW_CFG to
appease Windows Server 2022.
- Don't apply side effects to Hyper-V's synthetic timer on writes
from userspace to fix an issue where the auto-enable behavior can
trigger spurious interrupts, i.e. do auto-enabling only for guest
writes.
- Remove an unnecessary kick of all vCPUs when synchronizing the
dirty log without PML enabled.
- Advertise "support" for non-serializing FS/GS base MSR writes as
appropriate.
- Harden the fast page fault path to guard against encountering an
invalid root when walking SPTEs.
- Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
- Use the fast path directly from the timer callback when delivering
Xen timer events, instead of waiting for the next iteration of the
run loop. This was not done so far because previously proposed code
had races, but now care is taken to stop the hrtimer at critical
points such as restarting the timer or saving the timer information
for userspace.
- Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future
flag.
- Optimize injection of PMU interrupts that are simultaneous with
NMIs.
- Usual handful of fixes for typos and other warts.
x86 - MTRR/PAT fixes and optimizations:
- Clean up code that deals with honoring guest MTRRs when the VM has
non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.
- Zap EPT entries when non-coherent DMA assignment stops/start to
prevent using stale entries with the wrong memtype.
- Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y
This was done as a workaround for virtual machine BIOSes that did
not bother to clear CR0.CD (because ancient KVM/QEMU did not bother
to set it, in turn), and there's zero reason to extend the quirk to
also ignore guest PAT.
x86 - SEV fixes:
- Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts
SHUTDOWN while running an SEV-ES guest.
- Clean up the recognition of emulation failures on SEV guests, when
KVM would like to "skip" the instruction but it had already been
partially emulated. This makes it possible to drop a hack that
second guessed the (insufficient) information provided by the
emulator, and just do the right thing.
Documentation:
- Various updates and fixes, mostly for x86
- MTRR and PAT fixes and optimizations"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (164 commits)
KVM: selftests: Avoid using forced target for generating arm64 headers
tools headers arm64: Fix references to top srcdir in Makefile
KVM: arm64: Add tracepoint for MMIO accesses where ISV==0
KVM: arm64: selftest: Perform ISB before reading PAR_EL1
KVM: arm64: selftest: Add the missing .guest_prepare()
KVM: arm64: Always invalidate TLB for stage-2 permission faults
KVM: x86: Service NMI requests after PMI requests in VM-Enter path
KVM: arm64: Handle AArch32 SPSR_{irq,abt,und,fiq} as RAZ/WI
KVM: arm64: Do not let a L1 hypervisor access the *32_EL2 sysregs
KVM: arm64: Refine _EL2 system register list that require trap reinjection
arm64: Add missing _EL2 encodings
arm64: Add missing _EL12 encodings
KVM: selftests: aarch64: vPMU test for validating user accesses
KVM: selftests: aarch64: vPMU register test for unimplemented counters
KVM: selftests: aarch64: vPMU register test for implemented counters
KVM: selftests: aarch64: Introduce vpmu_counter_access test
tools: Import arm_pmuv3.h
KVM: arm64: PMU: Allow userspace to limit PMCR_EL0.N for the guest
KVM: arm64: Sanitize PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR} before first run
KVM: arm64: Add {get,set}_user for PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
...
|
|
* kvm-arm64/stage2-vhe-load:
: Setup stage-2 MMU from vcpu_load() for VHE
:
: Unlike nVHE, there is no need to switch the stage-2 MMU around on guest
: entry/exit in VHE mode as the host is running at EL2. Despite this KVM
: reloads the stage-2 on every guest entry, which is needless.
:
: This series moves the setup of the stage-2 MMU context to vcpu_load()
: when running in VHE mode. This is likely to be a win across the board,
: but also allows us to remove an ISB on the guest entry path for systems
: with one of the speculative AT errata.
KVM: arm64: Move VTCR_EL2 into struct s2_mmu
KVM: arm64: Load the stage-2 MMU context in kvm_vcpu_load_vhe()
KVM: arm64: Rename helpers for VHE vCPU load/put
KVM: arm64: Reload stage-2 for VMID change on VHE
KVM: arm64: Restore the stage-2 context in VHE's __tlb_switch_to_host()
KVM: arm64: Don't zero VTTBR in __tlb_switch_to_host()
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
It is possible for multiple vCPUs to fault on the same IPA and attempt
to resolve the fault. One of the page table walks will actually update
the PTE and the rest will return -EAGAIN per our race detection scheme.
KVM elides the TLB invalidation on the racing threads as the return
value is nonzero.
Before commit a12ab1378a88 ("KVM: arm64: Use local TLBI on permission
relaxation") KVM always used broadcast TLB invalidations when handling
permission faults, which had the convenient property of making the
stage-2 updates visible to all CPUs in the system. However now we do a
local invalidation, and TLBI elision leads to the vCPU thread faulting
again on the stale entry. Remember that the architecture permits the TLB
to cache translations that precipitate a permission fault.
Invalidate the TLB entry responsible for the permission fault if the
stage-2 descriptor has been relaxed, regardless of which thread actually
did the job.
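Roughly the shape of the fix in kvm_pgtable_stage2_relax_perms() (sketch):
          ret = stage2_update_leaf_attrs(pgt, addr, 1, set, clr, NULL, &level, ...);
          /* Invalidate even when another vCPU won the race (-EAGAIN). */
          if (!ret || ret == -EAGAIN)
                  kvm_call_hyp(__kvm_tlb_flush_vmid_ipa_nsh, pgt->mmu, addr, level);
          return ret;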
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230922223229.1608155-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
We currently have a global VTCR_EL2 value for each guest, even
if the guest uses NV. This implies that the guest's own S2 must
fit in the host's. This is odd, for multiple reasons:
- the PARange values and the number of IPA bits don't necessarily
match: you can have 33 bits of IPA space, and yet you can only
describe 32 or 36 bits of PARange
- When userspace set the IPA space, it creates a contract with the
kernel saying "this is the IPA space I'm prepared to handle".
At no point does it constrain the guest's own IPA space as
long as the guest doesn't try to use a [I]PA outside of the
IPA space set by userspace
- We don't even try to hide the value of ID_AA64MMFR0_EL1.PARange.
And then there is the consequence of the above: if a guest tries
to create a S2 that has for input address something that is larger
than the IPA space defined by the host, we inject a fatal exception.
This is no good. For all intents and purposes, a guest should be
able to have the S2 it really wants, as long as the *output* address
of that S2 isn't outside of the IPA space.
For that, we need to have a per-s2_mmu VTCR_EL2 setting, which
allows us to represent the full PARange. Move the vtcr field into
the s2_mmu structure, which has no impact whatsoever, except for NV.
Note that once we are able to override ID_AA64MMFR0_EL1.PARange
from userspace, we'll also be able to restrict the size of the
shadow S2 that NV uses.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20231012205108.3937270-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
In system_supports_bti() we use cpus_have_const_cap() to check for
ARM64_HAS_BTI, but this is not necessary and alternative_has_cap_*() or
cpus_have_final_*cap() would be preferable.
For historical reasons, cpus_have_const_cap() is more complicated than
it needs to be. Before cpucaps are finalized, it will perform a bitmap
test of the system_cpucaps bitmap, and once cpucaps are finalized it
will use an alternative branch. This used to be necessary to handle some
race conditions in the window between cpucap detection and the
subsequent patching of alternatives and static branches, where different
branches could be out-of-sync with one another (or w.r.t. alternative
sequences). Now that we use alternative branches instead of static
branches, these are all patched atomically w.r.t. one another, and there
are only a handful of cases that need special care in the window between
cpucap detection and alternative patching.
Due to the above, it would be nice to remove cpus_have_const_cap(), and
migrate callers over to alternative_has_cap_*(), cpus_have_final_cap(),
or cpus_have_cap() depending on their requirements. This will
remove redundant instructions and improve code generation, and will make
it easier to determine how each callsite will behave before, during, and
after alternative patching.
When CONFIG_ARM64_BTI_KERNEL=y, the ARM64_HAS_BTI cpucap is a strict
boot cpu feature which is detected and patched early on the boot cpu.
All uses guarded by CONFIG_ARM64_BTI_KERNEL happen after the boot CPU
has detected ARM64_HAS_BTI and patched boot alternatives, and hence can
safely use alternative_has_cap_*() or cpus_have_final_boot_cap().
Regardless of CONFIG_ARM64_BTI_KERNEL, all other uses of ARM64_HAS_BTI
happen after system capabilities have been finalized and alternatives
have been patched. Hence these can safely use alternative_has_cap_*() or
cpus_have_final_cap().
This patch splits system_supports_bti() into system_supports_bti() and
system_supports_bti_kernel(), with the former handling where the cpucap
affects userspace functionality, and the latter handling where the
cpucap affects kernel functionality. The use of cpus_have_const_cap() is
replaced by cpus_have_final_cap() in system_supports_bti(), and
cpus_have_final_boot_cap() in system_supports_bti_kernel(). This will
avoid generating code to test the system_cpucaps bitmap and should be
better for all subsequent calls at runtime. The use of
cpus_have_final_cap() and cpus_have_final_boot_cap() will make it easier
to spot if code is changed such that these run before the ARM64_HAS_BTI
cpucap is guaranteed to have been finalized.
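The resulting predicates end up roughly as follows (sketch):
  static inline bool system_supports_bti(void)
  {
          return cpus_have_final_cap(ARM64_HAS_BTI);
  }

  static inline bool system_supports_bti_kernel(void)
  {
          return IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) &&
                 cpus_have_final_boot_cap(ARM64_HAS_BTI);
  }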
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Much of the arm64 KVM code uses cpus_have_const_cap() to check for
cpucaps, but this is unnecessary and it would be preferable to use
cpus_have_final_cap().
For historical reasons, cpus_have_const_cap() is more complicated than
it needs to be. Before cpucaps are finalized, it will perform a bitmap
test of the system_cpucaps bitmap, and once cpucaps are finalized it
will use an alternative branch. This used to be necessary to handle some
race conditions in the window between cpucap detection and the
subsequent patching of alternatives and static branches, where different
branches could be out-of-sync with one another (or w.r.t. alternative
sequences). Now that we use alternative branches instead of static
branches, these are all patched atomically w.r.t. one another, and there
are only a handful of cases that need special care in the window between
cpucap detection and alternative patching.
Due to the above, it would be nice to remove cpus_have_const_cap(), and
migrate callers over to alternative_has_cap_*(), cpus_have_final_cap(),
or cpus_have_cap() depending on their requirements. This will
remove redundant instructions and improve code generation, and will make
it easier to determine how each callsite will behave before, during, and
after alternative patching.
KVM is initialized after cpucaps have been finalized and alternatives
have been patched. Since commit:
d86de40decaa14e6 ("arm64: cpufeature: upgrade hyp caps to final")
... use of cpus_have_const_cap() in hyp code is automatically converted
to use cpus_have_final_cap():
| static __always_inline bool cpus_have_const_cap(int num)
| {
| if (is_hyp_code())
| return cpus_have_final_cap(num);
| else if (system_capabilities_finalized())
| return __cpus_have_const_cap(num);
| else
| return cpus_have_cap(num);
| }
Thus, converting hyp code to use cpus_have_final_cap() directly will not
result in any functional change.
Non-hyp KVM code is also not executed until cpucaps have been finalized,
and it would be preferable to extend the same treatment to this code and
use cpus_have_final_cap() directly.
This patch converts instances of cpus_have_const_cap() in KVM-only code
over to cpus_have_final_cap(). As all of this code runs after cpucaps
have been finalized, there should be no functional change as a result of
this patch, but the redundant instructions generated by
cpus_have_const_cap() will be removed from the non-hyp KVM code.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The current implementation of the stage-2 unmap walker traverses
the given range and, as a part of break-before-make, performs
a TLB invalidation with a DSB for every PTE. Performing many of these
operations can cause a performance bottleneck on some systems.
Hence, if the system supports FEAT_TLBIRANGE, defer the TLB
invalidations until the entire walk is finished, and then
use range-based instructions to invalidate the TLBs in one go.
Condition deferred TLB invalidation on the system supporting FWB,
as the optimization is entirely pointless when the unmap walker
needs to perform CMOs.
Rename stage2_put_pte() to stage2_unmap_put_pte() as the function
now serves the stage-2 unmap walker specifically, rather than
being generic.
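The deferral condition boils down to roughly the following helper (sketch):
  static bool stage2_unmap_defer_tlb_flush(struct kvm_pgtable *pgt)
  {
          /*
           * Defer only when range-based TLBI is available and FWB makes
           * CMOs unnecessary on unmap.
           */
          return system_supports_tlb_range() && stage2_has_fwb(pgt);
  }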
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-15-rananta@google.com
|
|
Currently, during operations such as a hugepage collapse,
KVM would flush the entire VM's context using 'vmalls12e1is'
TLBI operation. Specifically, if the VM is faulting on many
hugepages (say after dirty-logging), it creates a performance
penalty for the guest whose pages have already been faulted
earlier as they would have to refill their TLBs again.
Instead, leverage kvm_tlb_flush_vmid_range() for table entries.
If the system supports it, only the required range will be
flushed. Otherwise, it falls back to the previous mechanism.
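In stage2_try_break_pte(), the change is roughly (sketch):
          else if (kvm_pte_table(ctx->old, ctx->level))
                  /* was: kvm_call_hyp(__kvm_tlb_flush_vmid, mmu) */
                  kvm_tlb_flush_vmid_range(mmu, ctx->addr,
                                           kvm_granule_size(ctx->level));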
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-14-rananta@google.com
|
|
Implement the helper kvm_tlb_flush_vmid_range() that acts
as a wrapper for range-based TLB invalidations. For the
given VMID, use the range-based TLBI instructions to do
the job, or fall back to invalidating all the TLB entries.
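A simplified sketch of the wrapper's shape (the real helper also splits
the range into chunks the range-based TLBI instructions can express):
  void kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
                                phys_addr_t addr, size_t size)
  {
          if (!system_supports_tlb_range()) {
                  kvm_call_hyp(__kvm_tlb_flush_vmid, mmu);    /* flush it all */
                  return;
          }

          kvm_call_hyp(__kvm_tlb_flush_vmid_range, mmu, addr, size >> PAGE_SHIFT);
  }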
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230811045127.3308641-11-rananta@google.com
|
|
Userspace is allowed to select any PAGE_SIZE aligned hva to back guest
memory. This is even the case with hugepages, although it is a rather
suboptimal configuration as PTE level mappings are used at stage-2.
The arm64 page aging handlers have an assumption that the specified
range is exactly one page/block of memory, which in the aforementioned
case is not necessarily true. Altogether, this leads to the WARN() in
kvm_age_gfn() firing.
However, the WARN is only part of the issue as the table walkers visit
at most a single leaf PTE. For hugepage-backed memory in a memslot that
isn't hugepage-aligned, page aging entirely misses accesses to the
hugepage beyond the first page in the memslot.
Add a new walker dedicated to handling page aging MMU notifiers capable
of walking a range of PTEs. Convert kvm(_test)_age_gfn() over to the new
walker and drop the WARN that caught the issue in the first place. The
implementation of this walker was inspired by the test_clear_young()
implementation by Yu Zhao [*], but repurposed to address a bug in the
existing aging implementation.
Cc: stable@vger.kernel.org # v5.15
Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers")
Link: https://lore.kernel.org/kvmarm/20230526234435.662652-6-yuzhao@google.com/
Co-developed-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Link: https://lore.kernel.org/r/20230627235405.4069823-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.5
- Eager page splitting optimization for dirty logging, optionally
allowing for a VM to avoid the cost of block splitting in the stage-2
fault path.
- Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact with
services that live in the Secure world. pKVM intervenes on FF-A calls
to guarantee the host doesn't misuse memory donated to the hyp or a
pKVM guest.
- Support for running the split hypervisor with VHE enabled, known as
'hVHE' mode. This is extremely useful for testing the split
hypervisor on VHE-only systems, and paves the way for new use cases
that depend on having two TTBRs available at EL2.
- Generalized framework for configurable ID registers from userspace.
KVM/arm64 currently prevents arbitrary CPU feature set configuration
from userspace, but the intent is to relax this limitation and allow
userspace to select a feature set consistent with the CPU.
- Enable the use of Branch Target Identification (FEAT_BTI) in the
hypervisor.
- Use a separate set of pointer authentication keys for the hypervisor
when running in protected mode, as the host is untrusted at runtime.
- Ensure timer IRQs are consistently released in the init failure
paths.
- Avoid trapping CTR_EL0 on systems with Enhanced Virtualization Traps
(FEAT_EVT), as it is a register commonly read from userspace.
- Erratum workaround for the upcoming AmpereOne part, which has broken
hardware A/D state management.
As a consequence of the hVHE series reworking the arm64 software
features framework, the for-next/module-alloc branch from the arm64 tree
comes along for the ride.
|
|
* kvm-arm64/ampere1-hafdbs-mitigation:
: AmpereOne erratum AC03_CPU_38 mitigation
:
: AmpereOne does not advertise support for FEAT_HAFDBS due to an
: underlying erratum in the feature. The associated control bits do not
: have RES0 behavior as required by the architecture.
:
: Introduce mitigations to prevent KVM from enabling the feature at
: stage-2 as well as preventing KVM guests from enabling HAFDBS at
: stage-1.
KVM: arm64: Prevent guests from enabling HA/HD on Ampere1
KVM: arm64: Refactor HFGxTR configuration into separate helpers
arm64: errata: Mitigate Ampere1 erratum AC03_CPU_38 at stage-2
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
AmpereOne has an erratum in its implementation of FEAT_HAFDBS that
required disabling the feature on the design. This was done by reporting
the feature as not implemented in the ID register, although the
corresponding control bits were not actually RES0. This does not align
well with the requirements of the architecture, which mandates these
bits be RES0 if HAFDBS isn't implemented.
The kernel's use of stage-1 is unaffected, as the HA and HD bits are
only set if HAFDBS is detected in the ID register. KVM, on the other
hand, relies on the RES0 behavior at stage-2 to use the same value for
VTCR_EL2 on any cpu in the system. Mitigate the non-RES0 behavior by
leaving VTCR_EL2.HA clear on affected systems.
Cc: stable@vger.kernel.org
Cc: D Scott Phillips <scott@os.amperecomputing.com>
Cc: Darren Hart <darren@os.amperecomputing.com>
Acked-by: D Scott Phillips <scott@os.amperecomputing.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20230609220104.1836988-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/misc:
: Miscellaneous updates
:
: - Avoid trapping CTR_EL0 on systems with FEAT_EVT, as the register is
: commonly read by userspace
:
: - Make use of FEAT_BTI at hyp stage-1, setting the Guard Page bit to 1
: for executable mappings
:
: - Use a separate set of pointer authentication keys for the hypervisor
: when running in protected mode (i.e. pKVM)
:
: - Plug a few holes in timer initialization where KVM fails to free the
: timer IRQ(s)
KVM: arm64: Use different pointer authentication keys for pKVM
KVM: arm64: timers: Fix resource leaks in kvm_timer_hyp_init()
KVM: arm64: Use BTI for nvhe
KVM: arm64: Relax trapping of CTR_EL0 when FEAT_EVT is available
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/hvhe:
: Support for running split-hypervisor w/VHE, courtesy of Marc Zyngier
:
: From the cover letter:
:
: KVM (on ARMv8.0) and pKVM (on all revisions of the architecture) use
: the split hypervisor model that makes the EL2 code more or less
: standalone. In the later case, we totally ignore the VHE mode and
: stick with the good old v8.0 EL2 setup.
:
: We introduce a new "mode" for KVM called hVHE, in reference to the
: nVHE mode, and indicating that only the hypervisor is using VHE.
KVM: arm64: Fix hVHE init on CPUs where HCR_EL2.E2H is not RES1
arm64: Allow arm64_sw.hvhe on command line
KVM: arm64: Force HCR_E2H in guest context when ARM64_KVM_HVHE is set
KVM: arm64: Program the timer traps with VHE layout in hVHE mode
KVM: arm64: Rework CPTR_EL2 programming for HVHE configuration
KVM: arm64: Adjust EL2 stage-1 leaf AP bits when ARM64_KVM_HVHE is set
KVM: arm64: Disable TTBR1_EL2 when using ARM64_KVM_HVHE
KVM: arm64: Force HCR_EL2.E2H when ARM64_KVM_HVHE is set
KVM: arm64: Key use of VHE instructions in nVHE code off ARM64_KVM_HVHE
KVM: arm64: Remove alternatives from sysreg accessors in VHE hypervisor context
arm64: Use CPACR_EL1 format to set CPTR_EL2 when E2H is set
arm64: Allow EL1 physical timer access when running VHE
arm64: Don't enable VHE for the kernel if OVERRIDE_HVHE is set
arm64: Add KVM_HVHE capability and has_hvhe() predicate
arm64: Turn kaslr_feature_override into a generic SW feature override
arm64: Prevent the use of is_kernel_in_hyp_mode() in hypervisor code
KVM: arm64: Drop is_kernel_in_hyp_mode() from __invalidate_icache_guest_page()
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
* kvm-arm64/eager-page-splitting:
: Eager Page Splitting, courtesy of Ricardo Koller.
:
: Dirty logging performance is dominated by the cost of splitting
: hugepages to PTE granularity. On systems that mere mortals can get their
: hands on, each fault incurs the cost of a full break-before-make
: pattern, wherein the broadcast invalidation and ensuing serialization
: significantly increases fault latency.
:
: The goal of eager page splitting is to move the cost of hugepage
: splitting out of the stage-2 fault path and instead into the ioctls
: responsible for managing the dirty log:
:
: - If manual protection is enabled for the VM, hugepage splitting
: happens in the KVM_CLEAR_DIRTY_LOG ioctl. This is desirable as it
: provides userspace granular control over hugepage splitting.
:
: - Otherwise, if userspace relies on the legacy dirty log behavior
: (clear on collection), hugepage splitting is done at the moment dirty
: logging is enabled for a particular memslot.
:
: Support for eager page splitting requires explicit opt-in from
: userspace, which is realized through the
: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE capability.
arm64: kvm: avoid overflow in integer division
KVM: arm64: Use local TLBI on permission relaxation
KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG
KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked()
KVM: arm64: Split huge pages when dirty logging is enabled
KVM: arm64: Add kvm_uninit_stage2_mmu()
KVM: arm64: Refactor kvm_arch_commit_memory_region()
KVM: arm64: Add kvm_pgtable_stage2_split()
KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE
KVM: arm64: Export kvm_are_all_memslots_empty()
KVM: arm64: Add helper for creating unlinked stage2 subtrees
KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs
KVM: arm64: Rename free_removed to free_unlinked
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The EL2 stage-1 page-table format is subtly (and annoyingly) different
when HCR_EL2.E2H is set.
Take the ARM64_KVM_HVHE configuration into account when setting
the AP bits.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230609162200.2024064-13-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The reference count on page table allocations is increased for every
'counted' PTE (valid or donated) in the table in addition to the initial
reference from ->zalloc_page(). kvm_pgtable_stage2_free_removed() fails
to drop the last reference on the root of the table walk, meaning we
leak memory.
Fix it by dropping the last reference after the free walker returns,
at which point all references for 'counted' PTEs have been released.
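The fix is roughly (sketch of the tail of kvm_pgtable_stage2_free_removed()):
          /*
           * Once the free walker returns, only the initial reference from
           * ->zalloc_page() remains on the root; drop it here.
           */
          WARN_ON(mm_ops->page_count(pgtable) != 1);
          mm_ops->put_page(pgtable);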
Cc: stable@vger.kernel.org
Fixes: 5c359cca1faf ("KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make")
Reported-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230530193213.1663411-1-oliver.upton@linux.dev
|
|
CONFIG_ARM64_BTI_KERNEL compiles the kernel to support ARMv8.5-BTI.
However, the nvhe code doesn't make use of it as it doesn't map any
pages with the Guarded Page (GP) bit.
The KVM pgtable code is modified to map executable pages with the GP
bit set if BTI is enabled for the kernel.
At hyp init, SCTLR_EL2.BT is set to 1 to match EL1 configuration
(SCTLR_EL1.BT1) set in bti_enable().
One difference between the kernel and the nvhe code is that the kernel
maps only .text with GP while nvhe maps all executable pages, so the
nvhe code needs to deal with special initialization code coming from
other executable sections (.idmap.text).
For this we need to add a bti instruction at the beginning of
__kvm_handle_stub_hvc, as it can be called by __host_hvc through a
branch instruction (br), and unlike SYM_FUNC_START, SYM_CODE_START
doesn't add a bti instruction at the beginning; it can't be modified
to add one as it is also used with vector tables.
Another, more intrusive, solution is to convert __kvm_handle_stub_hvc
to a function and inject "bti jc" instead of "bti c" in SYM_FUNC_START.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20230530150845.2856828-1-smostafa@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The preorder callback on the kvm_pgtable_stage2_map() path can replace
a table with a block, then recursively free the detached table. The
higher-level walking logic stashes the old page table entry and
then walks the freed table, invoking the leaf callback and
potentially freeing pgtable pages prematurely.
In normal operation, the call to tear down the detached stage-2
is indirected and uses an RCU callback to trigger the freeing.
RCU is not available to pKVM, which is where this bug is
triggered.
Change the behavior of the walker to reload the page table entry
after invoking the walker callback on preorder traversal, as it
does for leaf entries.
Tested on Pixel 6.
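In __kvm_pgtable_visit(), the change is roughly (sketch):
          ret = kvm_pgtable_visitor_cb(data, &ctx, KVM_PGTABLE_WALK_TABLE_PRE);
          /* The callback may have replaced or freed the table: re-read it. */
          ctx.old = READ_ONCE(*ptep);
          table = kvm_pte_table(ctx.old, level);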
Fixes: 5c359cca1faf ("KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make")
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230522103258.402272-1-tabba@google.com
|
|
Broadcast TLB invalidations (TLBIs) targeting the Inner Shareable
Domain are usually less performant than their non-shareable variant.
In particular, we observed some implementations that take
milliseconds to complete parallel broadcast TLBIs.
It's safe to use non-shareable TLBIs when relaxing permissions on a
PTE in the KVM case. According to the ARM ARM (0487I.a) section
D8.13.1 "Using break-before-make when updating translation table
entries", permission relaxation does not need break-before-make.
Specifically, R_WHZWS states that these are the only changes that
require a break-before-make sequence: changes of memory type
(Shareability or Cacheability), address changes, or changing the block
size.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-13-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Add a new stage2 function, kvm_pgtable_stage2_split(), for splitting a
range of huge pages. This will be used for eager-splitting huge pages
into PAGE_SIZE pages. The goal is to avoid having to split huge pages
on write-protection faults, and instead use this function to do it
ahead of time for large ranges (e.g., all guest memory in 1G chunks at
a time).
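A hypothetical call site, splitting one chunk of a memslot ahead of dirty
tracking (the surrounding variables are assumptions, not this patch's code):
          /* 'cache' is a pre-topped-up kvm_mmu_memory_cache (assumed). */
          ret = kvm_pgtable_stage2_split(pgt, addr, chunk_size, &cache);
          if (ret)
                  break;  /* fall back to splitting on fault */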
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-7-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Add a stage2 helper, kvm_pgtable_stage2_create_unlinked(), for
creating unlinked tables (which is the opposite of
kvm_pgtable_stage2_free_unlinked()). Creating an unlinked table is
useful for splitting level 1 and 2 entries into subtrees of PAGE_SIZE
PTEs. For example, a level 1 entry can be split into PAGE_SIZE PTEs
by first creating a fully populated tree, and then using it to replace
the level 1 entry in a single step. This will be used in a subsequent
commit for eager huge-page splitting (a dirty-logging optimization).
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-4-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Add two flags to kvm_pgtable_visit_ctx, KVM_PGTABLE_WALK_SKIP_BBM_TLBI
and KVM_PGTABLE_WALK_SKIP_CMO, to indicate that the walk should not
perform TLB invalidations (TLBIs) in break-before-make (BBM) nor cache
maintenance operations (CMO). This will be used by a future commit to
create unlinked tables not accessible to the HW page-table walker.
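For example, a later walker that builds a not-yet-visible subtree might be
configured along these lines (sketch; the callback and argument names are
illustrative):
          struct kvm_pgtable_walker walker = {
                  .cb     = stage2_map_walker,
                  .flags  = KVM_PGTABLE_WALK_LEAF |
                            KVM_PGTABLE_WALK_SKIP_BBM_TLBI |  /* no HW walker yet */
                            KVM_PGTABLE_WALK_SKIP_CMO,
                  .arg    = &map_data,
          };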
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-3-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Normalize on referring to tables outside of an active paging structure
as 'unlinked'.
A subsequent change to KVM will add support for building page tables
that are not part of an active paging structure. The existing
'removed_table' terminology is quite clunky when applied in this
context.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20230426172330.1439644-2-ricarkol@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
As we are revamping the way the pgtable walker evaluates some of the
data, make it clear that we rely on some of the fields being constant
across the lifetime of a walk.
For this, flag the start, end and phys fields of the walk data as
'const', which will generate an error if we were to accidentally
update these fields again.
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Similar to the recently fixed stage-2 walker, the hyp map walker
increments the PA and VA of a walk separately. Unlike stage-2, there is
no bug here as the map walker has exclusive access to the stage-1 page
tables.
Nonetheless, in the interest of continuity throughout the page table
code, tweak the hyp map walker to avoid incrementing the PA and instead
use the VA as the authoritative source of how far along a table walk has
gotten. Calculate the PA to use for a leaf PTE by adding the offset of
the VA from the start of the walk to the starting PA.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230421071606.1603916-3-oliver.upton@linux.dev
|
|
Until now, the page table walker counted increments to the PA and IPA
of a walk in two separate places. While the PA is incremented as soon as
a leaf PTE is installed in stage2_map_walker_try_leaf(), the IPA is
actually bumped in the generic table walker context. Critically,
__kvm_pgtable_visit() rereads the PTE after the LEAF callback returns
to work out if a table or leaf was installed, and only bumps the IPA for
a leaf PTE.
This arrangement worked fine when we handled faults behind the write lock,
as the walker had exclusive access to the stage-2 page tables. However,
commit 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel")
started handling all stage-2 faults behind the read lock, opening up a
race where a walker could increment the PA but not the IPA of a walk.
Nothing good ensues, as the walker starts mapping with the incorrect
IPA -> PA relationship.
For example, assume that two vCPUs took a data abort on the same IPA.
One observes that dirty logging is disabled, and the other observes that
it is enabled:
  vCPU attempting PMD mapping              vCPU attempting PTE mapping
  ===========================              ===========================
  /* install PMD */
  stage2_make_pte(ctx, leaf);
  data->phys += granule;
                                           /* replace PMD with a table */
                                           stage2_try_break_pte(ctx, data->mmu);
                                           stage2_make_pte(ctx, table);
  /* table is observed */
  ctx.old = READ_ONCE(*ptep);
  table = kvm_pte_table(ctx.old, level);

  /*
   * map walk continues w/o incrementing
   * IPA.
   */
  __kvm_pgtable_walk(..., level + 1);
Bring an end to the whole mess by using the IPA as the single source of
truth for how far along a walk has gotten. Work out the correct PA to
map by calculating the IPA offset from the beginning of the walk and add
that to the starting physical address.
Cc: stable@vger.kernel.org
Fixes: 1577cb5823ce ("KVM: arm64: Handle stage-2 faults in parallel")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230421071606.1603916-2-oliver.upton@linux.dev
|
|
As it currently stands, KVM makes use of FEAT_HAFDBS unconditionally.
Use of the feature in the rest of the kernel is guarded by an associated
Kconfig option.
Align KVM with the rest of the kernel and only enable VTCR_HA when
ARM64_HW_AFDBM is enabled. This can be helpful for testing changes to
the stage-2 access fault path on Armv8.1+ implementations.
Link: https://lore.kernel.org/r/20221202185156.696189-7-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|