Age | Commit message (Collapse) | Author |
|
Add helpers to prettify code that tests and/or marks whether or not a
register is available and/or dirty.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that indexing into arch.regs is either protected by WARN_ON_ONCE or
done with hardcoded enums, combine all definitions for registers that
are tracked by regs_avail and regs_dirty into 'enum kvm_reg'. Having a
single enum type will simplify additional cleanup related to regs_avail
and regs_dirty.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add WARN_ON_ONCE() checks in kvm_register_{read,write}() to detect reg
values that would cause KVM to overflow vcpu->arch.regs. Change the reg
param to an 'int' to make it clear that the reg index is unverified.
Regarding the overhead of WARN_ON_ONCE(), now that all fixed GPR reads
and writes use dedicated accessors, e.g. kvm_rax_read(), the overhead
is limited to flows where the reg index is generated at runtime. And
there is at least one historical bug where KVM has generated an out-of-
bounds access to arch.regs (see commit b68f3cc7d9789, "KVM: x86: Always
use 32-bit SMRAM save state for 32-bit kernels").
Adding the WARN_ON_ONCE() protection paves the way for additional
cleanup related to kvm_reg and kvm_reg_ex.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Rework vmx_set_rflags() to avoid the extra code need to handle emulation
of real mode and invalid state when unrestricted guest is disabled. The
primary reason for doing so is to avoid the call to vmx_get_rflags(),
which will incur a VMREAD when RFLAGS is not already available. When
running nested VMs, the majority of calls to vmx_set_rflags() will occur
without an associated vmx_get_rflags(), i.e. when stuffing GUEST_RFLAGS
during transitions between vmcs01 and vmcs02.
Note, vmx_get_rflags() guarantees RFLAGS is marked available.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Replace "else" with early "return" in the unrestricted guest branch. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Capture struct vcpu_vmx in a local variable to improve the readability
of vmx_{g,s}et_rflags().
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Skip the VMWRITE to update GUEST_CR3 if CR3 is not available, i.e. has
not been read from the VMCS since the last VM-Enter. If vcpu->arch.cr3
is stale, kvm_read_cr3(vcpu) will refresh vcpu->arch.cr3 from the VMCS,
meaning KVM will do a VMREAD and then VMWRITE the value it just pulled
from the VMCS.
Note, this is a purely theoretical change, no instances of skipping
the VMREAD+VMWRITE have been observed with this change.
Tested-by: Reto Buerki <reet@codelabs.ch>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Performing a WBINVD and DF_FLUSH are expensive operations. Currently, a
WBINVD/DF_FLUSH is performed every time an SEV guest terminates. However,
the WBINVD/DF_FLUSH is only required when an ASID is being re-allocated
to a new SEV guest. Also, a single WBINVD/DF_FLUSH can enable all ASIDs
that have been disassociated from guests through DEACTIVATE.
To reduce the number of WBINVD/DF_FLUSH invocations, introduce a new ASID
bitmap to track ASIDs that need to be reclaimed. When an SEV guest is
terminated, add its ASID to the reclaim bitmap instead of clearing the
bitmap in the existing SEV ASID bitmap. This delays the need to perform a
WBINVD/DF_FLUSH invocation when an SEV guest terminates until all of the
available SEV ASIDs have been used. At that point, the WBINVD/DF_FLUSH
invocation can be performed and all ASIDs in the reclaim bitmap moved to
the available ASIDs bitmap.
The semaphore around DEACTIVATE can be changed to a read semaphore with
the semaphore taken in write mode before performing the WBINVD/DF_FLUSH.
Tested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Performing a WBINVD and DF_FLUSH are expensive operations. The SEV support
currently performs this WBINVD/DF_FLUSH combination when an SEV guest is
terminated, so there is no need for it to be done before LAUNCH.
However, when the SEV firmware transitions the platform from UNINIT state
to INIT state, all ASIDs will be marked invalid across all threads.
Therefore, as part of transitioning the platform to INIT state, perform a
WBINVD/DF_FLUSH after a successful INIT in the PSP/SEV device driver.
Since the PSP/SEV device driver is x86 only, it can reference and use the
WBINVD related functions directly.
Cc: Gary Hook <gary.hook@amd.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Tested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
instead of deferring the VMWRITE until vmx_set_cr3(). If the VMWRITE
is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
VM-Exit occurs without actually entering L2, e.g. if the nested run
is squashed because nested VM-Enter (from L1) is putting L2 into HLT.
Note, the above scenario can occur regardless of whether L1 is
intercepting HLT, e.g. L1 can intercept HLT and then re-enter L2 with
vmcs.GUEST_ACTIVITY_STATE=HALTED. But practically speaking, a VMM will
likely put a guest into HALTED if and only if it's not intercepting HLT.
In an ideal world where EPT *requires* unrestricted guest (and vice
versa), VMX could handle CR3 similar to how it handles RSP and RIP,
e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run(). But
the unrestricted guest silliness complicates the dirty tracking logic
to the point that explicitly handling vmcs02.GUEST_CR3 during nested
VM-Enter is a simpler overall implementation.
Cc: stable@vger.kernel.org
Reported-and-tested-by: Reto Buerki <reet@codelabs.ch>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The SEV firmware DEACTIVATE command disassociates an SEV guest from an
ASID, clears the WBINVD indicator on all threads and indicates that the
SEV firmware DF_FLUSH command must be issued before the ASID can be
re-used. The SEV firmware DF_FLUSH command will return an error if a
WBINVD has not been performed on every thread before it has been invoked.
A window exists between the WBINVD and the invocation of the DF_FLUSH
command where an SEV firmware DEACTIVATE command could be invoked on
another thread, clearing the WBINVD indicator. This will cause the
subsequent SEV firmware DF_FLUSH command to fail which, in turn, results
in the SEV firmware ACTIVATE command failing for the reclaimed ASID.
This results in the SEV guest failing to start.
Use a mutex to close the WBINVD/DF_FLUSH window by obtaining the mutex
before the DEACTIVATE and releasing it after the DF_FLUSH. This ensures
that any DEACTIVATE cannot run before a DF_FLUSH has completed.
Fixes: 59414c989220 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_START command")
Tested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The SEV ASID bitmap currently is not protected against parallel SEV guest
startups. This can result in an SEV guest failing to start because another
SEV guest could have been assigned the same ASID value. Use a mutex to
serialize access to the SEV ASID bitmap.
Fixes: 1654efcbc431 ("KVM: SVM: Add KVM_SEV_INIT command")
Tested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
PPC KVM fix for 5.4
- Fix a bug in the XIVE code which can cause a host crash.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm fixes for 5.4, take #2
Special PMU edition:
- Fix cycle counter truncation
- Fix cycle counter overflow limit on pure 64bit system
- Allow chained events to be actually functional
- Correct sample period after overflow
|
|
After resetting the vCPU, the kvmclock MSR keeps the previous value but it is
not enabled. This can be confusing, so fix it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use BUG_ON instead of a if condition followed by BUG.
Generated by: scripts/coccinelle/misc/bugon.cocci
Fixes: 4b526de50e39 ("KVM: x86: Check kvm_rebooting in kvm_spurious_fault()")
CC: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Commit bf653b78f960 ("KVM: vmx: Introduce handle_unexpected_vmexit
and handle WAITPKG vmexit") introduced specialized handling of
specific exit-reasons that should not be raised by CPU because
KVM configures VMCS such that they should never be raised.
However, since commit 7396d337cfad ("KVM: x86: Return to userspace
with internal error on unexpected exit reason"), VMX & SVM
exit handlers were modified to generically handle all unexpected
exit-reasons by returning to userspace with internal error.
Therefore, there is no need for specialized handling of specific
unexpected exit-reasons (This specialized handling also introduced
inconsistency for these exit-reasons to silently skip guest instruction
instead of return to userspace on internal-error).
Fixes: bf653b78f960 ("KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit")
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Commit 204c91eff798a ("KVM: selftests: do not blindly clobber registers in
guest asm") was intended to make test more gcc-proof, however, the result
is exactly the opposite: on newer gccs (e.g. 8.2.1) the test breaks with
==== Test Assertion Failure ====
x86_64/sync_regs_test.c:168: run->s.regs.regs.rbx == 0xBAD1DEA + 1
pid=14170 tid=14170 - Invalid argument
1 0x00000000004015b3: main at sync_regs_test.c:166 (discriminator 6)
2 0x00007f413fb66412: ?? ??:0
3 0x000000000040191d: _start at ??:?
rbx sync regs value incorrect 0x1.
Apparently, compile is still free to play games with registers even
when they have variables attached.
Re-write guest code with 'asm volatile' by embedding ucall there and
making sure rbx is preserved.
Fixes: 204c91eff798a ("KVM: selftests: do not blindly clobber registers in guest asm")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
vmx_dirty_log_test fails on AMD and this is no surprise as it is VMX
specific. Bail early when nested VMX is unsupported.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
vmx_* tests require VMX and three of them implement the same check. Move it
to vmx library.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
vmx_set_nested_state_test() checks if VMX is supported twice: in the very
beginning (and skips the whole test if it's not) and before doing
test_vmx_nested_state(). One should be enough.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Don't waste cycles to shrink/grow vCPU halt_poll_ns if host
side polling is disabled.
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Because "Untracked files:" are annoying.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When the RDPID instruction is supported on the host, enumerate it in
KVM_GET_SUPPORTED_CPUID.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"Here is a bunch of pin control fixes. I was lagging behind on this
one, some fixes should have come in earlier, sorry about that.
Anyways here it is, pretty straight-forward fixes, the Strago fix
stand out as something serious affecting a lot of machines.
Summary:
- Handle multiple instances of Intel chips without complaining.
- Restore the Intel Strago DMI workaround
- Make the Armada 37xx handle pins over 32
- Fix the polarity of the LED group on Armada 37xx
- Fix an off-by-one bug in the NS2 driver
- Fix error path for iproc's platform_get_irq()
- Fix error path on the STMFX driver
- Fix a typo in the Berlin AS370 driver
- Fix up misc errors in the Aspeed 2600 BMC support
- Fix a stray SPDX tag"
* tag 'pinctrl-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: aspeed-g6: Rename SD3 to EMMC and rework pin groups
pinctrl: aspeed-g6: Fix UART13 group pinmux
pinctrl: aspeed-g6: Make SIG_DESC_CLEAR() behave intuitively
pinctrl: aspeed-g6: Fix I3C3/I3C4 pinmux configuration
pinctrl: aspeed-g6: Fix I2C14 SDA description
pinctrl: aspeed-g6: Sort pins for sanity
dt-bindings: pinctrl: aspeed-g6: Rework SD3 function and groups
pinctrl: berlin: as370: fix a typo s/spififib/spdifib
pinctrl: armada-37xx: swap polarity on LED group
pinctrl: stmfx: fix null pointer on remove
pinctrl: iproc: allow for error from platform_get_irq()
pinctrl: ns2: Fix off by one bugs in ns2_pinmux_enable()
pinctrl: bcm-iproc: Use SPDX header
pinctrl: armada-37xx: fix control of pins 32 and up
pinctrl: cherryview: restore Strago DMI workaround for all versions
pinctrl: intel: Allocate IRQ chip dynamic
|
|
AIL=2 mode has no known users, so is not well tested or supported.
Disallow guests from selecting this mode because it may become
deprecated in future versions of the architecture.
This policy decision is not left to QEMU because KVM support is
required for AIL=2 (when injecting interrupts).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
kvmppc_inject_interrupt does not implement LPCR[AIL]!=0 modes, which
can result in the guest receiving interrupts as if LPCR[AIL]=0
contrary to the ISA.
In practice, Linux guests cope with this deviation, but it should be
fixed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
This consolidates the HV interrupt delivery logic into one place.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
reset_msr sets the MSR for interrupt injection, but it's cleaner and
more flexible to provide a single op to set both MSR and PC for the
interrupt.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Add a new attribute to both XIVE and XICS-on-XIVE KVM devices so that
userspace can tell how many interrupt servers it needs. If a VM needs
less than the current default of KVM_MAX_VCPUS (2048), we can allocate
less VPs in OPAL. Combined with a core stride (VSMT) that matches the
number of guest threads per core, this may substantially increases the
number of VMs that can run concurrently with an in-kernel XIVE device.
Since the legacy XIVE KVM device is exposed to userspace through the
XICS KVM API, a new attribute group is added to it for this purpose.
While here, fix the syntax of the existing KVM_DEV_XICS_GRP_SOURCES
in the XICS documentation.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
The XIVE VP is an internal structure which allow the XIVE interrupt
controller to maintain the interrupt context state of vCPUs non
dispatched on HW threads.
When a guest is started, the XIVE KVM device allocates a block of
XIVE VPs in OPAL, enough to accommodate the highest possible vCPU
id KVM_MAX_VCPU_ID (16384) packed down to KVM_MAX_VCPUS (2048).
With a guest's core stride of 8 and a threading mode of 1 (QEMU's
default), a VM must run at least 256 vCPUs to actually need such a
range of VPs.
A POWER9 system has a limited XIVE VP space : 512k and KVM is
currently wasting this HW resource with large VP allocations,
especially since a typical VM likely runs with a lot less vCPUs.
Make the size of the VP block configurable. Add an nr_servers
field to the XIVE structure and a function to set it for this
purpose.
Split VP allocation out of the device create function. Since the
VP block isn't used before the first vCPU connects to the XIVE KVM
device, allocation is now performed by kvmppc_xive_connect_vcpu().
This gives the opportunity to set nr_servers in between:
kvmppc_xive_create() / kvmppc_xive_native_create()
.
.
kvmppc_xive_set_nr_servers()
.
.
kvmppc_xive_connect_vcpu() / kvmppc_xive_native_connect_vcpu()
The connect_vcpu() functions check that the vCPU id is below nr_servers
and if it is the first vCPU they allocate the VP block. This is protected
against a concurrent update of nr_servers by kvmppc_xive_set_nr_servers()
with the xive->lock mutex.
Also, the block is allocated once for the device lifetime: nr_servers
should stay constant otherwise connect_vcpu() could generate a boggus
VP id and likely crash OPAL. It is thus forbidden to update nr_servers
once the block is allocated.
If the VP allocation fail, return ENOSPC which seems more appropriate to
report the depletion of system wide HW resource than ENOMEM or ENXIO.
A VM using a stride of 8 and 1 thread per core with 32 vCPUs would hence
only need 256 VPs instead of 2048. If the stride is set to match the number
of threads per core, this goes further down to 32.
This will be exposed to userspace by a subsequent patch.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Reduce code duplication by consolidating the checking of vCPU ids and VP
ids to a common helper used by both legacy and native XIVE KVM devices.
And explain the magic with a comment.
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Print out the VP id of each connected vCPU, this allow to see:
- the VP block base in which OPAL encodes information that may be
useful when debugging
- the packed vCPU id which may differ from the raw vCPU id if the
latter is >= KVM_MAX_VCPUS (2048)
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
If we cannot allocate the XIVE VPs in OPAL, the creation of a XIVE or
XICS-on-XIVE device is aborted as expected, but we leave kvm->arch.xive
set forever since the release method isn't called in this case. Any
subsequent tentative to create a XIVE or XICS-on-XIVE for this VM will
thus always fail (DoS). This is a problem for QEMU since it destroys
and re-creates these devices when the VM is reset: the VM would be
restricted to using the much slower emulated XIVE or XICS forever.
As an alternative to adding rollback, do not assign kvm->arch.xive before
making sure the XIVE VPs are allocated in OPAL.
Cc: stable@vger.kernel.org # v5.2
Fixes: 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method")
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Given that in kvm_create_vm() there is:
kvm->mm = current->mm;
And that on every kvm_*_ioctl we have:
if (kvm->mm != current->mm)
return -EIO;
I see no reason to keep using current->mm instead of kvm->mm.
By doing so, we would reduce the use of 'global' variables on code, relying
more in the contents of kvm struct.
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Reduces the number of calls to get_current() in order to get the value of
current->mm by doing it once and storing the value, since it is not
supposed to change inside the same process).
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
When calling the KVM_SET_GUEST_DEBUG ioctl, userspace might request
the next instruction to be single stepped via the
KVM_GUESTDBG_SINGLESTEP control bit of the kvm_guest_debug structure.
This patch adds the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability in order
to inform userspace about the state of single stepping support.
We currently don't have support for guest single stepping implemented
in Book3S HV so the capability is only present for Book3S PR and
BookE.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild fixes from Masahiro Yamada:
- fix a bashism of setlocalversion
- do not use the too new --sort option of tar
* tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kheaders: substituting --sort in archive creation
scripts: setlocalversion: fix a bashism
kbuild: update comment about KBUILD_ALLDIRS
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A small set of x86 fixes:
- Prevent a NULL pointer dereference in the X2APIC code in case of a
CPU hotplug failure.
- Prevent boot failures on HP superdome machines by invalidating the
level2 kernel pagetable entries outside of the kernel area as
invalid so BIOS reserved space won't be touched unintentionally.
Also ensure that memory holes are rounded up to the next PMD
boundary correctly.
- Enable X2APIC support on Hyper-V to prevent boot failures.
- Set the paravirt name when running on Hyper-V for consistency
- Move a function under the appropriate ifdef guard to prevent build
warnings"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/acpi: Move get_cmdline_acpi_rsdp() under #ifdef guard
x86/hyperv: Set pv_info.name to "Hyper-V"
x86/apic/x2apic: Fix a NULL pointer deref when handling a dying cpu
x86/hyperv: Make vapic support x2apic mode
x86/boot/64: Round memory hole size up to next PMD page
x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A small set of irq chip driver fixes and updates:
- Update the SIFIVE PLIC interrupt driver to use the fasteoi handler
to address the shortcomings of the existing flow handling which was
prone to lose interrupts
- Use the proper limit for GIC interrupt line numbers
- Add retrigger support for the recently merged Anapurna Labs Fabric
interrupt controller to make it complete
- Enable the ATMEL AIC5 interrupt controller driver on the new
SAM9X60 SoC"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/sifive-plic: Switch to fasteoi flow
irqchip/gic-v3: Fix GIC_LINE_NR accessor
irqchip/atmel-aic5: Add support for sam9x60 irqchip
irqchip/al-fic: Add support for irq retrigger
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull hrtimer fixlet from Thomas Gleixner:
"A single commit annotating the lockcless access to timer->base with
READ_ONCE() and adding the WRITE_ONCE() counterparts for completeness"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimer: Annotate lockless access to timer->base
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stop-machine fix from Thomas Gleixner:
"A single fix, amending stop machine with WRITE/READ_ONCE() to address
the fallout of KCSAN"
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
stop_machine: Avoid potential race behaviour
|
|
The PMU emulation code uses the perf event sample period to trigger
the overflow detection. This works fine for the *first* overflow
handling, but results in a huge number of interrupts on the host,
unrelated to the number of interrupts handled in the guest (a x20
factor is pretty common for the cycle counter). On a slow system
(such as a SW model), this can result in the guest only making
forward progress at a glacial pace.
It turns out that the clue is in the name. The sample period is
exactly that: a period. And once the an overflow has occured,
the following period should be the full width of the associated
counter, instead of whatever the guest had initially programed.
Reset the sample period to the architected value in the overflow
handler, which now results in a number of host interrupts that is
much closer to the number of interrupts in the guest.
Fixes: b02386eb7dac ("arm64: KVM: Add PMU overflow interrupt routing")
Reviewed-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The current convention for KVM to request a chained event from the
host PMU is to set bit[0] in attr.config1 (PERF_ATTR_CFG1_KVM_PMU_CHAINED).
But as it turns out, this bit gets set *after* we create the kernel
event that backs our virtual counter, meaning that we never get
a 64bit counter.
Moving the setting to an earlier point solves the problem.
Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
Reviewed-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Of PMCR_EL0.LC, the ARMv8 ARM says:
"In an AArch64 only implementation, this field is RES 1."
So be it.
Fixes: ab9468340d2bc ("arm64: KVM: Add access handler for PMCR register")
Reviewed-by: Andrew Murray <andrew.murray@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
When a counter is disabled, its value is sampled before the event
is being disabled, and the value written back in the shadow register.
In that process, the value gets truncated to 32bit, which is adequate
for any counter but the cycle counter (defined as a 64bit counter).
This obviously results in a corrupted counter, and things like
"perf record -e cycles" not working at all when run in a guest...
A similar, but less critical bug exists in kvm_pmu_get_counter_value.
Make the truncation conditional on the counter not being the cycle
counter, which results in a minor code reorganisation.
Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
Reviewed-by: Andrew Murray <andrew.murray@arm.com>
Reported-by: Julien Thierry <julien.thierry.kdev@gmail.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Pull networking fixes from David Miller:
"I was battling a cold after some recent trips, so quite a bit piled up
meanwhile, sorry about that.
Highlights:
1) Fix fd leak in various bpf selftests, from Brian Vazquez.
2) Fix crash in xsk when device doesn't support some methods, from
Magnus Karlsson.
3) Fix various leaks and use-after-free in rxrpc, from David Howells.
4) Fix several SKB leaks due to confusion of who owns an SKB and who
should release it in the llc code. From Eric Biggers.
5) Kill a bunc of KCSAN warnings in TCP, from Eric Dumazet.
6) Jumbo packets don't work after resume on r8169, as the BIOS resets
the chip into non-jumbo mode during suspend. From Heiner Kallweit.
7) Corrupt L2 header during MPLS push, from Davide Caratti.
8) Prevent possible infinite loop in tc_ctl_action, from Eric
Dumazet.
9) Get register bits right in bcmgenet driver, based upon chip
version. From Florian Fainelli.
10) Fix mutex problems in microchip DSA driver, from Marek Vasut.
11) Cure race between route lookup and invalidation in ipv4, from Wei
Wang.
12) Fix performance regression due to false sharing in 'net'
structure, from Eric Dumazet"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (145 commits)
net: reorder 'struct net' fields to avoid false sharing
net: dsa: fix switch tree list
net: ethernet: dwmac-sun8i: show message only when switching to promisc
net: aquantia: add an error handling in aq_nic_set_multicast_list
net: netem: correct the parent's backlog when corrupted packet was dropped
net: netem: fix error path for corrupted GSO frames
macb: propagate errors when getting optional clocks
xen/netback: fix error path of xenvif_connect_data()
net: hns3: fix mis-counting IRQ vector numbers issue
net: usb: lan78xx: Connect PHY before registering MAC
vsock/virtio: discard packets if credit is not respected
vsock/virtio: send a credit update when buffer size is changed
mlxsw: spectrum_trap: Push Ethernet header before reporting trap
net: ensure correct skb->tstamp in various fragmenters
net: bcmgenet: reset 40nm EPHY on energy detect
net: bcmgenet: soft reset 40nm EPHYs before MAC init
net: phy: bcm7xxx: define soft_reset for 40nm EPHY
net: bcmgenet: don't set phydev->link from MAC
net: Update address for MediaTek ethernet driver in MAINTAINERS
ipv4: fix race condition between route lookup and invalidation
...
|
|
Intel test robot reported a ~7% regression on TCP_CRR tests
that they bisected to the cited commit.
Indeed, every time a new TCP socket is created or deleted,
the atomic counter net->count is touched (via get_net(net)
and put_net(net) calls)
So cpus might have to reload a contended cache line in
net_hash_mix(net) calls.
We need to reorder 'struct net' fields to move @hash_mix
in a read mostly cache line.
We move in the first cache line fields that can be
dirtied often.
We probably will have to address in a followup patch
the __randomize_layout that was added in linux-4.13,
since this might break our placement choices.
Fixes: 355b98553789 ("netns: provide pure entropy for net_hash_mix()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If there are multiple switch trees on the device, only the last one
will be listed, because the arguments of list_add_tail are swapped.
Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation")
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|