Age | Commit message (Collapse) | Author |
|
commit e3a3bbe3e99de73043a1d32d36cf4d211dc58c7e upstream.
A PCMD (Paging Crypto MetaData) page contains the PCMD
structures of enclave pages that have been encrypted and
moved to the shmem backing store. When all enclave pages
sharing a PCMD page are loaded in the enclave, there is no
need for the PCMD page and it can be truncated from the
backing store.
A few issues appeared around the truncation of PCMD pages. The
known issues have been addressed but the PCMD handling code could
be made more robust by loudly complaining if any new issue appears
in this area.
Add a check that will complain with a warning if the PCMD page is not
actually empty after it has been truncated. There should never be data
in the PCMD page at this point since it is was just checked to be empty
and truncated with enclave mutex held and is updated with the
enclave mutex held.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Link: https://lkml.kernel.org/r/6495120fed43fafc1496d09dd23df922b9a32709.1652389823.git.reinette.chatre@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit af117837ceb9a78e995804ade4726ad2c2c8981f upstream.
Haitao reported encountering a WARN triggered by the ENCLS[ELDU]
instruction faulting with a #GP.
The WARN is encountered when the reclaimer evicts a range of
pages from the enclave when the same pages are faulted back right away.
Consider two enclave pages (ENCLAVE_A and ENCLAVE_B)
sharing a PCMD page (PCMD_AB). ENCLAVE_A is in the
enclave memory and ENCLAVE_B is in the backing store. PCMD_AB contains
just one entry, that of ENCLAVE_B.
Scenario proceeds where ENCLAVE_A is being evicted from the enclave
while ENCLAVE_B is faulted in.
sgx_reclaim_pages() {
...
/*
* Reclaim ENCLAVE_A
*/
mutex_lock(&encl->lock);
/*
* Get a reference to ENCLAVE_A's
* shmem page where enclave page
* encrypted data will be stored
* as well as a reference to the
* enclave page's PCMD data page,
* PCMD_AB.
* Release mutex before writing
* any data to the shmem pages.
*/
sgx_encl_get_backing(...);
encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
mutex_unlock(&encl->lock);
/*
* Fault ENCLAVE_B
*/
sgx_vma_fault() {
mutex_lock(&encl->lock);
/*
* Get reference to
* ENCLAVE_B's shmem page
* as well as PCMD_AB.
*/
sgx_encl_get_backing(...)
/*
* Load page back into
* enclave via ELDU.
*/
/*
* Release reference to
* ENCLAVE_B' shmem page and
* PCMD_AB.
*/
sgx_encl_put_backing(...);
/*
* PCMD_AB is found empty so
* it and ENCLAVE_B's shmem page
* are truncated.
*/
/* Truncate ENCLAVE_B backing page */
sgx_encl_truncate_backing_page();
/* Truncate PCMD_AB */
sgx_encl_truncate_backing_page();
mutex_unlock(&encl->lock);
...
}
mutex_lock(&encl->lock);
encl_page->desc &=
~SGX_ENCL_PAGE_BEING_RECLAIMED;
/*
* Write encrypted contents of
* ENCLAVE_A to ENCLAVE_A shmem
* page and its PCMD data to
* PCMD_AB.
*/
sgx_encl_put_backing(...)
/*
* Reference to PCMD_AB is
* dropped and it is truncated.
* ENCLAVE_A's PCMD data is lost.
*/
mutex_unlock(&encl->lock);
}
What happens next depends on whether it is ENCLAVE_A being faulted
in or ENCLAVE_B being evicted - but both end up with ENCLS[ELDU] faulting
with a #GP.
If ENCLAVE_A is faulted then at the time sgx_encl_get_backing() is called
a new PCMD page is allocated and providing the empty PCMD data for
ENCLAVE_A would cause ENCLS[ELDU] to #GP
If ENCLAVE_B is evicted first then a new PCMD_AB would be allocated by the
reclaimer but later when ENCLAVE_A is faulted the ENCLS[ELDU] instruction
would #GP during its checks of the PCMD value and the WARN would be
encountered.
Noting that the reclaimer sets SGX_ENCL_PAGE_BEING_RECLAIMED at the time
it obtains a reference to the backing store pages of an enclave page it
is in the process of reclaiming, fix the race by only truncating the PCMD
page after ensuring that no page sharing the PCMD page is in the process
of being reclaimed.
Cc: stable@vger.kernel.org
Fixes: 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page")
Reported-by: Haitao Huang <haitao.huang@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Link: https://lkml.kernel.org/r/ed20a5db516aa813873268e125680041ae11dfcf.1652389823.git.reinette.chatre@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0e4e729a830c1e7f31d3b3fbf8feb355a402b117 upstream.
Haitao reported encountering a WARN triggered by the ENCLS[ELDU]
instruction faulting with a #GP.
The WARN is encountered when the reclaimer evicts a range of
pages from the enclave when the same pages are faulted back
right away.
The SGX backing storage is accessed on two paths: when there
are insufficient free pages in the EPC the reclaimer works
to move enclave pages to the backing storage and as enclaves
access pages that have been moved to the backing storage
they are retrieved from there as part of page fault handling.
An oversubscribed SGX system will often run the reclaimer and
page fault handler concurrently and needs to ensure that the
backing store is accessed safely between the reclaimer and
the page fault handler. This is not the case because the
reclaimer accesses the backing store without the enclave mutex
while the page fault handler accesses the backing store with
the enclave mutex.
Consider the scenario where a page is faulted while a page sharing
a PCMD page with the faulted page is being reclaimed. The
consequence is a race between the reclaimer and page fault
handler, the reclaimer attempting to access a PCMD at the
same time it is truncated by the page fault handler. This
could result in lost PCMD data. Data may still be
lost if the reclaimer wins the race, this is addressed in
the following patch.
The reclaimer accesses pages from the backing storage without
holding the enclave mutex and runs the risk of concurrently
accessing the backing storage with the page fault handler that
does access the backing storage with the enclave mutex held.
In the scenario below a PCMD page is truncated from the backing
store after all its pages have been loaded in to the enclave
at the same time the PCMD page is loaded from the backing store
when one of its pages are reclaimed:
sgx_reclaim_pages() { sgx_vma_fault() {
...
mutex_lock(&encl->lock);
...
__sgx_encl_eldu() {
...
if (pcmd_page_empty) {
/*
* EPC page being reclaimed /*
* shares a PCMD page with an * PCMD page truncated
* enclave page that is being * while requested from
* faulted in. * reclaimer.
*/ */
sgx_encl_get_backing() <----------> sgx_encl_truncate_backing_page()
}
mutex_unlock(&encl->lock);
} }
In this scenario there is a race between the reclaimer and the page fault
handler when the reclaimer attempts to get access to the same PCMD page
that is being truncated. This could result in the reclaimer writing to
the PCMD page that is then truncated, causing the PCMD data to be lost,
or in a new PCMD page being allocated. The lost PCMD data may still occur
after protecting the backing store access with the mutex - this is fixed
in the next patch. By ensuring the backing store is accessed with the mutex
held the enclave page state can be made accurate with the
SGX_ENCL_PAGE_BEING_RECLAIMED flag accurately reflecting that a page
is in the process of being reclaimed.
Consistently protect the reclaimer's backing store access with the
enclave's mutex to ensure that it can safely run concurrently with the
page fault handler.
Cc: stable@vger.kernel.org
Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
Reported-by: Haitao Huang <haitao.huang@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Link: https://lkml.kernel.org/r/fa2e04c561a8555bfe1f4e7adc37d60efc77387b.1652389823.git.reinette.chatre@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2154e1c11b7080aa19f47160bd26b6f39bbd7824 upstream.
Recent commit 08999b2489b4 ("x86/sgx: Free backing memory
after faulting the enclave page") expanded __sgx_encl_eldu()
to clear an enclave page's PCMD (Paging Crypto MetaData)
from the PCMD page in the backing store after the enclave
page is restored to the enclave.
Since the PCMD page in the backing store is modified the page
should be marked as dirty to ensure the modified data is retained.
Cc: stable@vger.kernel.org
Fixes: 08999b2489b4 ("x86/sgx: Free backing memory after faulting the enclave page")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Link: https://lkml.kernel.org/r/00cd2ac480db01058d112e347b32599c1a806bc4.1652389823.git.reinette.chatre@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6bd429643cc265e94a9d19839c771bcc5d008fa8 upstream.
SGX uses shmem backing storage to store encrypted enclave pages
and their crypto metadata when enclave pages are moved out of
enclave memory. Two shmem backing storage pages are associated with
each enclave page - one backing page to contain the encrypted
enclave page data and one backing page (shared by a few
enclave pages) to contain the crypto metadata used by the
processor to verify the enclave page when it is loaded back into
the enclave.
sgx_encl_put_backing() is used to release references to the
backing storage and, optionally, mark both backing store pages
as dirty.
Managing references and dirty status together in this way results
in both backing store pages marked as dirty, even if only one of
the backing store pages are changed.
Additionally, waiting until the page reference is dropped to set
the page dirty risks a race with the page fault handler that
may load outdated data into the enclave when a page is faulted
right after it is reclaimed.
Consider what happens if the reclaimer writes a page to the backing
store and the page is immediately faulted back, before the reclaimer
is able to set the dirty bit of the page:
sgx_reclaim_pages() { sgx_vma_fault() {
...
sgx_encl_get_backing();
... ...
sgx_reclaimer_write() {
mutex_lock(&encl->lock);
/* Write data to backing store */
mutex_unlock(&encl->lock);
}
mutex_lock(&encl->lock);
__sgx_encl_eldu() {
...
/*
* Enclave backing store
* page not released
* nor marked dirty -
* contents may not be
* up to date.
*/
sgx_encl_get_backing();
...
/*
* Enclave data restored
* from backing store
* and PCMD pages that
* are not up to date.
* ENCLS[ELDU] faults
* because of MAC or PCMD
* checking failure.
*/
sgx_encl_put_backing();
}
...
/* set page dirty */
sgx_encl_put_backing();
...
mutex_unlock(&encl->lock);
} }
Remove the option to sgx_encl_put_backing() to set the backing
pages as dirty and set the needed pages as dirty right after
receiving important data while enclave mutex is held. This ensures that
the page fault handler can get up to date data from a page and prepares
the code for a following change where only one of the backing pages
need to be marked as dirty.
Cc: stable@vger.kernel.org
Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Link: https://lore.kernel.org/linux-sgx/8922e48f-6646-c7cc-6393-7c78dcf23d23@intel.com/
Link: https://lkml.kernel.org/r/fa9f98986923f43e72ef4c6702a50b2a0b3c42e3.1652389823.git.reinette.chatre@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3f5e3d3a8b895c8a11da8b0063ba2022dd9e2045 upstream.
Correct the name of the bluetooth interrupt from host-wake to
host-wakeup.
Fixes: 1c65b6184441b ("ARM: dts: s5pv210: Correct BCM4329 bluetooth node")
Cc: <stable@vger.kernel.org>
Signed-off-by: Jonathan Bakker <xc-racer2@live.ca>
Link: https://lore.kernel.org/r/CY4PR04MB0567495CFCBDC8D408D44199CB1C9@CY4PR04MB0567.namprd04.prod.outlook.com
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d22d2474e3953996f03528b84b7f52cc26a39403 upstream.
For some sev ioctl interfaces, the length parameter that is passed maybe
less than or equal to SEV_FW_BLOB_MAX_SIZE, but larger than the data
that PSP firmware returns. In this case, kmalloc will allocate memory
that is the size of the input rather than the size of the data.
Since PSP firmware doesn't fully overwrite the allocated buffer, these
sev ioctl interface may return uninitialized kernel slab memory.
Reported-by: Andy Nguyen <theflow@google.com>
Suggested-by: David Rientjes <rientjes@google.com>
Suggested-by: Peter Gonda <pgonda@google.com>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Fixes: eaf78265a4ab3 ("KVM: SVM: Move SEV code to separate file")
Fixes: 2c07ded06427d ("KVM: SVM: add support for SEV attestation command")
Fixes: 4cfdd47d6d95a ("KVM: SVM: Add KVM_SEV SEND_START command")
Fixes: d3d1af85e2c75 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
Fixes: eba04b20e4861 ("KVM: x86: Account a variety of miscellaneous allocations")
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Message-Id: <20220516154310.3685678-1-Ashish.Kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
is required
commit 8d5678a76689acbf91245a3791fe853ab773090f upstream.
Before Commit c3e5e415bc1e6 ("KVM: X86: Change kvm_sync_page()
to return true when remote flush is needed"), the return value
of kvm_sync_page() indicates whether the page is synced, and
kvm_mmu_get_page() would rebuild page when the sync fails.
But now, kvm_sync_page() returns false when the page is
synced and no tlb flushing is required, which leads to
rebuild page in kvm_mmu_get_page(). So return the return
value of mmu->sync_page() directly and check it in
kvm_mmu_get_page(). If the sync fails, the page will be
zapped and the invalid_list is not empty, so set flush as
true is accepted in mmu_sync_children().
Cc: stable@vger.kernel.org
Fixes: c3e5e415bc1e6 ("KVM: X86: Change kvm_sync_page() to return true when remote flush is needed")
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
Message-Id: <0dabeeb789f57b0d793f85d073893063e692032d.1647336064.git.houwenlong.hwl@antgroup.com>
[mmu_sync_children should not flush if the page is zapped. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 45846661d10422ce9e22da21f8277540b29eca22 upstream.
Remove WARNs that sanity check that KVM never lets a triple fault for L2
escape and incorrectly end up in L1. In normal operation, the sanity
check is perfectly valid, but it incorrectly assumes that it's impossible
for userspace to induce KVM_REQ_TRIPLE_FAULT without bouncing through
KVM_RUN (which guarantees kvm_check_nested_state() will see and handle
the triple fault).
The WARN can currently be triggered if userspace injects a machine check
while L2 is active and CR4.MCE=0. And a future fix to allow save/restore
of KVM_REQ_TRIPLE_FAULT, e.g. so that a synthesized triple fault isn't
lost on migration, will make it trivially easy for userspace to trigger
the WARN.
Clearing KVM_REQ_TRIPLE_FAULT when forcibly leaving guest mode is
tempting, but wrong, especially if/when the request is saved/restored,
e.g. if userspace restores events (including a triple fault) and then
restores nested state (which may forcibly leave guest mode). Ignoring
the fact that KVM doesn't currently provide the necessary APIs, it's
userspace's responsibility to manage pending events during save/restore.
------------[ cut here ]------------
WARNING: CPU: 7 PID: 1399 at arch/x86/kvm/vmx/nested.c:4522 nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
Modules linked in: kvm_intel kvm irqbypass
CPU: 7 PID: 1399 Comm: state_test Not tainted 5.17.0-rc3+ #808
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
Call Trace:
<TASK>
vmx_leave_nested+0x30/0x40 [kvm_intel]
vmx_set_nested_state+0xca/0x3e0 [kvm_intel]
kvm_arch_vcpu_ioctl+0xf49/0x13e0 [kvm]
kvm_vcpu_ioctl+0x4b9/0x660 [kvm]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
---[ end trace 0000000000000000 ]---
Fixes: cb6a32c2b877 ("KVM: x86: Handle triple fault in L2 without killing L1")
Cc: stable@vger.kernel.org
Cc: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220407002315.78092-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ffd1925a596ce68bed7d81c61cb64bc35f788a9d upstream.
When kernel handles the vm-exit caused by external interrupts and NMI,
it always sets kvm_intr_type to tell if it's dealing an IRQ or NMI. For
the PMI scenario, it could be IRQ or NMI.
However, intel_pt PMIs are only generated for HARDWARE perf events, and
HARDWARE events are always configured to generate NMIs. Use
kvm_handling_nmi_from_guest() to precisely identify if the intel_pt PMI
came from the guest; this avoids false positives if an intel_pt PMI/NMI
arrives while the host is handling an unrelated IRQ VM-Exit.
Fixes: db215756ae59 ("KVM: x86: More precisely identify NMI from guest when handling PMI")
Signed-off-by: Yanfei Xu <yanfei.xu@intel.com>
Message-Id: <20220523140821.1345605-1-yanfei.xu@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6fcee03df6a1a3101a77344be37bb85c6142d56c upstream.
This can cause various unexpected issues, since VM is partially
destroyed at that point.
For example when AVIC is enabled, this causes avic_vcpu_load to
access physical id page entry which is already freed by .vm_destroy.
Fixes: 8221c1370056 ("svm: Manage vcpu load/unload when enable AVIC")
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220322172449.235575-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fee060cd52d69c114b62d1a2948ea9648b5131f9 upstream.
Whenever x86_decode_emulated_instruction() detects a breakpoint, it
returns the value that kvm_vcpu_check_breakpoint() writes into its
pass-by-reference second argument. Unfortunately this is completely
bogus because the expected outcome of x86_decode_emulated_instruction
is an EMULATION_* value.
Then, if kvm_vcpu_check_breakpoint() does "*r = 0" (corresponding to
a KVM_EXIT_DEBUG userspace exit), it is misunderstood as EMULATION_OK
and x86_emulate_instruction() is called without having decoded the
instruction. This causes various havoc from running with a stale
emulation context.
The fix is to move the call to kvm_vcpu_check_breakpoint() where it was
before commit 4aa2691dcbd3 ("KVM: x86: Factor out x86 instruction
emulation with decoding") introduced x86_decode_emulated_instruction().
The other caller of the function does not need breakpoint checks,
because it is invoked as part of a vmexit and the processor has already
checked those before executing the instruction that #GP'd.
This fixes CVE-2022-1852.
Reported-by: Qiuhao Li <qiuhao@sysec.org>
Reported-by: Gaoning Pan <pgn@zju.edu.cn>
Reported-by: Yongkang Jia <kangel@zju.edu.cn>
Fixes: 4aa2691dcbd3 ("KVM: x86: Factor out x86 instruction emulation with decoding")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220311032801.3467418-2-seanjc@google.com>
[Rewrote commit message according to Qiuhao's report, since a patch
already existed to fix the bug. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 33fbe6befa622c082f7d417896832856814bdde0 upstream.
This shows up as a TDP MMU leak when running nested. Non-working cmpxchg on L0
relies makes L1 install two different shadow pages under same spte, and one of
them is leaked.
Fixes: 1c2361f667f36 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220512101420.306759-1-mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1c2361f667f3648855ceae25f1332c18413fdb9f upstream.
Use the recently introduce __try_cmpxchg_user() to emulate atomic guest
accesses via the associated userspace address instead of mapping the
backing pfn into kernel address space. Using kvm_vcpu_map() is unsafe as
it does not coordinate with KVM's mmu_notifier to ensure the hva=>pfn
translation isn't changed/unmapped in the memremap() path, i.e. when
there's no struct page and thus no elevated refcount.
Fixes: 42e35f8072c3 ("KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220202004945.2540433-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f122dfe4476890d60b8c679128cd2259ec96a24c upstream.
Use the recently introduced __try_cmpxchg_user() to update guest PTE A/D
bits instead of mapping the PTE into kernel address space. The VM_PFNMAP
path is broken as it assumes that vm_pgoff is the base pfn of the mapped
VMA range, which is conceptually wrong as vm_pgoff is the offset relative
to the file and has nothing to do with the pfn. The horrific hack worked
for the original use case (backing guest memory with /dev/mem), but leads
to accessing "random" pfns for pretty much any other VM_PFNMAP case.
Fixes: bd53cb35a3e9 ("X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs")
Debugged-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Tested-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Reported-by: syzbot+6cde2282daa792c49ab8@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220202004945.2540433-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 989b5db215a2f22f89d730b607b071d964780f10 upstream.
Add support for CMPXCHG loops on userspace addresses. Provide both an
"unsafe" version for tight loops that do their own uaccess begin/end, as
well as a "safe" version for use cases where the CMPXCHG is not buried in
a loop, e.g. KVM will resume the guest instead of looping when emulation
of a guest atomic accesses fails the CMPXCHG.
Provide 8-byte versions for 32-bit kernels so that KVM can do CMPXCHG on
guest PAE PTEs, which are accessed via userspace addresses.
Guard the asm_volatile_goto() variation with CC_HAS_ASM_GOTO_TIED_OUTPUT,
the "+m" constraint fails on some compilers that otherwise support
CC_HAS_ASM_GOTO_OUTPUT.
Cc: stable@vger.kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220202004945.2540433-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit baec4f5a018fe2d708fc1022330dba04b38b5fe3 upstream.
Commit ddd7ed842627 ("x86/kvm: Alloc dummy async #PF token outside of
raw spinlock") leads to the following Smatch static checker warning:
arch/x86/kernel/kvm.c:212 kvm_async_pf_task_wake()
warn: sleeping in atomic context
arch/x86/kernel/kvm.c
202 raw_spin_lock(&b->lock);
203 n = _find_apf_task(b, token);
204 if (!n) {
205 /*
206 * Async #PF not yet handled, add a dummy entry for the token.
207 * Allocating the token must be down outside of the raw lock
208 * as the allocator is preemptible on PREEMPT_RT kernels.
209 */
210 if (!dummy) {
211 raw_spin_unlock(&b->lock);
--> 212 dummy = kzalloc(sizeof(*dummy), GFP_KERNEL);
^^^^^^^^^^
Smatch thinks the caller has preempt disabled. The `smdb.py preempt
kvm_async_pf_task_wake` output call tree is:
sysvec_kvm_asyncpf_interrupt() <- disables preempt
-> __sysvec_kvm_asyncpf_interrupt()
-> kvm_async_pf_task_wake()
The caller is this:
arch/x86/kernel/kvm.c
290 DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_asyncpf_interrupt)
291 {
292 struct pt_regs *old_regs = set_irq_regs(regs);
293 u32 token;
294
295 ack_APIC_irq();
296
297 inc_irq_stat(irq_hv_callback_count);
298
299 if (__this_cpu_read(apf_reason.enabled)) {
300 token = __this_cpu_read(apf_reason.token);
301 kvm_async_pf_task_wake(token);
302 __this_cpu_write(apf_reason.token, 0);
303 wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
304 }
305
306 set_irq_regs(old_regs);
307 }
The DEFINE_IDTENTRY_SYSVEC() is a wrapper that calls this function
from the call_on_irqstack_cond(). It's inside the call_on_irqstack_cond()
where preempt is disabled (unless it's already disabled). The
irq_enter/exit_rcu() functions disable/enable preempt.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0547758a6de3cc71a0cfdd031a3621a30db6a68b upstream.
Drop the raw spinlock in kvm_async_pf_task_wake() before allocating the
the dummy async #PF token, the allocator is preemptible on PREEMPT_RT
kernels and must not be called from truly atomic contexts.
Opportunistically document why it's ok to loop on allocation failure,
i.e. why the function won't get stuck in an infinite loop.
Reported-by: Yajun Deng <yajun.deng@linux.dev>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d187ba5312307d51818beafaad87d28a7d939adf upstream.
Set the starting uABI size of KVM's guest FPU to 'struct kvm_xsave',
i.e. to KVM's historical uABI size. When saving FPU state for usersapce,
KVM (well, now the FPU) sets the FP+SSE bits in the XSAVE header even if
the host doesn't support XSAVE. Setting the XSAVE header allows the VM
to be migrated to a host that does support XSAVE without the new host
having to handle FPU state that may or may not be compatible with XSAVE.
Setting the uABI size to the host's default size results in out-of-bounds
writes (setting the FP+SSE bits) and data corruption (that is thankfully
caught by KASAN) when running on hosts without XSAVE, e.g. on Core2 CPUs.
WARN if the default size is larger than KVM's historical uABI size; all
features that can push the FPU size beyond the historical size must be
opt-in.
==================================================================
BUG: KASAN: slab-out-of-bounds in fpu_copy_uabi_to_guest_fpstate+0x86/0x130
Read of size 8 at addr ffff888011e33a00 by task qemu-build/681
CPU: 1 PID: 681 Comm: qemu-build Not tainted 5.18.0-rc5-KASAN-amd64 #1
Hardware name: /DG35EC, BIOS ECG3510M.86A.0118.2010.0113.1426 01/13/2010
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x45
print_report.cold+0x45/0x575
kasan_report+0x9b/0xd0
fpu_copy_uabi_to_guest_fpstate+0x86/0x130
kvm_arch_vcpu_ioctl+0x72a/0x1c50 [kvm]
kvm_vcpu_ioctl+0x47f/0x7b0 [kvm]
__x64_sys_ioctl+0x5de/0xc90
do_syscall_64+0x31/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
Allocated by task 0:
(stack is not available)
The buggy address belongs to the object at ffff888011e33800
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 0 bytes to the right of
512-byte region [ffff888011e33800, ffff888011e33a00)
The buggy address belongs to the physical page:
page:0000000089cd4adb refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11e30
head:0000000089cd4adb order:2 compound_mapcount:0 compound_pincount:0
flags: 0x4000000000010200(slab|head|zone=1)
raw: 4000000000010200 dead000000000100 dead000000000122 ffff888001041c80
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888011e33900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888011e33980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff888011e33a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff888011e33a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888011e33b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
Fixes: be50b2065dfa ("kvm: x86: Add support for getting/setting expanded xstate buffer")
Fixes: c60427dd50ba ("x86/fpu: Add uabi_size to guest_fpu")
Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Zdenek Kaspar <zkaspar82@gmail.com>
Message-Id: <20220504001219.983513-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 300981abddcb13f8f06ad58f52358b53a8096775 upstream.
The bug is here:
if (!p)
return ret;
The list iterator value 'p' will *always* be set and non-NULL by
list_for_each_entry(), so it is incorrect to assume that the iterator
value will be NULL if the list is empty or no element is found.
To fix the bug, Use a new value 'iter' as the list iterator, while use
the old value 'p' as a dedicated variable to point to the found element.
Fixes: dfaa973ae960 ("KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs")
Cc: stable@vger.kernel.org # v5.9+
Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220414062103.8153-1-xiam0nd.tong@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e10e2f58030c5c211d49042a8c2a1b93d40b2ffb upstream.
In the event that random_get_entropy() can't access a cycle counter or
similar, falling back to returning 0 is really not the best we can do.
Instead, at least calling random_get_entropy_fallback() would be
preferable, because that always needs to return _something_, even
falling back to jiffies eventually. It's not as though
random_get_entropy_fallback() is super high precision or guaranteed to
be entropic, but basically anything that's not zero all the time is
better than returning zero all the time.
This is accomplished by just including the asm-generic code like on
other architectures, which means we can get rid of the empty stub
function here.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ac9756c79797bb98972736b13cfb239fd2cffb79 upstream.
In the event that random_get_entropy() can't access a cycle counter or
similar, falling back to returning 0 is really not the best we can do.
Instead, at least calling random_get_entropy_fallback() would be
preferable, because that always needs to return _something_, even
falling back to jiffies eventually. It's not as though
random_get_entropy_fallback() is super high precision or guaranteed to
be entropic, but basically anything that's not zero all the time is
better than returning zero all the time.
This is accomplished by just including the asm-generic code like on
other architectures, which means we can get rid of the empty stub
function here.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9f13fb0cd11ed2327abff69f6501a2c124c88b5a upstream.
In the event that random_get_entropy() can't access a cycle counter or
similar, falling back to returning 0 is really not the best we can do.
Instead, at least calling random_get_entropy_fallback() would be
preferable, because that always needs to return _something_, even
falling back to jiffies eventually. It's not as though
random_get_entropy_fallback() is super high precision or guaranteed to
be entropic, but basically anything that's not zero all the time is
better than returning zero all the time.
This is accomplished by just including the asm-generic code like on
other architectures, which means we can get rid of the empty stub
function here.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3bd4abc07a267e6a8b33d7f8717136e18f921c53 upstream.
In the event that random_get_entropy() can't access a cycle counter or
similar, falling back to returning 0 is suboptimal. Instead, fallback
to calling random_get_entropy_fallback(), which isn't extremely high
precision or guaranteed to be entropic, but is certainly better than
returning zero all the time.
If CONFIG_X86_TSC=n, then it's possible for the kernel to run on systems
without RDTSC, such as 486 and certain 586, so the fallback code is only
required for that case.
As well, fix up both the new function and the get_cycles() function from
which it was derived to use cpu_feature_enabled() rather than
boot_cpu_has(), and use !IS_ENABLED() instead of #ifndef.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c04e72700f2293013dab40208e809369378f224c upstream.
In the event that random_get_entropy() can't access a cycle counter or
similar, falling back to returning 0 is really not the best we can do.
Instead, at least calling random_get_entropy_fallback() would be
preferable, because that always needs to return _something_, even
falling back to jiffies eventually. It's not as though
random_get_entropy_fallback() is super high precision or guaranteed to
be entropic, but basically anything that's not zero all the time is
better than returning zero all the time.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ff8a8f59c99f6a7c656387addc4d9f2247d75077 upstream.
In the event that random_get_entropy() can't access a cycle counter or
similar, falling back to returning 0 is really not the best we can do.
Instead, at least calling random_get_entropy_fallback() would be
preferable, because that always needs to return _something_, even
falling back to jiffies eventually. It's not as though
random_get_entropy_fallback() is super high precision or guaranteed to
be entropic, but basically anything that's not zero all the time is
better than returning zero all the time.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1c99c6a7c3c599a68321b01b9ec243215ede5a68 upstream.
For situations in which we don't have a c0 counter register available,
we've been falling back to reading the c0 "random" register, which is
usually bounded by the amount of TLB entries and changes every other
cycle or so. This means it wraps extremely often. We can do better by
combining this fast-changing counter with a potentially slower-changing
counter from random_get_entropy_fallback() in the more significant bits.
This commit combines the two, taking into account that the changing bits
are in a different bit position depending on the CPU model. In addition,
we previously were falling back to 0 for ancient CPUs that Linux does
not support anyway; remove that dead path entirely.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Tested-by: Maciej W. Rozycki <macro@orcam.me.uk>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6d01238623faa9425f820353d2066baf6c9dc872 upstream.
In the event that random_get_entropy() can't access a cycle counter or
similar, falling back to returning 0 is really not the best we can do.
Instead, at least calling random_get_entropy_fallback() would be
preferable, because that always needs to return _something_, even
falling back to jiffies eventually. It's not as though
random_get_entropy_fallback() is super high precision or guaranteed to
be entropic, but basically anything that's not zero all the time is
better than returning zero all the time.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0f392c95391f2d708b12971a07edaa7973f9eece upstream.
In the event that random_get_entropy() can't access a cycle counter or
similar, falling back to returning 0 is really not the best we can do.
Instead, at least calling random_get_entropy_fallback() would be
preferable, because that always needs to return _something_, even
falling back to jiffies eventually. It's not as though
random_get_entropy_fallback() is super high precision or guaranteed to
be entropic, but basically anything that's not zero all the time is
better than returning zero all the time.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 408835832158df0357e18e96da7f2d1ed6b80e7f upstream.
PowerPC defines a get_cycles() function, but it does not do the usual
`#define get_cycles get_cycles` dance, making it impossible for generic
code to see if an arch-specific function was defined. While the
get_cycles() ifdef is not currently used, the following timekeeping
patch in this series will depend on the macro existing (or not existing)
when defining random_get_entropy().
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@ozlabs.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1097710bc9660e1e588cf2186a35db3d95c4d258 upstream.
Alpha defines a get_cycles() function, but it does not do the usual
`#define get_cycles get_cycles` dance, making it impossible for generic
code to see if an arch-specific function was defined. While the
get_cycles() ifdef is not currently used, the following timekeeping
patch in this series will depend on the macro existing (or not existing)
when defining random_get_entropy().
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8865bbe6ba1120e67f72201b7003a16202cd42be upstream.
PA-RISC defines a get_cycles() function, but it does not do the usual
`#define get_cycles get_cycles` dance, making it impossible for generic
code to see if an arch-specific function was defined. While the
get_cycles() ifdef is not currently used, the following timekeeping
patch in this series will depend on the macro existing (or not existing)
when defining random_get_entropy().
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2e3df523256cb9836de8441e9c791a796759bb3c upstream.
S390x defines a get_cycles() function, but it does not do the usual
`#define get_cycles get_cycles` dance, making it impossible for generic
code to see if an arch-specific function was defined. While the
get_cycles() ifdef is not currently used, the following timekeeping
patch in this series will depend on the macro existing (or not existing)
when defining random_get_entropy().
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 57c0900b91d8891ab43f0e6b464d059fda51d102 upstream.
Itanium defines a get_cycles() function, but it does not do the usual
`#define get_cycles get_cycles` dance, making it impossible for generic
code to see if an arch-specific function was defined. While the
get_cycles() ifdef is not currently used, the following timekeeping
patch in this series will depend on the macro existing (or not existing)
when defining random_get_entropy().
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Correctly expose GICv3 support even if no irqchip is created so
that userspace doesn't observe it changing pointlessly (fixing a
regression with QEMU)
- Don't issue a hypercall to set the id-mapped vectors when protected
mode is enabled (fix for pKVM in combination with CPUs affected by
Spectre-v3a)
x86 (five oneliners, of which the most interesting two are):
- a NULL pointer dereference on INVPCID executed with paging
disabled, but only if KVM is using shadow paging
- an incorrect bsearch comparison function which could truncate the
result and apply PMU event filtering incorrectly. This one comes
with a selftests update too"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/mmu: fix NULL pointer dereference on guest INVPCID
KVM: x86: hyper-v: fix type of valid_bank_mask
KVM: Free new dirty bitmap if creating a new memslot fails
KVM: eventfd: Fix false positive RCU usage warning
selftests: kvm/x86: Verify the pmu event filter matches the correct event
selftests: kvm/x86: Add the helper function create_pmu_event_filter
kvm: x86/pmu: Fix the compare function used by the pmu event filter
KVM: arm64: Don't hypercall before EL2 init
KVM: arm64: vgic-v3: Consistently populate ID_AA64PFR0_EL1.GIC
KVM: x86/mmu: Update number of zapped pages even if page list is stable
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- fix the fu540-c000 device tree to avoid a schema check failure on the
DMA node name
- fix typo in the PolarFire SOC device tree
* tag 'riscv-for-linus-5.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: dts: microchip: fix gpio1 reg property typo
riscv: dts: sifive: fu540-c000: align dma node name with dtschema
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Three arm64 fixes for -rc8/final.
The MTE and stolen time fixes have been doing the rounds for a little
while, but review and testing feedback was ongoing until earlier this
week. The kexec fix showed up on Monday and addresses a failure
observed under Qemu.
Summary:
- Add missing write barrier to publish MTE tags before a pte update
- Fix kexec relocation clobbering its own data structures
- Fix stolen time crash if a timer IRQ fires during CPU hotplug"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: mte: Ensure the cleared tags are visible before setting the PTE
arm64: kexec: load from kimage prior to clobbering
arm64: paravirt: Use RCU read locks to guard stolen_time
|
|
With shadow paging enabled, the INVPCID instruction results in a call
to kvm_mmu_invpcid_gva. If INVPCID is executed with CR0.PG=0, the
invlpg callback is not set and the result is a NULL pointer dereference.
Fix it trivially by checking for mmu->invlpg before every call.
There are other possibilities:
- check for CR0.PG, because KVM (like all Intel processors after P5)
flushes guest TLB on CR0.PG changes so that INVPCID/INVLPG are a
nop with paging disabled
- check for EFER.LMA, because KVM syncs and flushes when switching
MMU contexts outside of 64-bit mode
All of these are tricky, go for the simple solution. This is CVE-2022-1789.
Reported-by: Yongkang Jia <kangel@zju.edu.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
In kvm_hv_flush_tlb(), valid_bank_mask is declared as unsigned long,
but is used as u64, which is wrong for i386, and has been spotted by
LKP after applying "KVM: x86: hyper-v: replace bitmap_weight() with
hweight64()"
https://lore.kernel.org/lkml/20220510154750.212913-12-yury.norov@gmail.com/
But it's wrong even without that patch because now bitmap_weight()
dereferences a word after valid_bank_mask on i386.
>> include/asm-generic/bitops/const_hweight.h:21:76: warning: right shift count >= width of type
+[-Wshift-count-overflow]
21 | #define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
| ^~
include/asm-generic/bitops/const_hweight.h:10:16: note: in definition of macro '__const_hweight8'
10 | ((!!((w) & (1ULL << 0))) + \
| ^
include/asm-generic/bitops/const_hweight.h:20:31: note: in expansion of macro '__const_hweight16'
20 | #define __const_hweight32(w) (__const_hweight16(w) + __const_hweight16((w) >> 16))
| ^~~~~~~~~~~~~~~~~
include/asm-generic/bitops/const_hweight.h:21:54: note: in expansion of macro '__const_hweight32'
21 | #define __const_hweight64(w) (__const_hweight32(w) + __const_hweight32((w) >> 32))
| ^~~~~~~~~~~~~~~~~
include/asm-generic/bitops/const_hweight.h:29:49: note: in expansion of macro '__const_hweight64'
29 | #define hweight64(w) (__builtin_constant_p(w) ? __const_hweight64(w) : __arch_hweight64(w))
| ^~~~~~~~~~~~~~~~~
arch/x86/kvm/hyperv.c:1983:36: note: in expansion of macro 'hweight64'
1983 | if (hc->var_cnt != hweight64(valid_bank_mask))
| ^~~~~~~~~
CC: Borislav Petkov <bp@alien8.de>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: Jim Mattson <jmattson@google.com>
CC: Joerg Roedel <joro@8bytes.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Sean Christopherson <seanjc@google.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Wanpeng Li <wanpengli@tencent.com>
CC: kvm@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: x86@kernel.org
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Message-Id: <20220519171504.1238724-1-yury.norov@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When returning from the compare function the u64 is truncated to an
int. This results in a loss of the high nybble[1] in the event select
and its sign if that nybble is in use. Switch from using a result that
can end up being truncated to a result that can only be: 1, 0, -1.
[1] bits 35:32 in the event select register and bits 11:8 in the event
select.
Fixes: 7ff775aca48ad ("KVM: x86/pmu: Use binary search to check filtered events")
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220517051238.2566934-1-aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Fix reg address typo in the gpio1 stanza.
Signed-off-by: Conor Paxton <conor.paxton@microchip.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Fixes: 528a5b1f2556 ("riscv: dts: microchip: add new peripherals to icicle kit device tree")
Link: https://lore.kernel.org/r/20220517104058.2004734-1-conor.paxton@microchip.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
Fixes dtbs_check warnings like:
dma@3000000: $nodename:0: 'dma@3000000' does not match '^dma-controller(@.*)?$'
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Link: https://lore.kernel.org/r/20220407193856.18223-1-krzysztof.kozlowski@linaro.org
Fixes: c5ab54e9945b ("riscv: dts: add support for PDMA device of HiFive Unleashed Rev A00")
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc architecture fixes from Helge Deller:
"We had two big outstanding issues after v5.18-rc6:
a) 32-bit kernels on 64-bit machines (e.g. on a C3700 which is able
to run 32- and 64-bit kernels) failed early in userspace.
b) 64-bit kernels on PA8800/PA8900 CPUs (e.g. in a C8000) showed
random userspace segfaults. We assumed that those problems were
caused by the tmpalias flushes.
Dave did a lot of testing and reorganization of the current flush code
and fixed the 32-bit cache flushing. For PA8800/PA8900 CPUs he
switched the code to flush using the virtual address of user and
kernel pages instead of using tmpalias flushes. The tmpalias flushes
don't seem to work reliable on such CPUs.
We tested the patches on a wide range machines (715/64, B160L, C3000,
C3700, C8000, rp3440) and they have been in for-next without any
conflicts.
Summary:
- Rewrite the cache flush code for PA8800/PA8900 CPUs to flush using
the virtual address of user and kernel pages instead of using
tmpalias flushes. Testing showed, that tmpalias flushes don't work
reliably on PA8800/PA8900 CPUs
- Fix flush code to allow 32-bit kernels to run on 64-bit capable
machines, e.g. a 32-bit kernel on C3700 machines"
* tag 'for-5.18/parisc-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix patch code locking and flushing
parisc: Rewrite cache flush code for PA8800/PA8900
parisc: Disable debug code regarding cache flushes in handle_nadtlb_fault()
|
|
Pull ARM fixes from Russell King:
"Two further fixes for Spectre-BHB from Ard for Cortex A15 and to use
the wide branch instruction for Thumb2"
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9197/1: spectre-bhb: fix loop8 sequence for Thumb2
ARM: 9196/1: spectre-bhb: enable for Cortex-A15
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"The SoC bug fixes have calmed down sufficiently, there is one minor
update for the MAINTAINERS file, and few bug fixes for dts
descriptions:
- Updates to the BananaPi R2-Pro (rk3568) dts to match production
hardware rather than the prototype version.
- Qualcomm sm8250 soundwire gets disabled on some machines to avoid
crashes
- A number of aspeed SoC specific fixes, addressing incorrect pin
cotrol settings, some values in the romed8hm board, and a revert
for an accidental removal of a DT node"
* 'arm/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
MAINTAINERS: omap: remove me as a maintainer
ARM: dts: aspeed: Add video engine to g6
ARM: dts: aspeed: romed8hm3: Fix GPIOB0 name
ARM: dts: aspeed: romed8hm3: Add lm25066 sense resistor values
ARM: dts: aspeed-g6: fix SPI1/SPI2 quad pin group
ARM: dts: aspeed-g6: add FWQSPI group in pinctrl dtsi
dt-bindings: pinctrl: aspeed-g6: add FWQSPI function/group
pinctrl: pinctrl-aspeed-g6: add FWQSPI function-group
dt-bindings: pinctrl: aspeed-g6: remove FWQSPID group
pinctrl: pinctrl-aspeed-g6: remove FWQSPID group in pinctrl
ARM: dts: aspeed-g6: remove FWQSPID group in pinctrl dtsi
arm64: dts: qcom: sm8250: don't enable rx/tx macro by default
arm64: dts: rockchip: Add gmac1 and change network settings of bpi-r2-pro
arm64: dts: rockchip: Change io-domains of bpi-r2-pro
|
|
In Thumb2, 'b . + 4' produces a branch instruction that uses a narrow
encoding, and so it does not jump to the following instruction as
expected. So use W(b) instead.
Fixes: 6c7cb60bff7a ("ARM: fix Thumb2 regression with Spectre BHB")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
|
|
The Spectre-BHB mitigations were inadvertently left disabled for
Cortex-A15, due to the fact that cpu_v7_bugs_init() is not called in
that case. So fix that.
Fixes: b9baf5c8c5c3 ("ARM: Spectre-BHB workaround")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
|
|
This change fixes the following:
1) The flags variable is not initialized. Always use raw_spin_lock_irqsave
and raw_spin_unlock_irqrestore to serialize patching.
2) flush_kernel_vmap_range is primarily intended for DMA flushes.
The whole cache flush in flush_kernel_vmap_range is only possible
when interrupts are enabled on SMP machines. Since __patch_text_multiple
calls flush_kernel_vmap_range with interrupts disabled, it is better
to directly call flush_kernel_dcache_range_asm and
flush_kernel_icache_range_asm.
3) The final call to flush_icache_range is unnecessary.
Tested with `[PATCH, V3] parisc: Rewrite cache flush code for
PA8800/PA8900' change on rp3440, c8000 and c3750 (32 and 64-bit).
Note by Helge:
This patch had been temporarily reverted shortly before v5.18-rc6 in order
to fix boot issues. Now it can be re-applied.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
Originally, I was convinced that we needed to use tmpalias flushes
everwhere, for both user and kernel flushes. However, when I modified
flush_kernel_dcache_page_addr, to use a tmpalias flush, my c8000
would crash quite early when booting.
The PDC returns alias values of 0 for the icache and dcache. This
indicates that either the alias boundary is greater than 16MB or
equivalent aliasing doesn't work. I modified the tmpalias code to
make it easy to try alternate boundaries. I tried boundaries up to
128MB but still kernel tmpalias flushes didn't work on c8000.
This led me to conclude that tmpalias flushes don't work on PA8800
and PA8900 machines, and that we needed to flush directly using the
virtual address of user and kernel pages. This is likely the major
cause of instability on the c8000 and rp34xx machines.
Flushing user pages requires doing a temporary context switch as we
have to flush pages that don't belong to the current context. Further,
we have to deal with pages that aren't present. If a page isn't
present, the flush instructions fault on every line.
Other code has been rearranged and simplified based on testing. For
example, I introduced a flush_cache_dup_mm routine. flush_cache_mm
and flush_cache_dup_mm differ in that flush_cache_mm calls
purge_cache_pages and flush_cache_dup_mm calls flush_cache_pages.
In some implementations, pdc is more efficient than fdc. Based on
my testing, I don't believe there's any performance benefit on the
c8000.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
Change the "BUG" to "WARNING" and disable the message because it triggers
occasionally in spite of the check in flush_cache_page_if_present.
The pte value extracted for the "from" page in copy_user_highpage is racy and
occasionally the pte is cleared before the flush is complete. I assume that
the page is simultaneously flushed by flush_cache_mm before the pte is cleared
as nullifying the fdc doesn't seem to cause problems.
I investigated various locking scenarios but I wasn't able to find a way to
sequence the flushes. This code is called for every COW break and locks impact
performance.
This patch is related to the bigger cache flush patch because we need the pte
on PA8800/PA8900 to flush using the vma context.
I have also seen this from copy_to_user_page and copy_from_user_page.
The messages appear infrequently when enabled.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
|