Age | Commit message (Collapse) | Author |
|
commit 30faaafdfa0c754c91bac60f216c9f34a2bfdf7e upstream.
Commit 9c17d96500f7 ("xen/gntdev: Grant maps should not be subject to
NUMA balancing") set VM_IO flag to prevent grant maps from being
subjected to NUMA balancing.
It was discovered recently that this flag causes get_user_pages() to
always fail with -EFAULT.
check_vma_flags
__get_user_pages
__get_user_pages_locked
__get_user_pages_unlocked
get_user_pages_fast
iov_iter_get_pages
dio_refill_pages
do_direct_IO
do_blockdev_direct_IO
do_blockdev_direct_IO
ext4_direct_IO_read
generic_file_read_iter
aio_run_iocb
(which can happen if guest's vdisk has direct-io-safe option).
To avoid this let's use VM_MIXEDMAP flag instead --- it prevents
NUMA balancing just as VM_IO does and has no effect on
check_vma_flags().
Reported-by: Olaf Hering <olaf@aepfle.de>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Tested-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9a035a40f7f3f6708b79224b86c5777a3334f7ea upstream.
This should really only be done for XS_TRANSACTION_END messages, or
else at least some of the xenstore-* tools don't work anymore.
Fixes: 0beef634b8 ("xenbus: don't BUG() on user mode induced condition")
Reported-by: Richard Schütz <rschuetz@uni-koblenz.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Richard Schütz <rschuetz@uni-koblenz.de>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: "M. Vefa Bicakci" <m.v.b@runbox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7469be95a487319514adce2304ad2af3553d2fc9 upstream.
xenbus_dev_request_and_reply() needs to track whether a transaction is
open. For XS_TRANSACTION_START messages it calls transaction_start()
and for XS_TRANSACTION_END messages it calls transaction_end().
If sending an XS_TRANSACTION_START message fails or responds with an
an error, the transaction is not open and transaction_end() must be
called.
If sending an XS_TRANSACTION_END message fails, the transaction is
still open, but if an error response is returned the transaction is
closed.
Commit 027bd7e89906 ("xen/xenbus: Avoid synchronous wait on XenBus
stalling shutdown/restart") introduced a regression where failed
XS_TRANSACTION_START messages were leaving the transaction open. This
can cause problems with suspend (and migration) as all transactions
must be closed before suspending.
It appears that the problematic change was added accidentally, so just
remove it.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0beef634b86a1350c31da5fcc2992f0d7c8a622b upstream.
Inability to locate a user mode specified transaction ID should not
lead to a kernel crash. For other than XS_TRANSACTION_START also
don't issue anything to xenbus if the specified ID doesn't match that
of any active transaction.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 02ef871ecac290919ea0c783d05da7eedeffc10e upstream.
Current overlap check is evaluating to false a case where a filter
field is fully contained (proper subset) of a r/w request. This
change applies classical overlap check instead to include all the
scenarios.
More specifically, for (Hilscher GmbH CIFX 50E-DP(M/S)) device driver
the logic is such that the entire confspace is read and written in 4
byte chunks. In this case as an example, CACHE_LINE_SIZE,
LATENCY_TIMER and PCI_BIST are arriving together in one call to
xen_pcibk_config_write() with offset == 0xc and size == 4. With the
exsisting overlap check the LATENCY_TIMER field (offset == 0xd, length
== 1) is fully contained in the write request and hence is excluded
from write, which is incorrect.
Signed-off-by: Andrey Grodzovsky <andrey2805@gmail.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6f2d9d99213514360034c6d52d2c3919290b3504 upstream.
As of Xen 4.7 PV CPUID doesn't expose either of CPUID[1].ECX[7] and
CPUID[0x80000007].EDX[7] anymore, causing the driver to fail to load on
both Intel and AMD systems. Doing any kind of hardware capability
checks in the driver as a prerequisite was wrong anyway: With the
hypervisor being in charge, all such checking should be done by it. If
ACPI data gets uploaded despite some missing capability, the hypervisor
is free to ignore part or all of that data.
Ditch the entire check_prereq() function, and do the only valid check
(xen_initial_domain()) in the caller in its place.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 842775f1509054ea969f1787f38d6a0ec2ccfaba upstream.
Fix a declared-but-not-defined warning when building with
XEN_BALLOON_MEMORY_HOTPLUG=n. This fixes a regression introduced by
commit dfd74a1edfab ("xen/balloon: Fix crash when ballooning on x86 32
bit PAE").
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f0f393877c71ad227d36705d61d1e4062bc29cf5 upstream.
Commit ff1e22e7a638 ("xen/events: Mask a moving irq") open-coded
irq_move_irq() but left out checking if the IRQ is disabled. This broke
resuming from suspend since it tries to move a (disabled) irq without
holding the IRQ's desc->lock. Fix it by adding in a check for disabled
IRQs.
The resulting stacktrace was:
kernel BUG at /build/linux-UbQGH5/linux-4.4.0/kernel/irq/migration.c:31!
invalid opcode: 0000 [#1] SMP
Modules linked in: xenfs xen_privcmd ...
CPU: 0 PID: 9 Comm: migration/0 Not tainted 4.4.0-22-generic #39-Ubuntu
Hardware name: Xen HVM domU, BIOS 4.6.1-xs125180 05/04/2016
task: ffff88003d75ee00 ti: ffff88003d7bc000 task.ti: ffff88003d7bc000
RIP: 0010:[<ffffffff810e26e2>] [<ffffffff810e26e2>] irq_move_masked_irq+0xd2/0xe0
RSP: 0018:ffff88003d7bfc50 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff88003d40ba00 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000100 RDI: ffff88003d40bad8
RBP: ffff88003d7bfc68 R08: 0000000000000000 R09: ffff88003d000000
R10: 0000000000000000 R11: 000000000000023c R12: ffff88003d40bad0
R13: ffffffff81f3a4a0 R14: 0000000000000010 R15: 00000000ffffffff
FS: 0000000000000000(0000) GS:ffff88003da00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd4264de624 CR3: 0000000037922000 CR4: 00000000003406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
ffff88003d40ba38 0000000000000024 0000000000000000 ffff88003d7bfca0
ffffffff814c8d92 00000010813ef89d 00000000805ea732 0000000000000009
0000000000000024 ffff88003cc39b80 ffff88003d7bfce0 ffffffff814c8f66
Call Trace:
[<ffffffff814c8d92>] eoi_pirq+0xb2/0xf0
[<ffffffff814c8f66>] __startup_pirq+0xe6/0x150
[<ffffffff814ca659>] xen_irq_resume+0x319/0x360
[<ffffffff814c7e75>] xen_suspend+0xb5/0x180
[<ffffffff81120155>] multi_cpu_stop+0xb5/0xe0
[<ffffffff811200a0>] ? cpu_stop_queue_work+0x80/0x80
[<ffffffff811203d0>] cpu_stopper_thread+0xb0/0x140
[<ffffffff810a94e6>] ? finish_task_switch+0x76/0x220
[<ffffffff810ca731>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[<ffffffff810a3935>] smpboot_thread_fn+0x105/0x160
[<ffffffff810a3830>] ? sort_range+0x30/0x30
[<ffffffff810a0588>] kthread+0xd8/0xf0
[<ffffffff810a04b0>] ? kthread_create_on_node+0x1e0/0x1e0
[<ffffffff8182568f>] ret_from_fork+0x3f/0x70
[<ffffffff810a04b0>] ? kthread_create_on_node+0x1e0/0x1e0
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 27e0e6385377c4dc68a4ddaf1a35a2dfa951f3c5 upstream.
The copying of ring data was wrong for two cases: For a full ring
nothing got copied at all (as in that case the canonicalized producer
and consumer indexes are identical). And in case one or both of the
canonicalized (after the resize) indexes would point into the second
half of the buffer, the copied data ended up in the wrong (free) part
of the new buffer. In both cases uninitialized data would get passed
back to the caller.
Fix this by simply copying the old ring contents twice: Once to the
low half of the new buffer, and a second time to the high half.
This addresses the inability to boot a HVM guest with 64 or more
vCPUs. This regression was caused by 8620015499101090 (xen/evtchn:
dynamically grow pending event channel ring).
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dfd74a1edfaba5864276a2859190a8d242d18952 upstream.
Commit 55b3da98a40dbb3776f7454daf0d95dde25c33d2 (xen/balloon: find
non-conflicting regions to place hotplugged memory) caused a
regression in 4.4.
When ballooning on an x86 32 bit PAE system with close to 64 GiB of
memory, the address returned by allocate_resource may be above 64 GiB.
When using CONFIG_SPARSEMEM, this setup is limited to using physical
addresses < 64 GiB. When adding memory at this address, it runs off
the end of the mem_section array and causes a crash. Instead, fail
the ballooning request.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ff1e22e7a638a0782f54f81a6c9cb139aca2da35 upstream.
Moving an unmasked irq may result in irq handler being invoked on both
source and target CPUs.
With 2-level this can happen as follows:
On source CPU:
evtchn_2l_handle_events() ->
generic_handle_irq() ->
handle_edge_irq() ->
eoi_pirq():
irq_move_irq(data);
/***** WE ARE HERE *****/
if (VALID_EVTCHN(evtchn))
clear_evtchn(evtchn);
If at this moment target processor is handling an unrelated event in
evtchn_2l_handle_events()'s loop it may pick up our event since target's
cpu_evtchn_mask claims that this event belongs to it *and* the event is
unmasked and still pending. At the same time, source CPU will continue
executing its own handle_edge_irq().
With FIFO interrupt the scenario is similar: irq_move_irq() may result
in a EVTCHNOP_unmask hypercall which, in turn, may make the event
pending on the target CPU.
We can avoid this situation by moving and clearing the event while
keeping event masked.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d159457b84395927b5a52adb72f748dd089ad5e5 upstream.
Commit 8135cf8b092723dbfcc611fe6fdcb3a36c9951c5 (xen/pciback: Save
xen_pci_op commands before processing it) broke enabling MSI-X because
it would never copy the resulting vectors into the response. The
number of vectors requested was being overwritten by the return value
(typically zero for success).
Save the number of vectors before processing the op, so the correct
number of vectors are copied afterwards.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8d47065f7d1980dde52abb874b301054f3013602 upstream.
Commit 408fb0e5aa7fda0059db282ff58c3b2a4278baa0 (xen/pciback: Don't
allow MSI-X ops if PCI_COMMAND_MEMORY is not set) prevented enabling
MSI-X on passed-through virtual functions, because it checked the VF
for PCI_COMMAND_MEMORY but this is not a valid bit for VFs.
Instead, check the physical function for PCI_COMMAND_MEMORY.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f285aa8db7cc4432c1a03f8b55ff34fe96317c11 upstream.
When adding a new frontend to xen-scsiback don't decrement the number
of active frontends in case of no error. Doing so results in a failure
when trying to remove the xen-pvscsi nexus even if no domain is using
it.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel:
- XSA-155 security fixes to backend drivers.
- XSA-157 security fixes to pciback.
* tag 'for-linus-4.4-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen-pciback: fix up cleanup path when alloc fails
xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
xen/pciback: Do not install an IRQ handler for MSI interrupts.
xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
xen/pciback: Save xen_pci_op commands before processing it
xen-scsiback: safely copy requests
xen-blkback: read from indirect descriptors only once
xen-blkback: only read request operation from shared ring once
xen-netback: use RING_COPY_REQUEST() throughout
xen-netback: don't use last request to determine minimum Tx credit
xen: Add RING_COPY_REQUEST()
xen/x86/pvh: Use HVM's flush_tlb_others op
xen: Resume PMU from non-atomic context
xen/events/fifo: Consume unprocessed events when a CPU dies
|
|
When allocating a pciback device fails, clear the private
field. This could lead to an use-after free, however
the 'really_probe' takes care of setting
dev_set_drvdata(dev, NULL) in its failure path (which we would
exercise if the ->probe function failed), so we we
are OK. However lets be defensive as the code can change.
Going forward we should clean up the pci_set_drvdata(dev, NULL)
in the various code-base. That will be for another day.
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reported-by: Jonathan Creekmore <jonathan.creekmore@gmail.com>
Signed-off-by: Doug Goldstein <cardoe@cardoe.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
commit f598282f51 ("PCI: Fix the NIU MSI-X problem in a better way")
teaches us that dealing with MSI-X can be troublesome.
Further checks in the MSI-X architecture shows that if the
PCI_COMMAND_MEMORY bit is turned of in the PCI_COMMAND we
may not be able to access the BAR (since they are memory regions).
Since the MSI-X tables are located in there.. that can lead
to us causing PCIe errors. Inhibit us performing any
operation on the MSI-X unless the MEMORY bit is set.
Note that Xen hypervisor with:
"x86/MSI-X: access MSI-X table only after having enabled MSI-X"
will return:
xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3!
When the generic MSI code tries to setup the PIRQ without
MEMORY bit set. Which means with later versions of Xen
(4.6) this patch is not neccessary.
This is part of XSA-157
CC: stable@vger.kernel.org
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
MSI(X) enabled.
Otherwise just continue on, returning the same values as
previously (return of 0, and op->result has the PIRQ value).
This does not change the behavior of XEN_PCI_OP_disable_msi[|x].
The pci_disable_msi or pci_disable_msix have the checks for
msi_enabled or msix_enabled so they will error out immediately.
However the guest can still call these operations and cause
us to disable the 'ack_intr'. That means the backend IRQ handler
for the legacy interrupt will not respond to interrupts anymore.
This will lead to (if the device is causing an interrupt storm)
for the Linux generic code to disable the interrupt line.
Naturally this will only happen if the device in question
is plugged in on the motherboard on shared level interrupt GSI.
This is part of XSA-157
CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
Otherwise an guest can subvert the generic MSI code to trigger
an BUG_ON condition during MSI interrupt freeing:
for (i = 0; i < entry->nvec_used; i++)
BUG_ON(irq_has_action(entry->irq + i));
Xen PCI backed installs an IRQ handler (request_irq) for
the dev->irq whenever the guest writes PCI_COMMAND_MEMORY
(or PCI_COMMAND_IO) to the PCI_COMMAND register. This is
done in case the device has legacy interrupts the GSI line
is shared by the backend devices.
To subvert the backend the guest needs to make the backend
to change the dev->irq from the GSI to the MSI interrupt line,
make the backend allocate an interrupt handler, and then command
the backend to free the MSI interrupt and hit the BUG_ON.
Since the backend only calls 'request_irq' when the guest
writes to the PCI_COMMAND register the guest needs to call
XEN_PCI_OP_enable_msi before any other operation. This will
cause the generic MSI code to setup an MSI entry and
populate dev->irq with the new PIRQ value.
Then the guest can write to PCI_COMMAND PCI_COMMAND_MEMORY
and cause the backend to setup an IRQ handler for dev->irq
(which instead of the GSI value has the MSI pirq). See
'xen_pcibk_control_isr'.
Then the guest disables the MSI: XEN_PCI_OP_disable_msi
which ends up triggering the BUG_ON condition in 'free_msi_irqs'
as there is an IRQ handler for the entry->irq (dev->irq).
Note that this cannot be done using MSI-X as the generic
code does not over-write dev->irq with the MSI-X PIRQ values.
The patch inhibits setting up the IRQ handler if MSI or
MSI-X (for symmetry reasons) code had been called successfully.
P.S.
Xen PCIBack when it sets up the device for the guest consumption
ends up writting 0 to the PCI_COMMAND (see xen_pcibk_reset_device).
XSA-120 addendum patch removed that - however when upstreaming said
addendum we found that it caused issues with qemu upstream. That
has now been fixed in qemu upstream.
This is part of XSA-157
CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
MSI-X enabled
The guest sequence of:
a) XEN_PCI_OP_enable_msix
b) XEN_PCI_OP_enable_msix
results in hitting an NULL pointer due to using freed pointers.
The device passed in the guest MUST have MSI-X capability.
The a) constructs and SysFS representation of MSI and MSI groups.
The b) adds a second set of them but adding in to SysFS fails (duplicate entry).
'populate_msi_sysfs' frees the newly allocated msi_irq_groups (note that
in a) pdev->msi_irq_groups is still set) and also free's ALL of the
MSI-X entries of the device (the ones allocated in step a) and b)).
The unwind code: 'free_msi_irqs' deletes all the entries and tries to
delete the pdev->msi_irq_groups (which hasn't been set to NULL).
However the pointers in the SysFS are already freed and we hit an
NULL pointer further on when 'strlen' is attempted on a freed pointer.
The patch adds a simple check in the XEN_PCI_OP_enable_msix to guard
against that. The check for msi_enabled is not stricly neccessary.
This is part of XSA-157
CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
MSI-X enabled
The guest sequence of:
a) XEN_PCI_OP_enable_msi
b) XEN_PCI_OP_enable_msi
c) XEN_PCI_OP_disable_msi
results in hitting an BUG_ON condition in the msi.c code.
The MSI code uses an dev->msi_list to which it adds MSI entries.
Under the above conditions an BUG_ON() can be hit. The device
passed in the guest MUST have MSI capability.
The a) adds the entry to the dev->msi_list and sets msi_enabled.
The b) adds a second entry but adding in to SysFS fails (duplicate entry)
and deletes all of the entries from msi_list and returns (with msi_enabled
is still set). c) pci_disable_msi passes the msi_enabled checks and hits:
BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));
and blows up.
The patch adds a simple check in the XEN_PCI_OP_enable_msi to guard
against that. The check for msix_enabled is not stricly neccessary.
This is part of XSA-157.
CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
Double fetch vulnerabilities that happen when a variable is
fetched twice from shared memory but a security check is only
performed the first time.
The xen_pcibk_do_op function performs a switch statements on the op->cmd
value which is stored in shared memory. Interestingly this can result
in a double fetch vulnerability depending on the performed compiler
optimization.
This patch fixes it by saving the xen_pci_op command before
processing it. We also use 'barrier' to make sure that the
compiler does not perform any optimization.
This is part of XSA155.
CC: stable@vger.kernel.org
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
The copy of the ring request was lacking a following barrier(),
potentially allowing the compiler to optimize the copy away.
Use RING_COPY_REQUEST() to ensure the request is copied to local
memory.
This is part of XSA155.
CC: stable@vger.kernel.org
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|
When a CPU is offlined, there may be unprocessed events on a port for
that CPU. If the port is subsequently reused on a different CPU, it
could be in an unexpected state with the link bit set, resulting in
interrupts being missed. Fix this by consuming any unprocessed events
for a particular CPU when that CPU dies.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Cc: <stable@vger.kernel.org> # 3.14+
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel:
- Fix gntdev and numa balancing.
- Fix x86 boot crash due to unallocated legacy irq descs.
- Fix overflow in evtchn device when > 1024 event channels.
* tag 'for-linus-4.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/evtchn: dynamically grow pending event channel ring
xen/events: Always allocate legacy interrupts on PV guests
xen/gntdev: Grant maps should not be subject to NUMA balancing
|
|
If more than 1024 event channels are bound to a evtchn device then it
possible (even with well behaved applications) for the ring to
overflow and events to be lost (reported as an -EFBIG error).
Dynamically increase the size of the ring so there is always enough
space for all bound events. Well behaved applicables that only unmask
events after draining them from the ring can thus no longer lose
events.
However, an application could unmask an event before draining it,
allowing multiple entries per port to accumulate in the ring, and a
overflow could still occur. So the overflow detection and reporting
is retained.
The ring size is initially only 64 entries so the common use case of
an application only binding a few events will use less memory than
before. The ring size may grow to 512 KiB (enough for all 2^17
possible channels). This order 7 kmalloc() may fail due to memory
fragmentation, so we fall back to trying vmalloc().
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
|
|
After commit 8c058b0b9c34 ("x86/irq: Probe for PIC presence before
allocating descs for legacy IRQs") early_irq_init() will no longer
preallocate descriptors for legacy interrupts if PIC does not
exist, which is the case for Xen PV guests.
Therefore we may need to allocate those descriptors ourselves.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
Doing so will cause the grant to be unmapped and then, during
fault handling, the fault to be mistakenly treated as NUMA hint
fault.
In addition, even if those maps could partcipate in NUMA
balancing, it wouldn't provide any benefit since we are unable
to determine physical page's node (even if/when VNUMA is
implemented).
Marking grant maps' VMAs as VM_IO will exclude them from being
part of NUMA balancing.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
Pull SCSI target updates from Nicholas Bellinger:
"This series contains HCH's changes to absorb configfs attribute
->show() + ->store() function pointer usage from it's original
tree-wide consumers, into common configfs code.
It includes usb-gadget, target w/ drivers, netconsole and ocfs2
changes to realize the improved simplicity, that now renders the
original include/target/configfs_macros.h CPP magic for fabric drivers
and others, unnecessary and obsolete.
And with common code in place, new configfs attributes can be added
easier than ever before.
Note, there are further improvements in-flight from other folks for
v4.5 code in configfs land, plus number of target fixes for post -rc1
code"
In the meantime, a new user of the now-removed old configfs API came in
through the char/misc tree in commit 7bd1d4093c2f ("stm class: Introduce
an abstraction for System Trace Module devices").
This merge resolution comes from Alexander Shishkin, who updated his stm
class tracing abstraction to account for the removal of the old
show_attribute and store_attribute methods in commit 517982229f78
("configfs: remove old API") from this pull. As Alexander says about
that patch:
"There's no need to keep an extra wrapper structure per item and the
awkward show_attribute/store_attribute item ops are no longer needed.
This patch converts policy code to the new api, all the while making
the code quite a bit smaller and easier on the eyes.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>"
That patch was folded into the merge so that the tree should be fully
bisectable.
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (23 commits)
configfs: remove old API
ocfs2/cluster: use per-attribute show and store methods
ocfs2/cluster: move locking into attribute store methods
netconsole: use per-attribute show and store methods
target: use per-attribute show and store methods
spear13xx_pcie_gadget: use per-attribute show and store methods
dlm: use per-attribute show and store methods
usb-gadget/f_serial: use per-attribute show and store methods
usb-gadget/f_phonet: use per-attribute show and store methods
usb-gadget/f_obex: use per-attribute show and store methods
usb-gadget/f_uac2: use per-attribute show and store methods
usb-gadget/f_uac1: use per-attribute show and store methods
usb-gadget/f_mass_storage: use per-attribute show and store methods
usb-gadget/f_sourcesink: use per-attribute show and store methods
usb-gadget/f_printer: use per-attribute show and store methods
usb-gadget/f_midi: use per-attribute show and store methods
usb-gadget/f_loopback: use per-attribute show and store methods
usb-gadget/ether: use per-attribute show and store methods
usb-gadget/f_acm: use per-attribute show and store methods
usb-gadget/f_hid: use per-attribute show and store methods
...
|
|
When offlining a cpu, instead of cpu_down, call device_offline, which
also takes care of updating the cpu.dev.offline field. This keeps the
sysfs file /sys/devices/system/cpu/cpuN/online, up to date. Also move
the call to disable_hotplug_cpu, because it makes more sense to have it
there.
We don't call device_online at cpu-hotplug time, because that would
immediately take the cpu online, while we want to retain the current
behaviour: the user needs to explicitly enable the cpu after it has
been hotplugged.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: konrad.wilk@oracle.com
CC: boris.ostrovsky@oracle.com
CC: david.vrabel@citrix.com
|
|
Build cpu_hotplug for ARM and ARM64 guests.
Rename arch_(un)register_cpu to xen_(un)register_cpu and provide an
empty implementation on ARM and ARM64. On x86 just call
arch_(un)register_cpu as we are already doing.
Initialize cpu_hotplug on ARM.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
|
|
The PV ring may use multiple grants and expect them to be mapped
contiguously in the virtual memory.
Although, the current code is relying on a Linux page will be mapped to
a single grant. On build where Linux is using a different page size than
the grant (i.e other than 4KB), the grant will always be mapped on the
first 4KB of each Linux page which make the final ring not contiguous in
the memory.
This can be fixed by mapping multiple grant in a same Linux page.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
With the 64KB page granularity support on ARM64, a Linux page may be
split accross multiple grant.
Currently we have the helper gnttab_foreach_grant_in_grant to break a
Linux page based on an offset and a len, but it doesn't fit when we only
have a number of grants in hand.
Introduce a new helper which take an array of Linux page and a number of
grant and will figure out the address of each grant.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
Linux may use a different page size than the size of grant. So make
clear that the order is actually in number of grant.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
The type of the item in frame_list is xen_pfn_t which is not an unsigned
long on ARM but an uint64_t.
With the current computation, the size of frame_list will be 2 *
PAGE_SIZE rather than PAGE_SIZE.
I bet it's just mistake when the type has been switched from "unsigned
long" to "xen_pfn_t" in commit 965c0aaafe3e75d4e65cd4ec862915869bde3abd
"xen: balloon: use correct type for frame_list".
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
Swiotlb is used on ARM64 to support DMA on platform where devices are
not protected by an SMMU. Furthermore it's only enabled for DOM0.
While Xen is always using 4KB page granularity in the stage-2 page table,
Linux ARM64 may either use 4KB or 64KB. This means that a Linux page
can be spanned accross multiple Xen page.
The Swiotlb code has to validate that the buffer used for DMA is
physically contiguous in the memory. As a Linux page can't be shared
between local memory and foreign page by design (the balloon code always
removing entirely a Linux page), the changes in the code are very
minimal because we only need to check the first Xen PFN.
Note that it may be possible to optimize the function
check_page_physically_contiguous to avoid looping over every Xen PFN
for local memory. Although I will let this optimization for a follow-up.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
With 64KB page granularity support, the frame number will be different.
It will be easier to modify the behavior in a single place rather than
in each caller.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
The hypercall interface (as well as the toolstack) is always using 4KB
page granularity. When the toolstack is asking for mapping a series of
guest PFN in a batch, it expects to have the page map contiguously in
its virtual memory.
When Linux is using 64KB page granularity, the privcmd driver will have
to map multiple Xen PFN in a single Linux page.
Note that this solution works on page granularity which is a multiple of
4KB.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
The Xen interface is using 4KB page granularity. This means that each
grant is 4KB.
The current implementation allocates a Linux page per grant. On Linux
using 64KB page granularity, only the first 4KB of the page will be
used.
We could decrease the memory wasted by sharing the page with multiple
grant. It will require some care with the {Set,Clear}ForeignPage macro.
Note that no changes has been made in the x86 code because both Linux
and Xen will only use 4KB page granularity.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
Only use the first 4KB of the page to store the events channel info. It
means that we will waste 60KB every time we allocate page for:
* control block: a page is allocating per CPU
* event array: a page is allocating everytime we need to expand it
I think we can reduce the memory waste for the 2 areas by:
* control block: sharing between multiple vCPUs. Although it will
require some bookkeeping in order to not free the page when the CPU
goes offline and the other CPUs sharing the page still there
* event array: always extend the array event by 64K (i.e 16 4K
chunk). That would require more care when we fail to expand the
event channel.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
For ARM64 guests, Linux is able to support either 64K or 4K page
granularity. Although, the hypercall interface is always based on 4K
page granularity.
With 64K page granularity, a single page will be spread over multiple
Xen frame.
To avoid splitting the page into 4K frame, take advantage of the
extent_order field to directly allocate/free chunk of the Linux page
size.
Note that PVMMU is only used for PV guest (which is x86) and the page
granularity is always 4KB. Some BUILD_BUG_ON has been added to ensure
that because the code has not been modified.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
All the ring (xenstore, and PV rings) are always based on the page
granularity of Xen.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
pages
On ARM all dma-capable devices on a same platform may not be protected
by an IOMMU. The DMA requests have to use the BFN (i.e MFN on ARM) in
order to use correctly the device.
While the DOM0 memory is allocated in a 1:1 fashion (PFN == MFN), grant
mapping will screw this contiguous mapping.
When Linux is using 64KB page granularitary, the page may be split
accross multiple non-contiguous MFN (Xen is using 4KB page
granularity). Therefore a DMA request will likely fail.
Checking that a 64KB page is using contiguous MFN is tedious. For
now, always says that biovec are not mergeable.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
Currently, a grant is always based on the Xen page granularity (i.e
4KB). When Linux is using a different page granularity, a single page
will be split between multiple grants.
The new helpers will be in charge of splitting the Linux page into grants
and call a function given by the caller on each grant.
Also provide an helper to count the number of grants within a given
contiguous region.
Note that the x86/include/asm/xen/page.h is now including
xen/interface/grant_table.h rather than xen/grant_table.h. It's
necessary because xen/grant_table.h depends on asm/xen/page.h and will
break the compilation. Furthermore, only definition in
interface/grant_table.h is required.
Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
|
|
Pages returned by alloc_xenballooned_pages() will be used for grant
mapping which will call set_phys_to_machine() (in PV guests).
Ballooned pages are set as INVALID_P2M_ENTRY in the p2m and thus may
be using the (shared) missing tables and a subsequent
set_phys_to_machine() will need to allocate new tables.
Since the grant mapping may be done from a context that cannot sleep,
the p2m entries must already be allocated.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
|
|
alloc_xenballooned_pages() is used to get ballooned pages to back
foreign mappings etc. Instead of having to balloon out real pages,
use (if supported) hotplugged memory.
This makes more memory available to the guest and reduces
fragmentation in the p2m.
This is only enabled if the xen.balloon.hotplug_unpopulated sysctl is
set to 1. This sysctl defaults to 0 in case the udev rules to
automatically online hotplugged memory do not exist.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
---
v3:
- Add xen.balloon.hotplug_unpopulated sysctl to enable use of hotplug
for unpopulated pages.
|
|
All users of alloc_xenballoon_pages() wanted low memory pages, so
remove the option for high memory.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
|
|
Now that we track the total number of pages (included hotplugged
regions), it is easy to determine if more memory needs to be
hotplugged.
Add a new BP_WAIT state to signal that the balloon process needs to
wait until kicked by the memory add notifier (when the new section is
onlined by userspace).
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
---
v3:
- Return BP_WAIT if enough sections are already hotplugged.
v2:
- New BP_WAIT status after adding new memory sections.
|
|
The stats used for memory hotplug make no sense and are fiddled with
in odd ways. Remove them and introduce total_pages to track the total
number of pages (both populated and unpopulated) including those within
hotplugged regions (note that this includes not yet onlined pages).
This will be used in a subsequent commit (xen/balloon: only hotplug
additional memory if required) when deciding whether additional memory
needs to be hotplugged.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
|
|
Instead of placing hotplugged memory at the end of RAM (which may
conflict with PCI devices or reserved regions) use allocate_resource()
to get a new, suitably aligned resource that does not conflict.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
---
v3:
- Remove stale comment.
|