summaryrefslogtreecommitdiff
path: root/arch/powerpc/mm/mmu_context_book3s64.c
AgeCommit message (Collapse)Author
2017-08-23powerpc/mm: Optimize detection of thread local mm'sBenjamin Herrenschmidt
Instead of comparing the whole CPU mask every time, let's keep a counter of how many bits are set in the mask. Thus testing for a local mm only requires testing if that counter is 1 and the current CPU bit is set in the mask. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-23Merge branch 'fixes' into nextMichael Ellerman
There's a non-trivial dependency between some commits we want to put in next and the KVM prefetch work around that went into fixes. So merge fixes into next.
2017-08-03powerpc: Remove old unused icswx based coprocessor supportBenjamin Herrenschmidt
We have a whole pile of unused code to maintain the ACOP register, allocate coprocessor PIDs and handle ACOP faults. This mechanism was used for the HFI adapter on POWER7 which is dead and gone and whose driver never went upstream. It was used on some A2 core based stuff that also never saw the light of day. Take out all that code. There is still some POWER8 coprocessor code that uses icswx but it's kernel only and thus doesn't use any of that infrastructure. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-26powerpc/mm/radix: Workaround prefetch issue with KVMBenjamin Herrenschmidt
There's a somewhat architectural issue with Radix MMU and KVM. When coming out of a guest with AIL (Alternate Interrupt Location, ie, MMU enabled), we start executing hypervisor code with the PID register still containing whatever the guest has been using. The problem is that the CPU can (and will) then start prefetching or speculatively load from whatever host context has that same PID (if any), thus bringing translations for that context into the TLB, which Linux doesn't know about. This can cause stale translations and subsequent crashes. Fixing this in a way that is neither racy nor a huge performance impact is difficult. We could just make the host invalidations always use broadcast forms but that would hurt single threaded programs for example. We chose to fix it instead by partitioning the PID space between guest and host. This is possible because today Linux only use 19 out of the 20 bits of PID space, so existing guests will work if we make the host use the top half of the 20 bits space. We additionally add support for a property to indicate to Linux the size of the PID register which will be useful if we eventually have processors with a larger PID space available. There is still an issue with malicious guests purposefully setting the PID register to a value in the hosts PID range. Hopefully future HW can prevent that, but in the meantime, we handle it with a pair of kludges: - On the way out of a guest, before we clear the current VCPU in the PACA, we check the PID and if it's outside of the permitted range we flush the TLB for that PID. - When context switching, if the mm is "new" on that CPU (the corresponding bit was set for the first time in the mm cpumask), we check if any sibling thread is in KVM (has a non-NULL VCPU pointer in the PACA). If that is the case, we also flush the PID for that CPU (core). This second part is needed to handle the case where a process is migrated (or starts a new pthread) on a sibling thread of the CPU coming out of KVM, as there's a window where stale translations can exist before we detect it and flush them out. A future optimization could be added by keeping track of whether the PID has ever been used and avoid doing that for completely fresh PIDs. We could similarily mark PIDs that have been the subject of a global invalidation as "fresh". But for now this will do. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop unneeded include of kvm_book3s_asm.h] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-10powerpc/mm/radix: Synchronize updates to the process tableBenjamin Herrenschmidt
When writing to the process table, we need to ensure the store is visible to a subsequent access by the MMU. We assume we never have the PID active while doing the update, so a ptesync/isync pair should hopefully be a big enough hammer for our purpose. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-10powerpc/mm/radix: Properly clear process table entryBenjamin Herrenschmidt
On radix, the process table entry we want to clear when destroying a context is entry 0, not entry 1. This has no *immediate* consequence on Power9, but it can cause other bugs to become worse. Fixes: 7e381c0ff618 ("powerpc/mm/radix: Add mmu context handling callback for radix") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-03Merge branch 'fixes' into nextMichael Ellerman
Merge our fixes branch, a few of them are tripping people up while working on top of next, and we also have a dependency between the CXL fixes and new CXL code we want to merge into next.
2017-06-27powerpc: Only do ERAT invalidate on radix context switch on P9 DD1Benjamin Herrenschmidt
From: Michael Neuling <mikey@neuling.org> On P9 (Nimbus) DD2 and later, in radix mode, the move to the PID register will implicitly invalidate the user space ERAT entries and leave the kernel ones alone. Thus the only thing needed is an isync() to synchronize this with subsequent uaccess's Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-06-08powerpc/mm/4k: Limit 4k page size config to 64TB virtual address spaceAneesh Kumar K.V
Supporting 512TB requires us to do a order 3 allocation for level 1 page table (pgd). This results in page allocation failures with certain workloads. For now limit 4k linux page size config to 64TB. Fixes: f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB") Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-04powerpc/powernv: Introduce address translation services for Nvlink2Alistair Popple
Nvlink2 supports address translation services (ATS) allowing devices to request address translations from an mmu known as the nest MMU which is setup to walk the CPU page tables. To access this functionality certain firmware calls are required to setup and manage hardware context tables in the nvlink processing unit (NPU). The NPU also manages forwarding of TLB invalidates (known as address translation shootdowns/ATSDs) to attached devices. This patch exports several methods to allow device drivers to register a process id (PASID/PID) in the hardware tables and to receive notification of when a device should stop issuing address translation requests (ATRs). It also adds a fault handler to allow device drivers to demand fault pages in. Signed-off-by: Alistair Popple <alistair@popple.id.au> [mpe: Fix up comment formatting, use flush_tlb_mm()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01powerpc/mm: Enable mappings above 128TBAneesh Kumar K.V
Not all user space application is ready to handle wide addresses. It's known that at least some JIT compilers use higher bits in pointers to encode their information. It collides with valid pointers with 512TB addresses and leads to crashes. To mitigate this, we are not going to allocate virtual address space above 128TB by default. But userspace can ask for allocation from full address space by specifying hint address (with or without MAP_FIXED) above 128TB. If hint address set above 128TB, but MAP_FIXED is not specified, we try to look for unmapped area by specified address. If it's already occupied, we look for unmapped area in *full* address space, rather than from 128TB window. This approach helps to easily make application's memory allocator aware about large address space without manually tracking allocated virtual address space. This is going to be a per mmap decision. ie, we can have some mmaps with larger addresses and other that do not. A sample memory layout looks like: 10000000-10010000 r-xp 00000000 fc:00 9057045 /home/max_addr_512TB 10010000-10020000 r--p 00000000 fc:00 9057045 /home/max_addr_512TB 10020000-10030000 rw-p 00010000 fc:00 9057045 /home/max_addr_512TB 10029630000-10029660000 rw-p 00000000 00:00 0 [heap] 7fff834a0000-7fff834b0000 rw-p 00000000 00:00 0 7fff834b0000-7fff83670000 r-xp 00000000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff83670000-7fff83680000 r--p 001b0000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff83680000-7fff83690000 rw-p 001c0000 fc:00 9177190 /lib/powerpc64le-linux-gnu/libc-2.23.so 7fff83690000-7fff836a0000 rw-p 00000000 00:00 0 7fff836a0000-7fff836c0000 r-xp 00000000 00:00 0 [vdso] 7fff836c0000-7fff83700000 r-xp 00000000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fff83700000-7fff83710000 r--p 00030000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fff83710000-7fff83720000 rw-p 00040000 fc:00 9177193 /lib/powerpc64le-linux-gnu/ld-2.23.so 7fffdccf0000-7fffdcd20000 rw-p 00000000 00:00 0 [stack] 1000000000000-1000000010000 rw-p 00000000 00:00 0 1ffff83710000-1ffff83720000 rw-p 00000000 00:00 0 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01powerpc/pseries: Skip using reserved virtual address rangeAneesh Kumar K.V
Now that we use all the available virtual address range, we need to make sure we don't generate VSID such that it overlaps with the reserved vsid range. Reserved vsid range include the virtual address range used by the adjunct partition and also the VRMA virtual segment. We find the context value that can result in generating such a VSID and reserve it early in boot. We don't look at the adjunct range, because for now we disable the adjunct usage in a Linux LPAR via CAS interface. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Rewrite hash__reserve_context_id(), move the rest into pseries] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-04-01powerpc/mm: Add addr_limit to mm_context and use it to derive max slice indexAneesh Kumar K.V
In the followup patch, we will increase the slice array size to handle 512TB range, but will limit the max addr to 128TB. Avoid doing unnecessary computation and avoid doing slice mask related operation above address limit. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/hash: Support 68 bit VAAneesh Kumar K.V
Inorder to support large effective address range (512TB), we want to increase the virtual address bits to 68. But we do have platforms like p4 and p5 that can only do 65 bit VA. We support those platforms by limiting context bits on them to 16. The protovsid -> vsid conversion is verified to work with both 65 and 68 bit va values. I also documented the restrictions in a table format as part of code comments. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/hash: Use context ids 1-4 for the kernelAneesh Kumar K.V
Currently we use the top 4 context ids (0x7fffc-0x7ffff) for the kernel. Kernel VSIDs are built using these top context values and effective the segement ID. In subsequent patches we want to increase the max effective address to 512TB. We will achieve that by increasing the effective segment IDs there by increasing virtual address range. We will be switching to a 68bit virtual address in the following patch. But platforms like Power4 and Power5 only support a 65 bit virtual address. We will handle that by limiting the context bits to 16 instead of 19 on those platforms. That means the max context id will have a different value on different platforms. So that we don't have to deal with the kernel context ids changing between different platforms, move the kernel context ids down to use context ids 1-4. We can't use segment 0 of context-id 0, because that maps to VSID 0, which we want to keep as invalid, so we avoid context-id 0 entirely. Similarly we can't use the last segment of the maximum context, so we avoid it too. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Switch from 0-3 to 1-4 so VSID=0 remains invalid] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm: Split radix vs hash mm context initialisationMichael Ellerman
Complete the split of the radix vs hash mm context initialisation. This is mostly code movement, with the exception that we now limit the context allocation to PRTB_ENTRIES - 1 on radix. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/hash: Pull hash constants into hash__alloc_context_id()Michael Ellerman
The min and max context id values used in alloc_context_id() are currently the right values for use on hash, and happen to also be safe for use on radix. But we need to change that in a subsequent patch, so make the min/max ids parameters and pull the hash values into hsah__alloc_context_id(). Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-31powerpc/mm/hash: Abstract context id allocation for KVMMichael Ellerman
KVM wants to be able to allocate an MMU context id, which it does currently by calling __init_new_context(). We're about to rework that code, so provide a wrapper for KVM so it can not worry about the details. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdownAlexey Kardashevskiy
At the moment the userspace tool is expected to request pinning of the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present. When the userspace process finishes, all the pinned pages need to be put; this is done as a part of the userspace memory context (MM) destruction which happens on the very last mmdrop(). This approach has a problem that a MM of the userspace process may live longer than the userspace process itself as kernel threads use userspace process MMs which was runnning on a CPU where the kernel thread was scheduled to. If this happened, the MM remains referenced until this exact kernel thread wakes up again and releases the very last reference to the MM, on an idle system this can take even hours. This moves preregistered regions tracking from MM to VFIO; insteads of using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is added so each container releases regions which it has pre-registered. This changes the userspace interface to return EBUSY if a memory region is already registered in a container. However it should not have any practical effect as the only userspace tool available now does register memory region once per container anyway. As tce_iommu_register_pages/tce_iommu_unregister_pages are called under container->lock, this does not need additional locking. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-12-02powerpc/iommu: Pass mm_struct to init/cleanup helpersAlexey Kardashevskiy
We are going to get rid of @current references in mmu_context_boos3s64.c and cache mm_struct in the VFIO container. Since mm_context_t does not have reference counting, we will be using mm_struct which does have the reference counter. This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather than mm_context_t (which is embedded into mm). This should not cause any behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-17powerpc/mm/radix: Update PID switch sequenceAneesh Kumar K.V
Update the PID switch as per ISA doc. slbia is needed in radix to invalidate any implementation specific lookaside information. We use the .long format due to build errors with the below compiler version. gcc (Ubuntu 5.3.1-14ubuntu2.1) 5.3.1 20160413 GNU assembler (GNU Binutils for Ubuntu) 2.26 CC arch/powerpc/mm//mmu_context_book3s64.o {standard input}: Assembler messages: {standard input}:506: Error: junk at end of line: `0x7' scripts/Makefile.build:291: recipe for target 'arch/powerpc/mm//mmu_context_book3s64.o' failed make[1]: *** [arch/powerpc/mm//mmu_context_book3s64.o] Error 1 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-06-17powerpc/mm/radix: Update Radix tree size as per ISA 3.0Aneesh Kumar K.V
ISA 3.0 updated it to be encoded as Radix tree size = 2^(RTS + 31). We have it encoded as 2^(RTS + 28). Add a helper with the correct encoding and use it instead of opencoding. Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines") Reviewed-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-11powerpc/mm/subpage: Initialise user psize correctlyAneesh Kumar K.V
As part of the radix support we switched Book3s64 to use a value of ~0 for MMU_NO_CONTEXT. That is because id 0 is special on radix. However that broke the logic in init_new_context(). The code there needs to differentiate between a newly allocated context and one inherited via fork. Previously it worked because a newly allocated context has an id of zero (because it was just memset() to zero), which used to match MMU_NO_CONTEXT, and therefore slice_mm_new_context() did the right thing. Instead check against a context.id value of zero instead of using slice_mm_new_context(). Without this patch we never call slice_set_user_psize(), and end up with a slice psize value of zero and we always end up using 4K HPTE. Fixes: 1a472c9dba6b ("powerpc/mm/radix: Add tlbflush routines") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-05-01powerpc/mm: Rename mmu_context_hash64.c to mmu_context_book3s64.cAneesh Kumar K.V
This file now contains both hash and radix specific code. Rename it to indicate this better. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>