summaryrefslogtreecommitdiff
path: root/arch/powerpc/mm
AgeCommit message (Collapse)Author
2018-10-04powerpc/nohash: fix undefined behaviour when testing page size supportDaniel Axtens
When enumerating page size definitions to check hardware support, we construct a constant which is (1U << (def->shift - 10)). However, the array of page size definitions is only initalised for various MMU_PAGE_* constants, so it contains a number of 0-initialised elements with def->shift == 0. This means we end up shifting by a very large number, which gives the following UBSan splat: ================================================================================ UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21 shift exponent 4294967286 is too large for 32-bit type 'unsigned int' CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-00045-ga604f927b012-dirty #6 Call Trace: [c00000000101bc20] [c000000000a13d54] .dump_stack+0xa8/0xec (unreliable) [c00000000101bcb0] [c0000000004f20a8] .ubsan_epilogue+0x18/0x64 [c00000000101bd30] [c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4 [c00000000101be20] [c000000000d21760] .early_init_mmu+0x1b4/0x5a0 [c00000000101bf10] [c000000000d1ba28] .early_setup+0x100/0x130 [c00000000101bf90] [c000000000000528] start_here_multiplatform+0x68/0x80 ================================================================================ Fix this by first checking if the element exists (shift != 0) before constructing the constant. Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03powerpc/mm: Don't report hugepage tables as memory leaks when using kmemleakChristophe Leroy
When a process allocates a hugepage, the following leak is reported by kmemleak. This is a false positive which is due to the pointer to the table being stored in the PGD as physical memory address and not virtual memory pointer. unreferenced object 0xc30f8200 (size 512): comm "mmap", pid 374, jiffies 4872494 (age 627.630s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<e32b68da>] huge_pte_alloc+0xdc/0x1f8 [<9e0df1e1>] hugetlb_fault+0x560/0x8f8 [<7938ec6c>] follow_hugetlb_page+0x14c/0x44c [<afbdb405>] __get_user_pages+0x1c4/0x3dc [<b8fd7cd9>] __mm_populate+0xac/0x140 [<3215421e>] vm_mmap_pgoff+0xb4/0xb8 [<c148db69>] ksys_mmap_pgoff+0xcc/0x1fc [<4fcd760f>] ret_from_syscall+0x0/0x38 See commit a984506c542e2 ("powerpc/mm: Don't report PUDs as memory leaks when using kmemleak") for detailed explanation. To fix that, this patch tells kmemleak to ignore the allocated hugepage table. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03arch/powerpc/mm/hash: validate the pte entries before handling the hash faultAneesh Kumar K.V
Make sure we are operating on THP and hugetlb entries in the respective hash fault handling routines. No functional change in this patch. If we walked the table wrongly before, we will retry the access. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03powerpc/mm/book3s: Check for pmd_large instead of pmd_trans_hugeAneesh Kumar K.V
Update few code paths to check for pmd_large. set_pmd_at: We want to use this to store swap pte at pmd level. For swap ptes we don't want to set H_PAGE_THP_HUGE. Hence check for pmd_large in set_pmd_at. This remove the false WARN_ON when using this with swap pmd entry. pmd_page: We don't really use them on pmd migration entries. But they can also work with migration entries and we don't differentiate at the pte level. Hence update pmd_page to work with pmd migration entries too __find_linux_pte: lockless page table walk need to handle pmd migration entries. pmd_trans_huge check will return false on them. We don't set thp = 1 for such entries, but update hpage_shift correctly. Without this we will walk pmd migration entries as a pte page pointer which is wrong. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03powerpc/mm/hugetlb/book3s: add _PAGE_PRESENT to hugepd pointer.Aneesh Kumar K.V
This make hugetlb directory pointer similar to other page able entries. A hugepd entry is identified by lack of _PAGE_PTE bit set and directory size stored in HUGEPD_SHIFT_MASK. We update that to also look at _PAGE_PRESENT Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03powerpc/mm/book3s: Update pmd_present to look at _PAGE_PRESENT bitAneesh Kumar K.V
With this patch we use 0x8000000000000000UL (_PAGE_PRESENT) to indicate a valid pgd/pud/pmd entry. We also switch the p**_present() to look at this bit. With pmd_present, we have a special case. We need to make sure we consider a pmd marked invalid during THP split as present. Right now we clear the _PAGE_PRESENT bit during a pmdp_invalidate. Inorder to consider this special case we add a new pte bit _PAGE_INVALID (mapped to _RPAGE_SW0). This bit is only used with _PAGE_PRESENT cleared. Hence we are not really losing a pte bit for this special case. pmd_present is also updated to look at _PAGE_INVALID. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-10-03Revert "convert SLB miss handlers to C" and subsequent commitsMichael Ellerman
This reverts commits: 5e46e29e6a97 ("powerpc/64s/hash: convert SLB miss handlers to C") 8fed04d0f6ae ("powerpc/64s/hash: remove user SLB data from the paca") 655deecf67b2 ("powerpc/64s/hash: SLB allocation status bitmaps") 2e1626744e8d ("powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setup") 89ca4e126a3f ("powerpc/64s/hash: Add a SLB preload cache") This series had a few bugs, and the fixes are not all trivial. So revert most of it for now. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-28Merge tag 'powerpc-4.19-3' of ↵Greg Kroah-Hartman
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Michael writes: "powerpc fixes for 4.19 #3 A reasonably big batch of fixes due to me being away for a few weeks. A fix for the TM emulation support on Power9, which could result in corrupting the guest r11 when running under KVM. Two fixes to the TM code which could lead to userspace GPR corruption if we take an SLB miss at exactly the wrong time. Our dynamic patching code had a bug that meant we could patch freed __init text, which could lead to corrupting userspace memory. csum_ipv6_magic() didn't work on little endian platforms since we optimised it recently. A fix for an endian bug when reading a device tree property telling us how many storage keys the machine has available. Fix a crash seen on some configurations of PowerVM when migrating the partition from one machine to another. A fix for a regression in the setup of our CPU to NUMA node mapping in KVM guests. A fix to our selftest Makefiles to make them work since a recent change to the shared Makefile logic." * tag 'powerpc-4.19-3' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: selftests/powerpc: Fix Makefiles for headers_install change powerpc/numa: Use associativity if VPHN hcall is successful powerpc/tm: Avoid possible userspace r1 corruption on reclaim powerpc/tm: Fix userspace r13 corruption powerpc/pseries: Fix unitialized timer reset on migration powerpc/pkeys: Fix reading of ibm, processor-storage-keys property powerpc: fix csum_ipv6_magic() on little endian platforms powerpc/powernv/ioda2: Reduce upper limit for DMA window size (again) powerpc: Avoid code patching freed init sections KVM: PPC: Book3S HV: Fix guest r11 corruption with POWER9 TM workarounds
2018-09-25powerpc/numa: Use associativity if VPHN hcall is successfulSrikar Dronamraju
Currently associativity is used to lookup node-id even if the preceding VPHN hcall failed. However this can cause CPU to be made part of the wrong node, (most likely to be node 0). This is because VPHN is not enabled on KVM guests. With 2ea6263 ("powerpc/topology: Get topology for shared processors at boot"), associativity is used to set to the wrong node. Hence KVM guest topology is broken. For example : A 4 node KVM guest before would have reported. [root@localhost ~]# numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 node 0 size: 1746 MB node 0 free: 1604 MB node 1 cpus: 4 5 6 7 node 1 size: 2044 MB node 1 free: 1765 MB node 2 cpus: 8 9 10 11 node 2 size: 2044 MB node 2 free: 1837 MB node 3 cpus: 12 13 14 15 node 3 size: 2044 MB node 3 free: 1903 MB node distances: node 0 1 2 3 0: 10 40 40 40 1: 40 10 40 40 2: 40 40 10 40 3: 40 40 40 10 Would now report: [root@localhost ~]# numactl -H available: 4 nodes (0-3) node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 node 0 size: 1746 MB node 0 free: 1244 MB node 1 cpus: node 1 size: 2044 MB node 1 free: 2032 MB node 2 cpus: 1 node 2 size: 2044 MB node 2 free: 2028 MB node 3 cpus: node 3 size: 2044 MB node 3 free: 2032 MB node distances: node 0 1 2 3 0: 10 40 40 40 1: 40 10 40 40 2: 40 40 10 40 3: 40 40 40 10 Fix this by skipping associativity lookup if the VPHN hcall failed. Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot") Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-24powerpc/pseries: Fix unitialized timer reset on migrationMichael Bringmann
After migration of a powerpc LPAR, the kernel executes code to update the system state to reflect new platform characteristics. Such changes include modifications to device tree properties provided to the system by PHYP. Property notifications received by the post_mobility_fixup() code are passed along to the kernel in general through a call to of_update_property() which in turn passes such events back to all modules through entries like the '.notifier_call' function within the NUMA module. When the NUMA module updates its state, it resets its event timer. If this occurs after a previous call to stop_topology_update() or on a system without VPHN enabled, the code runs into an unitialized timer structure and crashes. This patch adds a safety check along this path toward the problem code. An example crash log is as follows. ibmvscsi 30000081: Re-enabling adapter! ------------[ cut here ]------------ kernel BUG at kernel/time/timer.c:958! Oops: Exception in kernel mode, sig: 5 [#1] LE SMP NR_CPUS=2048 NUMA pSeries Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179 ... NIP mod_timer+0x4c/0x400 LR reset_topology_timer+0x40/0x60 Call Trace: 0xc0000003f9407830 (unreliable) reset_topology_timer+0x40/0x60 dt_update_callback+0x100/0x120 notifier_call_chain+0x90/0x100 __blocking_notifier_call_chain+0x60/0x90 of_property_notify+0x90/0xd0 of_update_property+0x104/0x150 update_dt_property+0xdc/0x1f0 pseries_devicetree_update+0x2d0/0x510 post_mobility_fixup+0x7c/0xf0 migration_store+0xa4/0xc0 kobj_attr_store+0x30/0x60 sysfs_kf_write+0x64/0xa0 kernfs_fop_write+0x16c/0x240 __vfs_write+0x40/0x200 vfs_write+0xc8/0x240 ksys_write+0x5c/0x100 system_call+0x58/0x6c Fixes: 5d88aa85c00b ("powerpc/pseries: Update CPU maps when device tree is updated") Cc: stable@vger.kernel.org # v3.10+ Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-21signal/powerpc: Use force_sig_fault where appropriateEric W. Biederman
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-21signal/powerpc: Specialize _exception_pkey for handling pkey exceptionsEric W. Biederman
Now that _exception no longer calls _exception_pkey it is no longer necessary to handle any signal with any si_code. All pkey exceptions are SIGSEGV with paired with SEGV_PKUERR. So just handle that case and remove the now unnecessary parameters from _exception_pkey. Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-21signal/powerpc: Remove pkey parameter from __bad_area_nosemaphoreEric W. Biederman
Now that bad_key_fault_exception no longer calls __bad_area_nosemaphore there is no reason for __bad_area_nosemaphore to handle pkeys. Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-21signal/powerpc: Call _exception_pkey directly from bad_key_fault_exceptionEric W. Biederman
This removes the need for other code paths to deal with pkey exceptions. Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-21signal/powerpc: Remove pkey parameter from __bad_areaEric W. Biederman
There are no callers of __bad_area that pass in a pkey parameter so it makes no sense to take one. Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-21signal/powerpc: Use force_sig_mceerr as appropriateEric W. Biederman
In do_sigbus isolate the mceerr signaling code and call force_sig_mceerr instead of falling through to the force_sig_info that works for all of the other signals. Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-20powerpc/pkeys: Fix reading of ibm, processor-storage-keys propertyThiago Jung Bauermann
scan_pkey_feature() uses of_property_read_u32_array() to read the ibm,processor-storage-keys property and calls be32_to_cpu() on the value it gets. The problem is that of_property_read_u32_array() already returns the value converted to the CPU byte order. The value of pkeys_total ends up more or less sane because there's a min() call in pkey_initialize() which reduces pkeys_total to 32. So in practice the kernel ignores the fact that the hypervisor reserved one key for itself (the device tree advertises 31 keys in my test VM). This is wrong, but the effect in practice is that when a process tries to allocate the 32nd key, it gets an -EINVAL error instead of -ENOSPC which would indicate that there aren't any keys available Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem") Cc: stable@vger.kernel.org # v4.16+ Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: Add a SLB preload cacheNicholas Piggin
When switching processes, currently all user SLBEs are cleared, and a few (exec_base, pc, and stack) are preloaded. In trivial testing with small apps, this tends to miss the heap and low 256MB segments, and it will also miss commonly accessed segments on large memory workloads. Add a simple round-robin preload cache that just inserts the last SLB miss into the head of the cache and preloads those at context switch time. Every 256 context switches, the oldest entry is removed from the cache to shrink the cache and require fewer slbmte if they are unused. Much more could go into this, including into the SLB entry reclaim side to track some LRU information etc, which would require a study of large memory workloads. But this is a simple thing we can do now that is an obvious win for common workloads. With the full series, process switching speed on the context_switch benchmark on POWER9/hash (with kernel speculation security masures disabled) increases from 140K/s to 178K/s (27%). POWER8 does not change much (within 1%), it's unclear why it does not see a big gain like POWER9. Booting to busybox init with 256MB segments has SLB misses go down from 945 to 69, and with 1T segments 900 to 21. These could almost all be eliminated by preloading a bit more carefully with ELF binary loading. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setupNicholas Piggin
This will be used by the SLB code in the next patch, but for now this sets the slb_addr_limit to the correct size for 32-bit tasks. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: SLB allocation status bitmapsNicholas Piggin
Add 32-entry bitmaps to track the allocation status of the first 32 SLB entries, and whether they are user or kernel entries. These are used to allocate free SLB entries first, before resorting to the round robin allocator. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: remove user SLB data from the pacaNicholas Piggin
User SLB mappig data is copied into the PACA from the mm->context so it can be accessed by the SLB miss handlers. After the C conversion, SLB miss handlers now run with relocation on, and user SLB misses are able to take recursive kernel SLB misses, so the user SLB mapping data can be removed from the paca and accessed directly. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: convert SLB miss handlers to CNicholas Piggin
This patch moves SLB miss handlers completely to C, using the standard exception handler macros to set up the stack and branch to C. This can be done because the segment containing the kernel stack is always bolted, so accessing it with relocation on will not cause an SLB exception. Arbitrary kernel memory may not be accessed when handling kernel space SLB misses, so care should be taken there. However user SLB misses can access any kernel memory, which can be used to move some fields out of the paca (in later patches). User SLB misses could quite easily reconcile IRQs and set up a first class kernel environment and exit via ret_from_except, however that doesn't seem to be necessary at the moment, so we only do that if a bad fault is encountered. [ Credit to Aneesh for bug fixes, error checks, and improvements to bad address handling, etc ] Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Since RFC: - Added MSR[RI] handling - Fixed up a register loss bug exposed by irq tracing (Aneesh) - Reject misses outside the defined kernel regions (Aneesh) - Added several more sanity checks and error handling (Aneesh), we may look at consolidating these tests and tightenig up the code but for a first pass we decided it's better to check carefully. Since v1: - Fixed SLB cache corruption (Aneesh) - Fixed untidy SLBE allocation "leak" in get_vsid error case - Now survives some stress testing on real hardware Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: Use POWER9 SLBIA IH=3 variant in switch_slbNicholas Piggin
POWER9 introduces SLBIA IH=3, which invalidates all SLB entries and associated lookaside information that have a class value of 1, which Linux assigns to user addresses. This matches what switch_slb wants, and allows a simple fast implementation that avoids the slb_cache complexity. As a side-effect, the POWER5 < DD2.1 SLB invalidation workaround is also avoided on POWER9. Process context switching rate is improved about 2.2% for a small process that hits the slb cache which is the best case for the current code. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: Use POWER6 SLBIA IH=1 variant in switch_slbNicholas Piggin
The SLBIA IH=1 hint will remove all non-zero SLBEs, but only invalidate ERAT entries associated with a class value of 1, for processors that support the hint (e.g., POWER6 and newer), which Linux assigns to user addresses. This prevents kernel ERAT entries from being invalidated when context switchig (if the thread faulted in more than 8 user SLBEs). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: remove the vmalloc segment from the bolted SLBNicholas Piggin
Remove the vmalloc segment from bolted SLBEs. This is not required to be bolted, and seems like it was added to help pre-load the SLB on context switch. However there are now other segments like the vmemmap segment and non-zero node memory that often take misses after a context switch, so it is better to solve this in a more general way. A subsequent change will track free SLB entries and uses those rather than round-robin overwrite valid entries, which makes it far less likely for kernel SLBEs to be evicted after they are installed. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: move POWER5 < DD2.1 slbie workaround where it is neededNicholas Piggin
The POWER5 < DD2.1 issue is that slbie needs to be issued more than once. It came in with this change: ChangeSet@1.1608, 2004-04-29 07:12:31-07:00, david@gibson.dropbear.id.au [PATCH] POWER5 erratum workaround Early POWER5 revisions (<DD2.1) have a problem requiring slbie instructions to be repeated under some circumstances. The patch below adds a workaround (patch made by Anton Blanchard). (aka. 3e4520f7605243abf66a7ccd3d2e49e48e8c0483 in the full history tree) The extra slbie in switch_slb is done even for the case where slbia is called (slb_flush_and_rebolt). I don't believe that is required because there are other slb_flush_and_rebolt callers which do not issue the workaround slbie, which would be broken if it was required. It also seems to be fine inside the isync with the first slbie, as it is in the kernel stack switch code. So move this workaround to where it is required. This is not much of an optimisation because this is the fast path, but it makes the code more understandable and neater. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Retain slbie_data initialisation to avoid compiler warning] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: avoid the POWER5 < DD2.1 slb invalidate workaround on POWER8/9Nicholas Piggin
I only have POWER8/9 to test, so just remove it for those. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/64s/hash: Fix stab_rr off by one initializationNicholas Piggin
This causes SLB alloation to start 1 beyond the start of the SLB. There is no real problem because after it wraps it stats behaving properly, it's just surprisig to see when looking at SLB traces. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-19powerpc/pseries: Dump the SLB contents on SLB MCE errors.Mahesh Salgaonkar
If we get a machine check exceptions due to SLB errors then dump the current SLB contents which will be very much helpful in debugging the root cause of SLB errors. Introduce an exclusive buffer per cpu to hold faulty SLB entries. In real mode mce handler saves the old SLB contents into this buffer accessible through paca and print it out later in virtual mode. With this patch the console will log SLB contents like below on SLB MCE errors: [ 507.297236] SLB contents of cpu 0x1 [ 507.297237] Last SLB entry inserted at slot 16 [ 507.297238] 00 c000000008000000 400ea1b217000500 [ 507.297239] 1T ESID= c00000 VSID= ea1b217 LLP:100 [ 507.297240] 01 d000000008000000 400d43642f000510 [ 507.297242] 1T ESID= d00000 VSID= d43642f LLP:110 [ 507.297243] 11 f000000008000000 400a86c85f000500 [ 507.297244] 1T ESID= f00000 VSID= a86c85f LLP:100 [ 507.297245] 12 00007f0008000000 4008119624000d90 [ 507.297246] 1T ESID= 7f VSID= 8119624 LLP:110 [ 507.297247] 13 0000000018000000 00092885f5150d90 [ 507.297247] 256M ESID= 1 VSID= 92885f5150 LLP:110 [ 507.297248] 14 0000010008000000 4009e7cb50000d90 [ 507.297249] 1T ESID= 1 VSID= 9e7cb50 LLP:110 [ 507.297250] 15 d000000008000000 400d43642f000510 [ 507.297251] 1T ESID= d00000 VSID= d43642f LLP:110 [ 507.297252] 16 d000000008000000 400d43642f000510 [ 507.297253] 1T ESID= d00000 VSID= d43642f LLP:110 [ 507.297253] ---------------------------------- [ 507.297254] SLB cache ptr value = 3 [ 507.297254] Valid SLB cache entries: [ 507.297255] 00 EA[0-35]= 7f000 [ 507.297256] 01 EA[0-35]= 1 [ 507.297257] 02 EA[0-35]= 1000 [ 507.297257] Rest of SLB cache entries: [ 507.297258] 03 EA[0-35]= 7f000 [ 507.297258] 04 EA[0-35]= 1 [ 507.297259] 05 EA[0-35]= 1000 [ 507.297260] 06 EA[0-35]= 12 [ 507.297260] 07 EA[0-35]= 7f000 Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-18powerpc: Avoid code patching freed init sectionsMichael Neuling
This stops us from doing code patching in init sections after they've been freed. In this chain: kvm_guest_init() -> kvm_use_magic_page() -> fault_in_pages_readable() -> __get_user() -> __get_user_nocheck() -> barrier_nospec(); We have a code patching location at barrier_nospec() and kvm_guest_init() is an init function. This whole chain gets inlined, so when we free the init section (hence kvm_guest_init()), this code goes away and hence should no longer be patched. We seen this as userspace memory corruption when using a memory checker while doing partition migration testing on powervm (this starts the code patching post migration via /sys/kernel/mobility/migration). In theory, it could also happen when using /sys/kernel/debug/powerpc/barrier_nospec. Cc: stable@vger.kernel.org # 4.13+ Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-09-12KVM: PPC: Avoid marking DMA-mapped pages dirty in real modeAlexey Kardashevskiy
At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm() which in turn reads the old TCE and if it was a valid entry, marks the physical page dirty if it was mapped for writing. Since it is in real mode, realmode_pfn_to_page() is used instead of pfn_to_page() to get the page struct. However SetPageDirty() itself reads the compound page head and returns a virtual address for the head page struct and setting dirty bit for that kills the system. This adds additional dirty bit tracking into the MM/IOMMU API for use in the real mode. Note that this does not change how VFIO and KVM (in virtual mode) set this bit. The KVM (real mode) changes include: - use the lowest bit of the cached host phys address to carry the dirty bit; - mark pages dirty when they are unpinned which happens when the preregistered memory is released which always happens in virtual mode; - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit; - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use in the new mm_iommu_ua_mark_dirty_rm() helper; - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only caller anyway) to reduce the real mode KVM and IOMMU knowledge across different subsystems. This removes realmode_pfn_to_page() as it is not used anymore. While we at it, remove some EXPORT_SYMBOL_GPL() as that code is for the real mode only and modules cannot call it anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-08-26Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds
Pull IDA updates from Matthew Wilcox: "A better IDA API: id = ida_alloc(ida, GFP_xxx); ida_free(ida, id); rather than the cumbersome ida_simple_get(), ida_simple_remove(). The new IDA API is similar to ida_simple_get() but better named. The internal restructuring of the IDA code removes the bitmap preallocation nonsense. I hope the net -200 lines of code is convincing" * 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits) ida: Change ida_get_new_above to return the id ida: Remove old API test_ida: check_ida_destroy and check_ida_alloc test_ida: Convert check_ida_conv to new API test_ida: Move ida_check_max test_ida: Move ida_check_leaf idr-test: Convert ida_check_nomem to new API ida: Start new test_ida module target/iscsi: Allocate session IDs from an IDA iscsi target: fix session creation failure handling drm/vmwgfx: Convert to new IDA API dmaengine: Convert to new IDA API ppc: Convert vas ID allocation to new IDA API media: Convert entity ID allocation to new IDA API ppc: Convert mmu context allocation to new IDA API Convert net_namespace to new IDA API cb710: Convert to new IDA API rsxx: Convert to new IDA API osd: Convert to new IDA API sd: Convert to new IDA API ...
2018-08-24Merge tag 'powerpc-4.19-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - An implementation for the newly added hv_ops->flush() for the OPAL hvc console driver backends, I forgot to apply this after merging the hvc driver changes before the merge window. - Enable all PCI bridges at boot on powernv, to avoid races when multiple children of a bridge try to enable it simultaneously. This is a workaround until the PCI core can be enhanced to fix the races. - A fix to query PowerVM for the correct system topology at boot before initialising sched domains, seen in some configurations to cause broken scheduling etc. - A fix for pte_access_permitted() on "nohash" platforms. - Two commits to fix SIGBUS when using remap_pfn_range() seen on Power9 due to a workaround when using the nest MMU (GPUs, accelerators). - Another fix to the VFIO code used by KVM, the previous fix had some bugs which caused guests to not start in some configurations. - A handful of other minor fixes. Thanks to: Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy, Hari Bathini, Luke Dashjr, Mahesh Salgaonkar, Nicholas Piggin, Paul Mackerras, Srikar Dronamraju. * tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mce: Fix SLB rebolting during MCE recovery path. KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pages powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transition powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid. powerpc/nohash: fix pte_access_permitted() powerpc/topology: Get topology for shared processors at boot powerpc64/ftrace: Include ftrace.h needed for enable/disable calls powerpc/powernv/pci: Work around races in PCI bridge enabling powerpc/fadump: cleanup crash memory ranges support powerpc/powernv: provide a console flush operation for opal hvc driver powerpc/traps: Avoid rate limit messages from show unhandled signals powerpc/64s: Fix PACA_IRQ_HARD_DIS accounting in idle_power4()
2018-08-23powerpc/mce: Fix SLB rebolting during MCE recovery path.Mahesh Salgaonkar
The commit e7e81847478 ("powerpc/64s: move machine check SLB flushing to mm/slb.c") introduced a bug in reloading bolted SLB entries. Unused bolted entries are stored with .esid=0 in the slb_shadow area, and that value is now used directly as the RB input to slbmte, which means the RB[52:63] index field is set to 0, which causes SLB entry 0 to be cleared. Fix this by storing the index bits in the unused bolted entries, which directs the slbmte to the right place. The SLB shadow area is also used by the hypervisor, but PAPR is okay with that, from LoPAPR v1.1, 14.11.1.3 SLB Shadow Buffer: Note: SLB is filled sequentially starting at index 0 from the shadow buffer ignoring the contents of RB field bits 52-63 Fixes: e7e81847478b ("powerpc/64s: move machine check SLB flushing to mm/slb.c") Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-23KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pagesPaul Mackerras
Commit 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the pinned physical page", 2018-07-17) added some checks to ensure that guest DMA mappings don't attempt to map more than the guest is entitled to access. However, errors in the logic mean that legitimate guest requests to map pages for DMA are being denied in some situations. Specifically, if the first page of the range passed to mm_iommu_get() is mapped with a normal page, and subsequent pages are mapped with transparent huge pages, we end up with mem->pageshift == 0. That means that the page size checks in mm_iommu_ua_to_hpa() and mm_iommu_up_to_hpa_rm() will always fail for every page in that region, and thus the guest can never map any memory in that region for DMA, typically leading to a flood of error messages like this: qemu-system-ppc64: VFIO_MAP_DMA: -22 qemu-system-ppc64: vfio_dma_map(0x10005f47780, 0x800000000000000, 0x10000, 0x7fff63ff0000) = -22 (Invalid argument) The logic errors in mm_iommu_get() are: (a) use of 'ua' not 'ua + (i << PAGE_SHIFT)' in the find_linux_pte() call (meaning that find_linux_pte() returns the pte for the first address in the range, not the address we are currently up to); (b) use of 'pageshift' as the variable to receive the hugepage shift returned by find_linux_pte() - for a normal page this gets set to 0, leading to us setting mem->pageshift to 0 when we conclude that the pte returned by find_linux_pte() didn't match the page we were looking at; (c) comparing 'compshift', which is a page order, i.e. log base 2 of the number of pages, with 'pageshift', which is a log base 2 of the number of bytes. To fix these problems, this patch introduces 'cur_ua' to hold the current user address and uses that in the find_linux_pte() call; introduces 'pteshift' to hold the hugepage shift found by find_linux_pte(); and compares 'pteshift' with 'compshift + PAGE_SHIFT' rather than 'compshift'. The patch also moves the local_irq_restore to the point after the PTE pointer returned by find_linux_pte() has been dereferenced because otherwise the PTE could change underneath us, and adds a check to avoid doing the find_linux_pte() call once mem->pageshift has been reduced to PAGE_SHIFT, as an optimization. Fixes: 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the pinned physical page") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-23powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transitionAneesh Kumar K.V
The Nest MMU workaround is only needed for RW upgrades. Avoid doing that for other PTE updates. We also avoid clearing the PTE while marking it invalid. This is because other page table walkers will find this PTE none and can result in unexpected behaviour due to that. Instead we clear _PAGE_PRESENT and set the software PTE bit _PAGE_INVALID. pte_present() is already updated to check for both bits. This makes sure page table walkers will find the PTE present and things like pte_pfn(pte) returns the right value. Based on an original patch from Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-21ppc: Convert mmu context allocation to new IDA APIMatthew Wilcox
ida_alloc_range is the perfect fit for this use case. Eliminates a custom spinlock, a call to ida_pre_get and a local check for the allocated ID exceeding a maximum. Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
2018-08-21powerpc/topology: Get topology for shared processors at bootSrikar Dronamraju
On a shared LPAR, Phyp will not update the CPU associativity at boot time. Just after the boot system does recognize itself as a shared LPAR and trigger a request for correct CPU associativity. But by then the scheduler would have already created/destroyed its sched domains. This causes - Broken load balance across Nodes causing islands of cores. - Performance degradation esp if the system is lightly loaded - dmesg to wrongly report all CPUs to be in Node 0. - Messages in dmesg saying borken topology. - With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain"), can cause rcu stalls at boot up. The sched_domains_numa_masks table which is used to generate cpumasks is only created at boot time just before creating sched domains and never updated. Hence, its better to get the topology correct before the sched domains are created. For example on 64 core Power 8 shared LPAR, dmesg reports Brought up 512 CPUs Node 0 CPUs: 0-511 Node 1 CPUs: Node 2 CPUs: Node 3 CPUs: Node 4 CPUs: Node 5 CPUs: Node 6 CPUs: Node 7 CPUs: Node 8 CPUs: Node 9 CPUs: Node 10 CPUs: Node 11 CPUs: ... BUG: arch topology borken the DIE domain not a subset of the NUMA domain BUG: arch topology borken the DIE domain not a subset of the NUMA domain numactl/lscpu output will still be correct with cores spreading across all nodes: Socket(s): 64 NUMA node(s): 12 Model: 2.0 (pvr 004d 0200) Model name: POWER8 (architected), altivec supported Hypervisor vendor: pHyp Virtualization type: para L1d cache: 64K L1i cache: 32K NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471 NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479 NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487 NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495 NUMA node4 CPU(s): 208-215,304-311,400-407,496-503 NUMA node5 CPU(s): 168-175,264-271,360-367,456-463 NUMA node6 CPU(s): 128-135,224-231,320-327,416-423 NUMA node7 CPU(s): 136-143,232-239,328-335,424-431 NUMA node8 CPU(s): 216-223,312-319,408-415,504-511 NUMA node9 CPU(s): 144-151,240-247,336-343,432-439 NUMA node10 CPU(s): 152-159,248-255,344-351,440-447 NUMA node11 CPU(s): 160-167,256-263,352-359,448-455 Currently on this LPAR, the scheduler detects 2 levels of Numa and created numa sched domains for all CPUs, but it finds a single DIE domain consisting of all CPUs. Hence it deletes all numa sched domains. To address this, detect the shared processor and update topology soon after CPUs are setup so that correct topology is updated just before scheduler creates sched domain. With the fix, dmesg reports: numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471 numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479 numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487 numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495 numa: Node 4 CPUs: 208-215 304-311 400-407 496-503 numa: Node 5 CPUs: 168-175 264-271 360-367 456-463 numa: Node 6 CPUs: 128-135 224-231 320-327 416-423 numa: Node 7 CPUs: 136-143 232-239 328-335 424-431 numa: Node 8 CPUs: 216-223 312-319 408-415 504-511 numa: Node 9 CPUs: 144-151 240-247 336-343 432-439 numa: Node 10 CPUs: 152-159 248-255 344-351 440-447 numa: Node 11 CPUs: 160-167 256-263 352-359 448-455 and lscpu also reports: Socket(s): 64 NUMA node(s): 12 Model: 2.0 (pvr 004d 0200) Model name: POWER8 (architected), altivec supported Hypervisor vendor: pHyp Virtualization type: para L1d cache: 64K L1i cache: 32K NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471 NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479 NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487 NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495 NUMA node4 CPU(s): 208-215,304-311,400-407,496-503 NUMA node5 CPU(s): 168-175,264-271,360-367,456-463 NUMA node6 CPU(s): 128-135,224-231,320-327,416-423 NUMA node7 CPU(s): 136-143,232-239,328-335,424-431 NUMA node8 CPU(s): 216-223,312-319,408-415,504-511 NUMA node9 CPU(s): 144-151,240-247,336-343,432-439 NUMA node10 CPU(s): 152-159,248-255,344-351,440-447 NUMA node11 CPU(s): 160-167,256-263,352-359,448-455 Reported-by: Manjunatha H R <manjuhr1@in.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [mpe: Trim / format change log] Tested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-17Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: - a few misc things - a few Y2038 fixes - ntfs fixes - arch/sh tweaks - ocfs2 updates - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits) mm/hmm.c: remove unused variables align_start and align_end fs/userfaultfd.c: remove redundant pointer uwq mm, vmacache: hash addresses based on pmd mm/list_lru: introduce list_lru_shrink_walk_irq() mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one() mm/list_lru.c: move locking from __list_lru_walk_one() to its caller mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node() mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP mm/sparse: delete old sparse_init and enable new one mm/sparse: add new sparse_init_nid() and sparse_init() mm/sparse: move buffer init/fini to the common place mm/sparse: use the new sparse buffer functions in non-vmemmap mm/sparse: abstract sparse buffer allocations mm/hugetlb.c: don't zero 1GiB bootmem pages mm, page_alloc: double zone's batchsize mm/oom_kill.c: document oom_lock mm/hugetlb: remove gigantic page support for HIGHMEM mm, oom: remove sleep from under oom_lock kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous() mm/cma: remove unsupported gfp_mask parameter from cma_alloc() ...
2018-08-17mm: convert return type of handle_mm_fault() caller to vm_fault_tSouptick Joarder
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t") In this patch all the caller of handle_mm_fault() are changed to return vm_fault_t type. Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Tony Luck <tony.luck@intel.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: James Hogan <jhogan@kernel.org> Cc: Ley Foon Tan <lftan@altera.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: James E.J. Bottomley <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: David S. Miller <davem@davemloft.net> Cc: Richard Weinberger <richard@nod.at> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-13powerpc/mm/book3s/radix: Add mapping statisticsAneesh Kumar K.V
Add statistics that show how memory is mapped within the kernel linear mapping. This is similar to commit 37cd944c8d8f ("s390/pgtable: add mapping statistics") We don't do this with Hash translation mode. Hash uses one size (mmu_linear_psize) to map the kernel linear mapping and we print the linear psize during boot as below. "Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24" A sample output looks like: DirectMap4k: 0 kB DirectMap64k: 18432 kB DirectMap2M: 1030144 kB DirectMap1G: 11534336 kB Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-13Merge branch 'fixes' into nextMichael Ellerman
Merge our fixes branch from the 4.18 cycle to resolve some minor conflicts.
2018-08-10powerpc/64s: move machine check SLB flushing to mm/slb.cNicholas Piggin
The machine check code that flushes and restores bolted segments in real mode belongs in mm/slb.c. This will also be used by pseries machine check and idle code in future changes. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-10powerpc/mm: remove warning about ‘type’ being setMathieu Malaterre
‘type’ is only used when CONFIG_DEBUG_HIGHMEM is set. So add a possibly unused tag to variable. Remove warning treated as error with W=1: arch/powerpc/mm/highmem.c:59:6: error: variable ‘type’ set but not used [-Werror=unused-but-set-variable] Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-08powerpc/Makefiles: Convert ifeq to ifdef where possibleRodrigo R. Galvao
In Makefiles if we're testing a CONFIG_FOO symbol for equality with 'y' we can instead just use ifdef. The latter reads easily, so convert to it where possible. Signed-off-by: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com> Reviewed-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-08powerpc/64s: Fix page table fragment refcount race vs speculative referencesNicholas Piggin
The page table fragment allocator uses the main page refcount racily with respect to speculative references. A customer observed a BUG due to page table page refcount underflow in the fragment allocator. This can be caused by the fragment allocator set_page_count stomping on a speculative reference, and then the speculative failure handler decrements the new reference, and the underflow eventually pops when the page tables are freed. Fix this by using a dedicated field in the struct page for the page table fragment allocator. Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage") Cc: stable@vger.kernel.org # v3.10+ Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-08-07powerpc/64s: free page table caches at exit_mmap timeNicholas Piggin
The kernel page table caches are tied to init_mm, so there is no more need for them after userspace is finished. destroy_context() gets called when we drop the last reference for an mm, which can be much later than the task exit due to other lazy mm references to it. We can free the page table cache pages on task exit because they only cache the userspace page tables and kernel threads should not access user space addresses. The mapping for kernel threads itself is maintained in init_mm and page table cache for that is attached to init_mm. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [mpe: Merge change log additions from Aneesh] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-30powerpc/44x: Mark mmu_init_secondary() as __initAlexey Spirkov
mmu_init_secondary() calls ppc44x_pin_tlb() which is marked __init, leading to a warning: The function mmu_init_secondary() references the function __init ppc44x_pin_tlb(). There's no CPU hotplug support on 44x so mmu_init_secondary() will only be called at boot. Therefore we should mark it as __init. Signed-off-by: Alexey Spirkov <alexeis@astrosoft.ru> [mpe: Flesh out change log details] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-30powerpc: remove unnecessary inclusion of asm/tlbflush.hChristophe Leroy
asm/tlbflush.h is only needed for: - using functions xxx_flush_tlb_xxx() - using MMU_NO_CONTEXT - including asm-generic/pgtable.h Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-30powerpc: remove superflous inclusions of asm/fixmap.hChristophe Leroy
Files not using fixmap consts or functions don't need asm/fixmap.h Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>