summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/cpu/common.c
AgeCommit message (Collapse)Author
2022-05-23Merge tag 'x86_cpu_for_v5.19_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 CPU feature updates from Borislav Petkov: - Remove a bunch of chicken bit options to turn off CPU features which are not really needed anymore - Misc fixes and cleanups * tag 'x86_cpu_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation: Add missing prototype for unpriv_ebpf_notify() x86/pm: Fix false positive kmemleak report in msr_build_context() x86/speculation/srbds: Do not try to turn mitigation off when not supported x86/cpu: Remove "noclflush" x86/cpu: Remove "noexec" x86/cpu: Remove "nosmep" x86/cpu: Remove CONFIG_X86_SMAP and "nosmap" x86/cpu: Remove "nosep" x86/cpu: Allow feature bit names from /proc/cpuinfo in clearcpuid=
2022-05-23Merge tag 'x86_sev_for_v5.19_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull AMD SEV-SNP support from Borislav Petkov: "The third AMD confidential computing feature called Secure Nested Paging. Add to confidential guests the necessary memory integrity protection against malicious hypervisor-based attacks like data replay, memory remapping and others, thus achieving a stronger isolation from the hypervisor. At the core of the functionality is a new structure called a reverse map table (RMP) with which the guest has a say in which pages get assigned to it and gets notified when a page which it owns, gets accessed/modified under the covers so that the guest can take an appropriate action. In addition, add support for the whole machinery needed to launch a SNP guest, details of which is properly explained in each patch. And last but not least, the series refactors and improves parts of the previous SEV support so that the new code is accomodated properly and not just bolted on" * tag 'x86_sev_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits) x86/entry: Fixup objtool/ibt validation x86/sev: Mark the code returning to user space as syscall gap x86/sev: Annotate stack change in the #VC handler x86/sev: Remove duplicated assignment to variable info x86/sev: Fix address space sparse warning x86/sev: Get the AP jump table address from secrets page x86/sev: Add missing __init annotations to SEV init routines virt: sevguest: Rename the sevguest dir and files to sev-guest virt: sevguest: Change driver name to reflect generic SEV support x86/boot: Put globals that are accessed early into the .data section x86/boot: Add an efi.h header for the decompressor virt: sevguest: Fix bool function returning negative value virt: sevguest: Fix return value check in alloc_shared_pages() x86/sev-es: Replace open-coded hlt-loop with sev_es_terminate() virt: sevguest: Add documentation for SEV-SNP CPUID Enforcement virt: sevguest: Add support to get extended report virt: sevguest: Add support to derive key virt: Add SEV-SNP guest driver x86/sev: Register SEV-SNP guest request platform device x86/sev: Provide support for SNP guest request NAEs ...
2022-05-21x86/speculation/mmio: Reuse SRBDS mitigation for SBDSPawan Gupta
The Shared Buffers Data Sampling (SBDS) variant of Processor MMIO Stale Data vulnerabilities may expose RDRAND, RDSEED and SGX EGETKEY data. Mitigation for this is added by a microcode update. As some of the implications of SBDS are similar to SRBDS, SRBDS mitigation infrastructure can be leveraged by SBDS. Set X86_BUG_SRBDS and use SRBDS mitigation. Mitigation is enabled by default; use srbds=off to opt-out. Mitigation status can be checked from below file: /sys/devices/system/cpu/vulnerabilities/srbds Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-05-21x86/speculation/mmio: Enumerate Processor MMIO Stale Data bugPawan Gupta
Processor MMIO Stale Data is a class of vulnerabilities that may expose data after an MMIO operation. For more details please refer to Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst Add the Processor MMIO Stale Data bug enumeration. A microcode update adds new bits to the MSR IA32_ARCH_CAPABILITIES, define them. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-04-11x86/tsx: Disable TSX development mode at bootPawan Gupta
A microcode update on some Intel processors causes all TSX transactions to always abort by default[*]. Microcode also added functionality to re-enable TSX for development purposes. With this microcode loaded, if tsx=on was passed on the cmdline, and TSX development mode was already enabled before the kernel boot, it may make the system vulnerable to TSX Asynchronous Abort (TAA). To be on safer side, unconditionally disable TSX development mode during boot. If a viable use case appears, this can be revisited later. [*]: Intel TSX Disable Update for Selected Processors, doc ID: 643557 [ bp: Drop unstable web link, massage heavily. ] Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/347bd844da3a333a9793c6687d4e4eb3b2419a3e.1646943780.git.pawan.kumar.gupta@linux.intel.com
2022-04-06x86/sev: Register GHCB memory when SEV-SNP is activeBrijesh Singh
The SEV-SNP guest is required by the GHCB spec to register the GHCB's Guest Physical Address (GPA). This is because the hypervisor may prefer that a guest uses a consistent and/or specific GPA for the GHCB associated with a vCPU. For more information, see the GHCB specification section "GHCB GPA Registration". [ bp: Cleanup comments. ] Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220307213356.2797205-18-brijesh.singh@amd.com
2022-04-04x86/cpu: Remove "noclflush"Borislav Petkov
Not really needed anymore and there's clearcpuid=. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220127115626.14179-7-bp@alien8.de
2022-04-04x86/cpu: Remove "nosmep"Borislav Petkov
There should be no need to disable SMEP anymore. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220127115626.14179-5-bp@alien8.de
2022-04-04x86/cpu: Remove CONFIG_X86_SMAP and "nosmap"Borislav Petkov
Those were added as part of the SMAP enablement but SMAP is currently an integral part of kernel proper and there's no need to disable it anymore. Rip out that functionality. Leave --uaccess default on for objtool as this is what objtool should do by default anyway. If still needed - clearcpuid=smap. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220127115626.14179-4-bp@alien8.de
2022-04-04x86/cpu: Remove "nosep"Borislav Petkov
That chicken bit was added by 4f88651125e2 ("[PATCH] i386: allow disabling X86_FEATURE_SEP at boot") but measuring int80 vsyscall performance on 32-bit doesn't matter anymore. If still needed, one can boot with clearcpuid=sep to disable that feature for testing. Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220127115626.14179-3-bp@alien8.de
2022-04-04x86/cpu: Allow feature bit names from /proc/cpuinfo in clearcpuid=Borislav Petkov
Having to give the X86_FEATURE array indices in order to disable a feature bit for testing is not really user-friendly. So accept the feature bit names too. Some feature bits don't have names so there the array indices are still accepted, of course. Clearing CPUID flags is not something which should be done in production so taint the kernel too. An exemplary cmdline would then be something like: clearcpuid=de,440,smca,succory,bmi1,3dnow ("succory" is wrong on purpose). And it says: [ ... ] Clearing CPUID bits: de 13:24 smca (unknown: succory) bmi1 3dnow [ Fix CONFIG_X86_FEATURE_NAMES=n build error as reported by the 0day robot: https://lore.kernel.org/r/202203292206.ICsY2RKX-lkp@intel.com ] Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220127115626.14179-2-bp@alien8.de
2022-03-15Merge branch 'x86/cpu' into x86/core, to resolve conflictsIngo Molnar
Conflicts: arch/x86/include/asm/cpufeatures.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-03-15x86/ibt: Disable IBT around firmwarePeter Zijlstra
Assume firmware isn't IBT clean and disable it across calls. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20220308154318.759989383@infradead.org
2022-03-15x86/ibt,kexec: Disable CET on kexecPeter Zijlstra
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20220308154318.641454603@infradead.org
2022-03-15x86/ibt: Add IBT feature, MSR and #CP handlingPeter Zijlstra
The bits required to make the hardware go.. Of note is that, provided the syscall entry points are covered with ENDBR, #CP doesn't need to be an IST because we'll never hit the syscall gap. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20220308154318.582331711@infradead.org
2022-02-01x86/cpu: Read/save PPIN MSR during initializationTony Luck
Currently, the PPIN (Protected Processor Inventory Number) MSR is read by every CPU that processes a machine check, CMCI, or just polls machine check banks from a periodic timer. This is not a "fast" MSR, so this adds to overhead of processing errors. Add a new "ppin" field to the cpuinfo_x86 structure. Read and save the PPIN during initialization. Use this copy in mce_setup() instead of reading the MSR. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220131230111.2004669-4-tony.luck@intel.com
2022-02-01x86/cpu: X86_FEATURE_INTEL_PPIN finally has a CPUID bitTony Luck
After nine generations of adding to model specific list of CPUs that support PPIN (Protected Processor Inventory Number) Intel allocated a CPUID bit to enumerate the MSRs. CPUID(EAX=7, ECX=1).EBX bit 0 enumerates presence of MSR_PPIN_CTL and MSR_PPIN. Add it to the "scattered" CPUID bits and add an entry to the ppin_cpuids[] x86_match_cpu() array to catch Intel CPUs that implement it. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220131230111.2004669-3-tony.luck@intel.com
2022-02-01x86/cpu: Merge Intel and AMD ppin_init() functionsTony Luck
The code to decide whether a system supports the PPIN (Protected Processor Inventory Number) MSR was cloned from the Intel implementation. Apart from the X86_FEATURE bit and the MSR numbers it is identical. Merge the two functions into common x86 code, but use x86_match_cpu() instead of the switch (c->x86_model) that was used by the old Intel code. No functional change. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220131230111.2004669-2-tony.luck@intel.com
2022-01-10Merge tag 'x86_cpu_for_v5.17_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: - Enable the short string copies for CPUs which support them, in copy_user_enhanced_fast_string() - Avoid writing MSR_CSTAR on Intel due to TDX guests raising a #VE trap * tag 'x86_cpu_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/lib: Add fast-short-rep-movs check to copy_user_enhanced_fast_string() x86/cpu: Don't write CSTAR MSR on Intel CPUs
2021-12-22x86/mm: Prevent early boot triple-faults with instrumentationBorislav Petkov
Commit in Fixes added a global TLB flush on the early boot path, after the kernel switches off of the trampoline page table. Compiler profiling options enabled with GCOV_PROFILE add additional measurement code on clang which needs to be initialized prior to use. The global flush in x86_64_start_kernel() happens before those initializations can happen, leading to accessing invalid memory. GCOV_PROFILE builds with gcc are still ok so this is clang-specific. The second issue this fixes is with KASAN: for a similar reason, kasan_early_init() needs to have happened before KASAN-instrumented functions are called. Therefore, reorder the flush to happen after the KASAN early init and prevent the compilers from adding profiling instrumentation to native_write_cr4(). Fixes: f154f290855b ("x86/mm/64: Flush global TLB on boot and AP bringup") Reported-by: "J. Bruce Fields" <bfields@fieldses.org> Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Carel Si <beibei.si@intel.com> Tested-by: "J. Bruce Fields" <bfields@fieldses.org> Link: https://lore.kernel.org/r/20211209144141.GC25654@xsang-OptiPlex-9020
2021-11-25x86/cpu: Don't write CSTAR MSR on Intel CPUsAndi Kleen
Intel CPUs do not support SYSCALL in 32-bit mode, but the kernel initializes MSR_CSTAR unconditionally. That MSR write is normally ignored by the CPU, but in a TDX guest it raises a #VE trap. Exclude Intel CPUs from the MSR_CSTAR initialization. [ tglx: Fixed the subject line and removed the redundant comment. ] Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20211119035803.4012145-1-sathyanarayanan.kuppuswamy@linux.intel.com
2021-11-01Merge tag 'x86_cpu_for_v5.16_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu updates from Borislav Petkov: - Start checking a CPUID bit on AMD Zen3 which states that the CPU clears the segment base when a null selector is written. Do the explicit detection on older CPUs, zen2 and hygon specifically, which have the functionality but do not advertize the CPUID bit. Factor in the presence of a hypervisor underneath the kernel and avoid doing the explicit check there which the HV might've decided to not advertize for migration safety reasons, or similar. - Add support for a new X86 CPU vendor: VORTEX. Needed for whitelisting those CPUs in the hardware vulnerabilities detection - Force the compiler to use rIP-relative addressing in the fallback path of static_cpu_has(), in order to avoid unnecessary register pressure * tag 'x86_cpu_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Fix migration safety with X86_BUG_NULL_SEL x86/CPU: Add support for Vortex CPUs x86/umip: Downgrade warning messages to debug loglevel x86/asm: Avoid adding register pressure for the init case in static_cpu_has() x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix
2021-11-01Merge tag 'x86-fpu-2021-11-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu updates from Thomas Gleixner: - Cleanup of extable fixup handling to be more robust, which in turn allows to make the FPU exception fixups more robust as well. - Change the return code for signal frame related failures from explicit error codes to a boolean fail/success as that's all what the calling code evaluates. - A large refactoring of the FPU code to prepare for adding AMX support: - Distangle the public header maze and remove especially the misnomed kitchen sink internal.h which is despite it's name included all over the place. - Add a proper abstraction for the register buffer storage (struct fpstate) which allows to dynamically size the buffer at runtime by flipping the pointer to the buffer container from the default container which is embedded in task_struct::tread::fpu to a dynamically allocated container with a larger register buffer. - Convert the code over to the new fpstate mechanism. - Consolidate the KVM FPU handling by moving the FPU related code into the FPU core which removes the number of exports and avoids adding even more export when AMX has to be supported in KVM. This also removes duplicated code which was of course unnecessary different and incomplete in the KVM copy. - Simplify the KVM FPU buffer handling by utilizing the new fpstate container and just switching the buffer pointer from the user space buffer to the KVM guest buffer when entering vcpu_run() and flipping it back when leaving the function. This cuts the memory requirements of a vCPU for FPU buffers in half and avoids pointless memory copy operations. This also solves the so far unresolved problem of adding AMX support because the current FPU buffer handling of KVM inflicted a circular dependency between adding AMX support to the core and to KVM. With the new scheme of switching fpstate AMX support can be added to the core code without affecting KVM. - Replace various variables with proper data structures so the extra information required for adding dynamically enabled FPU features (AMX) can be added in one place - Add AMX (Advanced Matrix eXtensions) support (finally): AMX is a large XSTATE component which is going to be available with Saphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD) which allows to trap the (first) use of an AMX related instruction, which has two benefits: 1) It allows the kernel to control access to the feature 2) It allows the kernel to dynamically allocate the large register state buffer instead of burdening every task with the the extra 8K or larger state storage. It would have been great to gain this kind of control already with AVX512. The support comes with the following infrastructure components: 1) arch_prctl() to - read the supported features (equivalent to XGETBV(0)) - read the permitted features for a task - request permission for a dynamically enabled feature Permission is granted per process, inherited on fork() and cleared on exec(). The permission policy of the kernel is restricted to sigaltstack size validation, but the syscall obviously allows further restrictions via seccomp etc. 2) A stronger sigaltstack size validation for sys_sigaltstack(2) which takes granted permissions and the potentially resulting larger signal frame into account. This mechanism can also be used to enforce factual sigaltstack validation independent of dynamic features to help with finding potential victims of the 2K sigaltstack size constant which is broken since AVX512 support was added. 3) Exception handling for #NM traps to catch first use of a extended feature via a new cause MSR. If the exception was caused by the use of such a feature, the handler checks permission for that feature. If permission has not been granted, the handler sends a SIGILL like the #UD handler would do if the feature would have been disabled in XCR0. If permission has been granted, then a new fpstate which fits the larger buffer requirement is allocated. In the unlikely case that this allocation fails, the handler sends SIGSEGV to the task. That's not elegant, but unavoidable as the other discussed options of preallocation or full per task permissions come with their own set of horrors for kernel and/or userspace. So this is the lesser of the evils and SIGSEGV caused by unexpected memory allocation failures is not a fundamentally new concept either. When allocation succeeds, the fpstate properties are filled in to reflect the extended feature set and the resulting sizes, the fpu::fpstate pointer is updated accordingly and the trap is disarmed for this task permanently. 4) Enumeration and size calculations 5) Trap switching via MSR_XFD The XFD (eXtended Feature Disable) MSR is context switched with the same life time rules as the FPU register state itself. The mechanism is keyed off with a static key which is default disabled so !AMX equipped CPUs have zero overhead. On AMX enabled CPUs the overhead is limited by comparing the tasks XFD value with a per CPU shadow variable to avoid redundant MSR writes. In case of switching from a AMX using task to a non AMX using task or vice versa, the extra MSR write is obviously inevitable. All other places which need to be aware of the variable feature sets and resulting variable sizes are not affected at all because they retrieve the information (feature set, sizes) unconditonally from the fpstate properties. 6) Enable the new AMX states Note, this is relatively new code despite the fact that AMX support is in the works for more than a year now. The big refactoring of the FPU code, which allowed to do a proper integration has been started exactly 3 weeks ago. Refactoring of the existing FPU code and of the original AMX patches took a week and has been subject to extensive review and testing. The only fallout which has not been caught in review and testing right away was restricted to AMX enabled systems, which is completely irrelevant for anyone outside Intel and their early access program. There might be dragons lurking as usual, but so far the fine grained refactoring has held up and eventual yet undetected fallout is bisectable and should be easily addressable before the 5.16 release. Famous last words... Many thanks to Chang Bae and Dave Hansen for working hard on this and also to the various test teams at Intel who reserved extra capacity to follow the rapid development of this closely which provides the confidence level required to offer this rather large update for inclusion into 5.16-rc1 * tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits) Documentation/x86: Add documentation for using dynamic XSTATE features x86/fpu: Include vmalloc.h for vzalloc() selftests/x86/amx: Add context switch test selftests/x86/amx: Add test cases for AMX state management x86/fpu/amx: Enable the AMX feature in 64-bit mode x86/fpu: Add XFD handling for dynamic states x86/fpu: Calculate the default sizes independently x86/fpu/amx: Define AMX state components and have it used for boot-time checks x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers x86/fpu/xstate: Add fpstate_realloc()/free() x86/fpu/xstate: Add XFD #NM handler x86/fpu: Update XFD state where required x86/fpu: Add sanity checks for XFD x86/fpu: Add XFD state to fpstate x86/msr-index: Add MSRs for XFD x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit x86/fpu: Reset permission and fpstate on exec() x86/fpu: Prepare fpu_clone() for dynamically enabled features x86/fpu/signal: Prepare for variable sigframe length x86/signal: Use fpu::__state_user_size for sigalt stack validation ...
2021-11-01Merge tag 'sched-core-2021-11-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Thomas Gleixner: - Revert the printk format based wchan() symbol resolution as it can leak the raw value in case that the symbol is not resolvable. - Make wchan() more robust and work with all kind of unwinders by enforcing that the task stays blocked while unwinding is in progress. - Prevent sched_fork() from accessing an invalid sched_task_group - Improve asymmetric packing logic - Extend scheduler statistics to RT and DL scheduling classes and add statistics for bandwith burst to the SCHED_FAIR class. - Properly account SCHED_IDLE entities - Prevent a potential deadlock when initial priority is assigned to a newly created kthread. A recent change to plug a race between cpuset and __sched_setscheduler() introduced a new lock dependency which is now triggered. Break the lock dependency chain by moving the priority assignment to the thread function. - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems. - Improve idle balancing in general and especially for NOHZ enabled systems. - Provide proper interfaces for live patching so it does not have to fiddle with scheduler internals. - Add cluster aware scheduling support. - A small set of tweaks for RT (irqwork, wait_task_inactive(), various scheduler options and delaying mmdrop) - The usual small tweaks and improvements all over the place * tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits) sched/fair: Cleanup newidle_balance sched/fair: Remove sysctl_sched_migration_cost condition sched/fair: Wait before decaying max_newidle_lb_cost sched/fair: Skip update_blocked_averages if we are defering load balance sched/fair: Account update_blocked_averages in newidle_balance cost x86: Fix __get_wchan() for !STACKTRACE sched,x86: Fix L2 cache mask sched/core: Remove rq_relock() sched: Improve wake_up_all_idle_cpus() take #2 irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support. sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ sched: Add cluster scheduler level for x86 sched: Add cluster scheduler level in core and related Kconfig for ARM64 topology: Represent clusters of CPUs within a die sched: Disable -Wunused-but-set-variable sched: Add wrapper for get_wchan() to keep task blocked x86: Fix get_wchan() to support the ORC unwinder proc: Use task_is_running() for wchan in /proc/$pid/stat ...
2021-10-21x86/cpu: Fix migration safety with X86_BUG_NULL_SELJane Malalane
Currently, Linux probes for X86_BUG_NULL_SEL unconditionally which makes it unsafe to migrate in a virtualised environment as the properties across the migration pool might differ. To be specific, the case which goes wrong is: 1. Zen1 (or earlier) and Zen2 (or later) in a migration pool 2. Linux boots on Zen2, probes and finds the absence of X86_BUG_NULL_SEL 3. Linux is then migrated to Zen1 Linux is now running on a X86_BUG_NULL_SEL-impacted CPU while believing that the bug is fixed. The only way to address the problem is to fully trust the "no longer affected" CPUID bit when virtualised, because in the above case it would be clear deliberately to indicate the fact "you might migrate to somewhere which has this behaviour". Zen3 adds the NullSelectorClearsBase CPUID bit to indicate that loading a NULL segment selector zeroes the base and limit fields, as well as just attributes. Zen2 also has this behaviour but doesn't have the NSCB bit. [ bp: Minor touchups. ] Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Signed-off-by: Borislav Petkov <bp@suse.de> CC: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20211021104744.24126-1-jane.malalane@citrix.com
2021-10-21x86/CPU: Add support for Vortex CPUsMarcos Del Sol Vives
DM&P devices were not being properly identified, which resulted in unneeded Spectre/Meltdown mitigations being applied. The manufacturer states that these devices execute always in-order and don't support either speculative execution or branch prediction, so they are not vulnerable to this class of attack. [1] This is something I've personally tested by a simple timing analysis on my Vortex86MX CPU, and can confirm it is true. Add identification for some devices that lack the CPUID product name call, so they appear properly on /proc/cpuinfo. ¹https://www.ssv-embedded.de/doks/infos/DMP_Ann_180108_Meltdown.pdf [ bp: Massage commit message. ] Signed-off-by: Marcos Del Sol Vives <marcos@orca.pet> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211017094408.1512158-1-marcos@orca.pet
2021-10-20x86/fpu: Replace the includes of fpu/internal.hThomas Gleixner
Now that the file is empty, fixup all references with the proper includes and delete the former kitchen sink. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20211015011540.001197214@linutronix.de
2021-10-15sched: Add cluster scheduler level for x86Tim Chen
There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is shared among a cluster of cores instead of being exclusive to one single core. To prevent oversubscription of L2 cache, load should be balanced between such L2 clusters, especially for tasks with no shared data. On benchmark such as SPECrate mcf test, this change provides a boost to performance especially on medium load system on Jacobsville. on a Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4 cores each, the benchmark number is as follow: Improvement over baseline kernel for mcf_r copies run time base rate 1 -0.1% -0.2% 6 25.1% 25.1% 12 18.8% 19.0% 24 0.3% 0.3% So this looks pretty good. In terms of the system's task distribution, some pretty bad clumping can be seen for the vanilla kernel without the L2 cluster domain for the 6 and 12 copies case. With the extra domain for cluster, the load does get evened out between the clusters. Note this patch isn't an universal win as spreading isn't necessarily a win, particually for those workload who can benefit from packing. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com
2021-10-06x86/entry: Clear X86_FEATURE_SMAP when CONFIG_X86_SMAP=nVegard Nossum
Commit 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity checks") added a warning if AC is set when in the kernel. Commit 662a0221893a3d ("x86/entry: Fix AC assertion") changed the warning to only fire if the CPU supports SMAP. However, the warning can still trigger on a machine that supports SMAP but where it's disabled in the kernel config and when running the syscall_nt selftest, for example: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 49 at irqentry_enter_from_user_mode CPU: 0 PID: 49 Comm: init Tainted: G T 5.15.0-rc4+ #98 e6202628ee053b4f310759978284bd8bb0ce6905 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:irqentry_enter_from_user_mode ... Call Trace: ? irqentry_enter ? exc_general_protection ? asm_exc_general_protection ? asm_exc_general_protectio IS_ENABLED(CONFIG_X86_SMAP) could be added to the warning condition, but even this would not be enough in case SMAP is disabled at boot time with the "nosmap" parameter. To be consistent with "nosmap" behaviour, clear X86_FEATURE_SMAP when !CONFIG_X86_SMAP. Found using entry-fuzz + satrandconfig. [ bp: Massage commit message. ] Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity checks") Fixes: 662a0221893a ("x86/entry: Fix AC assertion") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20211003223423.8666-1-vegard.nossum@oracle.com
2021-08-26x86/cpu: Add get_llc_id() helper functionKim Phillips
Factor out a helper function rather than export cpu_llc_id, which is needed in order to be able to build the AMD uncore driver as a module. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210817221048.88063-7-kim.phillips@amd.com
2021-07-07Merge tag 'x86-fpu-2021-07-07' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu updates from Thomas Gleixner: "Fixes and improvements for FPU handling on x86: - Prevent sigaltstack out of bounds writes. The kernel unconditionally writes the FPU state to the alternate stack without checking whether the stack is large enough to accomodate it. Check the alternate stack size before doing so and in case it's too small force a SIGSEGV instead of silently corrupting user space data. - MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never been updated despite the fact that the FPU state which is stored on the signal stack has grown over time which causes trouble in the field when AVX512 is available on a CPU. The kernel does not expose the minimum requirements for the alternate stack size depending on the available and enabled CPU features. ARM already added an aux vector AT_MINSIGSTKSZ for the same reason. Add it to x86 as well. - A major cleanup of the x86 FPU code. The recent discoveries of XSTATE related issues unearthed quite some inconsistencies, duplicated code and other issues. The fine granular overhaul addresses this, makes the code more robust and maintainable, which allows to integrate upcoming XSTATE related features in sane ways" * tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again x86/fpu/signal: Let xrstor handle the features to init x86/fpu/signal: Handle #PF in the direct restore path x86/fpu: Return proper error codes from user access functions x86/fpu/signal: Split out the direct restore code x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing() x86/fpu/signal: Sanitize the xstate check on sigframe x86/fpu/signal: Remove the legacy alignment check x86/fpu/signal: Move initial checks into fpu__restore_sig() x86/fpu: Mark init_fpstate __ro_after_init x86/pkru: Remove xstate fiddling from write_pkru() x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate() x86/fpu: Remove PKRU handling from switch_fpu_finish() x86/fpu: Mask PKRU from kernel XRSTOR[S] operations x86/fpu: Hook up PKRU into ptrace() x86/fpu: Add PKRU storage outside of task XSAVE buffer x86/fpu: Dont restore PKRU in fpregs_restore_userspace() x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi() x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs() x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs() ...
2021-06-28Merge tag 'x86-asm-2021-06-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Ingo Molnar: - Micro-optimize and standardize the do_syscall_64() calling convention - Make syscall entry flags clearing more conservative - Clean up syscall table handling - Clean up & standardize assembly macros, in preparation of FRED - Misc cleanups and fixes * tag 'x86-asm-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/asm: Make <asm/asm.h> valid on cross-builds as well x86/regs: Syscall_get_nr() returns -1 for a non-system call x86/entry: Split PUSH_AND_CLEAR_REGS into two submacros x86/syscall: Maximize MSR_SYSCALL_MASK x86/syscall: Unconditionally prototype {ia32,x32}_sys_call_table[] x86/entry: Reverse arguments to do_syscall_64() x86/entry: Unify definitions from <asm/calling.h> and <asm/ptrace-abi.h> x86/asm: Use _ASM_BYTES() in <asm/nops.h> x86/asm: Add _ASM_BYTES() macro for a .byte ... opcode sequence x86/asm: Have the __ASM_FORM macros handle commas in arguments
2021-06-23x86/cpu: Write the default PKRU value when enabling PKEThomas Gleixner
In preparation of making the PKRU management more independent from XSTATES, write the default PKRU value into the hardware right after enabling PKRU in CR4. This ensures that switch_to() and copy_thread() have the correct setting for init task and the per CPU idle threads right away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121455.622983906@linutronix.de
2021-06-23x86/cpu: Sanitize X86_FEATURE_OSPKEThomas Gleixner
X86_FEATURE_OSPKE is enabled first on the boot CPU and the feature flag is set. Secondary CPUs have to enable CR4.PKE as well and set their per CPU feature flag. That's ineffective because all call sites have checks for boot_cpu_data. Make it smarter and force the feature flag when PKU is enabled on the boot cpu which allows then to use cpu_feature_enabled(X86_FEATURE_OSPKE) all over the place. That either compiles the code out when PKEY support is disabled in Kconfig or uses a static_cpu_has() for the feature check which makes a significant difference in hotpaths, e.g. context switch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121455.305113644@linutronix.de
2021-06-23x86/fpu: Get rid of fpu__get_supported_xfeatures_mask()Thomas Gleixner
This function is really not doing what the comment advertises: "Find supported xfeatures based on cpu features and command-line input. This must be called after fpu__init_parse_early_param() is called and xfeatures_mask is enumerated." fpu__init_parse_early_param() does not exist anymore and the function just returns a constant. Remove it and fix the caller and get rid of further references to fpu__init_parse_early_param(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121451.816404717@linutronix.de
2021-06-23x86/pkeys: Revert a5eff7259790 ("x86/pkeys: Add PKRU value to init_fpstate")Thomas Gleixner
This cannot work and it's unclear how that ever made a difference. init_fpstate.xsave.header.xfeatures is always 0 so get_xsave_addr() will always return a NULL pointer, which will prevent storing the default PKRU value in init_fpstate. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121451.451391598@linutronix.de
2021-05-19x86/signal: Introduce helpers to get the maximum signal frame sizeChang S. Bae
Signal frames do not have a fixed format and can vary in size when a number of things change: supported XSAVE features, 32 vs. 64-bit apps, etc. Add support for a runtime method for userspace to dynamically discover how large a signal stack needs to be. Introduce a new variable, max_frame_size, and helper functions for the calculation to be used in a new user interface. Set max_frame_size to a system-wide worst-case value, instead of storing multiple app-specific values. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Len Brown <len.brown@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: H.J. Lu <hjl.tools@gmail.com> Link: https://lkml.kernel.org/r/20210518200320.17239-3-chang.seok.bae@intel.com
2021-05-18x86/cpu: Init AP exception handling from cpu_init_secondary()Borislav Petkov
SEV-ES guests require properly setup task register with which the TSS descriptor in the GDT can be located so that the IST-type #VC exception handler which they need to function properly, can be executed. This setup needs to happen before attempting to load microcode in ucode_cpu_init() on secondary CPUs which can cause such #VC exceptions. Simplify the machinery by running that exception setup from a new function cpu_init_secondary() and explicitly call cpu_init_exception_handling() for the boot CPU before cpu_init(). The latter prepares for fixing and simplifying the exception/IST setup on the boot CPU. There should be no functional changes resulting from this patch. [ tglx: Reworked it so cpu_init_exception_handling() stays seperate ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Lai Jiangshan <laijs@linux.alibaba.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/87k0o6gtvu.ffs@nanos.tec.linutronix.de
2021-05-12x86/syscall: Maximize MSR_SYSCALL_MASKH. Peter Anvin (Intel)
It is better to clear as many flags as possible when we do a system call entry, as opposed to the other way around. The fewer flags we keep, the lesser the possible interference between the kernel and user space. The flags changed are: - CF, PF, AF, ZF, SF, OF: these are arithmetic flags which affect branches, possibly speculatively. They should be cleared for the same reasons we now clear all GPRs on entry. - RF: suppresses a code breakpoint on the subsequent instruction. It is probably impossible to enter the kernel with RF set, but if it is somehow not, it would break a kernel debugger setting a breakpoint on the entry point. Either way, user space should not be able to control kernel behavior here. - ID: this flag has no direct effect (it is a scratch bit only.) However, there is no reason to retain the user space value in the kernel, and the standard should be to clear unless needed, not the other way around. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210510185316.3307264-5-hpa@zytor.com
2021-05-05x86/cpu: Remove write_tsc() and write_rdtscp_aux() wrappersSean Christopherson
Drop write_tsc() and write_rdtscp_aux(); the former has no users, and the latter has only a single user and is slightly misleading since the only in-kernel consumer of MSR_TSC_AUX is RDPID, not RDTSCP. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210504225632.1532621-3-seanjc@google.com
2021-05-05x86/cpu: Initialize MSR_TSC_AUX if RDTSCP *or* RDPID is supportedSean Christopherson
Initialize MSR_TSC_AUX with CPU node information if RDTSCP or RDPID is supported. This fixes a bug where vdso_read_cpunode() will read garbage via RDPID if RDPID is supported but RDTSCP is not. While no known CPU supports RDPID but not RDTSCP, both Intel's SDM and AMD's APM allow for RDPID to exist without RDTSCP, e.g. it's technically a legal CPU model for a virtual machine. Note, technically MSR_TSC_AUX could be initialized if and only if RDPID is supported since RDTSCP is currently not used to retrieve the CPU node. But, the cost of the superfluous WRMSR is negigible, whereas leaving MSR_TSC_AUX uninitialized is just asking for future breakage if someone decides to utilize RDTSCP. Fixes: a582c540ac1b ("x86/vdso: Use RDPID in preference to LSL when available") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210504225632.1532621-2-seanjc@google.com
2021-04-27Merge tag 'x86_core_for_v5.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 updates from Borislav Petkov: - Turn the stack canary into a normal __percpu variable on 32-bit which gets rid of the LAZY_GS stuff and a lot of code. - Add an insn_decode() API which all users of the instruction decoder should preferrably use. Its goal is to keep the details of the instruction decoder away from its users and simplify and streamline how one decodes insns in the kernel. Convert its users to it. - kprobes improvements and fixes - Set the maximum DIE per package variable on Hygon - Rip out the dynamic NOP selection and simplify all the machinery around selecting NOPs. Use the simplified NOPs in objtool now too. - Add Xeon Sapphire Rapids to list of CPUs that support PPIN - Simplify the retpolines by folding the entire thing into an alternative now that objtool can handle alternatives with stack ops. Then, have objtool rewrite the call to the retpoline with the alternative which then will get patched at boot time. - Document Intel uarch per models in intel-family.h - Make Sub-NUMA Clustering topology the default and Cluster-on-Die the exception on Intel. * tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits) x86, sched: Treat Intel SNC topology as default, COD as exception x86/cpu: Comment Skylake server stepping too x86/cpu: Resort and comment Intel models objtool/x86: Rewrite retpoline thunk calls objtool: Skip magical retpoline .altinstr_replacement objtool: Cache instruction relocs objtool: Keep track of retpoline call sites objtool: Add elf_create_undef_symbol() objtool: Extract elf_symbol_add() objtool: Extract elf_strtab_concat() objtool: Create reloc sections implicitly objtool: Add elf_create_reloc() helper objtool: Rework the elf_rebuild_reloc_section() logic objtool: Fix static_call list generation objtool: Handle per arch retpoline naming objtool: Correctly handle retpoline thunk calls x86/retpoline: Simplify retpolines x86/alternatives: Optimize optimize_nops() x86: Add insn_decode_kernel() x86/kprobes: Move 'inline' to the beginning of the kprobe_is_ss() declaration ...
2021-04-26Merge tag 'x86-splitlock-2021-04-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 bus lock detection updates from Thomas Gleixner: "Support for enhanced split lock detection: Newer CPUs provide a second mechanism to detect operations with lock prefix which go accross a cache line boundary. Such operations have to take bus lock which causes a system wide performance degradation when these operations happen frequently. The new mechanism is not using the #AC exception. It triggers #DB and is restricted to operations in user space. Kernel side split lock access can only be detected by the #AC based variant. Contrary to the #AC based mechanism the #DB based variant triggers _after_ the instruction was executed. The mechanism is CPUID enumerated and contrary to the #AC version which is based on the magic TEST_CTRL_MSR and model/family based enumeration on the way to become architectural" * tag 'x86-splitlock-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation/admin-guide: Change doc for split_lock_detect parameter x86/traps: Handle #DB for bus lock x86/cpufeatures: Enumerate #DB for bus lock detection
2021-03-28x86/traps: Handle #DB for bus lockFenghua Yu
Bus locks degrade performance for the whole system, not just for the CPU that requested the bus lock. Two CPU features "#AC for split lock" and "#DB for bus lock" provide hooks so that the operating system may choose one of several mitigation strategies. #AC for split lock is already implemented. Add code to use the #DB for bus lock feature to cover additional situations with new options to mitigate. split_lock_detect= #AC for split lock #DB for bus lock off Do nothing Do nothing warn Kernel OOPs Warn once per task and Warn once per task and and continues to run. disable future checking When both features are supported, warn in #AC fatal Kernel OOPs Send SIGBUS to user. Send SIGBUS to user When both features are supported, fatal in #AC ratelimit:N Do nothing Limit bus lock rate to N per second in the current non-root user. Default option is "warn". Hardware only generates #DB for bus lock detect when CPL>0 to avoid nested #DB from multiple bus locks while the first #DB is being handled. So no need to handle #DB for bus lock detected in the kernel. #DB for bus lock is enabled by bus lock detection bit 2 in DEBUGCTL MSR while #AC for split lock is enabled by split lock detection bit 29 in TEST_CTRL MSR. Both breakpoint and bus lock in the same instruction can trigger one #DB. The bus lock is handled before the breakpoint in the #DB handler. Delivery of #DB for bus lock in userspace clears DR6[11], which is set by the #DB handler right after reading DR6. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20210322135325.682257-3-fenghua.yu@intel.com
2021-03-28x86/process/64: Move cpu_current_top_of_stack out of TSSLai Jiangshan
cpu_current_top_of_stack is currently stored in TSS.sp1. TSS is exposed through the cpu_entry_area which is visible with user CR3 when PTI is enabled and active. This makes it a coveted fruit for attackers. An attacker can fetch the kernel stack top from it and continue next steps of actions based on the kernel stack. But it is actualy not necessary to be stored in the TSS. It is only accessed after the entry code switched to kernel CR3 and kernel GS_BASE which means it can be in any regular percpu variable. The reason why it is in TSS is historical (pre PTI) because TSS is also used as scratch space in SYSCALL_64 and therefore cache hot. A syscall also needs the per CPU variable current_task and eventually __preempt_count, so placing cpu_current_top_of_stack next to them makes it likely that they end up in the same cache line which should avoid performance regressions. This is not enforced as the compiler is free to place these variables, so these entry relevant variables should move into a data structure to make this enforceable. The seccomp_benchmark doesn't show any performance loss in the "getpid native" test result. Actually, the result changes from 93ns before to 92ns with this change when KPTI is disabled. The test is very stable and although the test doesn't show a higher degree of precision it gives enough confidence that moving cpu_current_top_of_stack does not cause a regression. [ tglx: Removed unneeded export. Massaged changelog ] Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210125173444.22696-2-jiangshanlai@gmail.com
2021-03-18x86: Fix various typos in commentsIngo Molnar
Fix ~144 single-word typos in arch/x86/ code comments. Doing this in a single commit should reduce the churn. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org
2021-03-08x86/stackprotector/32: Make the canary into a regular percpu variableAndy Lutomirski
On 32-bit kernels, the stackprotector canary is quite nasty -- it is stored at %gs:(20), which is nasty because 32-bit kernels use %fs for percpu storage. It's even nastier because it means that whether %gs contains userspace state or kernel state while running kernel code depends on whether stackprotector is enabled (this is CONFIG_X86_32_LAZY_GS), and this setting radically changes the way that segment selectors work. Supporting both variants is a maintenance and testing mess. Merely rearranging so that percpu and the stack canary share the same segment would be messy as the 32-bit percpu address layout isn't currently compatible with putting a variable at a fixed offset. Fortunately, GCC 8.1 added options that allow the stack canary to be accessed as %fs:__stack_chk_guard, effectively turning it into an ordinary percpu variable. This lets us get rid of all of the code to manage the stack canary GDT descriptor and the CONFIG_X86_32_LAZY_GS mess. (That name is special. We could use any symbol we want for the %fs-relative mode, but for CONFIG_SMP=n, gcc refuses to let us use any name other than __stack_chk_guard.) Forcibly disable stackprotector on older compilers that don't support the new options and turn the stack canary into a percpu variable. The "lazy GS" approach is now used for all 32-bit configurations. Also makes load_gs_index() work on 32-bit kernels. On 64-bit kernels, it loads the GS selector and updates the user GSBASE accordingly. (This is unchanged.) On 32-bit kernels, it loads the GS selector and updates GSBASE, which is now always the user base. This means that the overall effect is the same on 32-bit and 64-bit, which avoids some ifdeffery. [ bp: Massage commit message. ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/c0ff7dba14041c7e5d1cae5d4df052f03759bef3.1613243844.git.luto@kernel.org
2021-02-24Merge tag 'x86-entry-2021-02-24' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 irq entry updates from Thomas Gleixner: "The irq stack switching was moved out of the ASM entry code in course of the entry code consolidation. It ended up being suboptimal in various ways. This reworks the X86 irq stack handling: - Make the stack switching inline so the stackpointer manipulation is not longer at an easy to find place. - Get rid of the unnecessary indirect call. - Avoid the double stack switching in interrupt return and reuse the interrupt stack for softirq handling. - A objtool fix for CONFIG_FRAME_POINTER=y builds where it got confused about the stack pointer manipulation" * tag 'x86-entry-2021-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: objtool: Fix stack-swizzle for FRAME_POINTER=y um: Enforce the usage of asm-generic/softirq_stack.h x86/softirq/64: Inline do_softirq_own_stack() softirq: Move do_softirq_own_stack() to generic asm header softirq: Move __ARCH_HAS_DO_SOFTIRQ to Kconfig x86: Select CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK x86/softirq: Remove indirection in do_softirq_own_stack() x86/entry: Use run_sysvec_on_irqstack_cond() for XEN upcall x86/entry: Convert device interrupts to inline stack switching x86/entry: Convert system vectors to irq stack macro x86/irq: Provide macro for inlining irq stack switching x86/apic: Split out spurious handling code x86/irq/64: Adjust the per CPU irq stack pointer by 8 x86/irq: Sanitize irq stack tracking x86/entry: Fix instrumentation annotation
2021-02-10x86/irq/64: Adjust the per CPU irq stack pointer by 8Thomas Gleixner
The per CPU hardirq_stack_ptr contains the pointer to the irq stack in the form that it is ready to be assigned to [ER]SP so that the first push ends up on the top entry of the stack. But the stack switching on 64 bit has the following rules: 1) Store the current stack pointer (RSP) in the top most stack entry to allow the unwinder to link back to the previous stack 2) Set RSP to the top most stack entry 3) Invoke functions on the irq stack 4) Pop RSP from the top most stack entry (stored in #1) so it's back to the original stack. That requires all stack switching code to decrement the stored pointer by 8 in order to be able to store the current RSP and then set RSP to that location. That's a pointless exercise. Do the -8 adjustment right when storing the pointer and make the data type a void pointer to avoid confusion vs. the struct irq_stack data type which is on 64bit only used to declare the backing store. Move the definition next to the inuse flag so they likely end up in the same cache line. Sticking them into a struct to enforce it is a seperate change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210210002512.354260928@linutronix.de
2021-02-10x86/irq: Sanitize irq stack trackingThomas Gleixner
The recursion protection for hard interrupt stacks is an unsigned int per CPU variable initialized to -1 named __irq_count. The irq stack switching is only done when the variable is -1, which creates worse code than just checking for 0. When the stack switching happens it uses this_cpu_add/sub(1), but there is no reason to do so. It simply can use straight writes. This is a historical leftover from the low level ASM code which used inc and jz to make a decision. Rename it to hardirq_stack_inuse, make it a bool and use plain stores. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210210002512.228830141@linutronix.de