summaryrefslogtreecommitdiff
path: root/arch/arm64/include/asm/unistd.h
AgeCommit message (Collapse)Author
2024-07-10arm64: rework compat syscall macrosArnd Bergmann
The generated asm/unistd_compat_32.h header file now contains macros that can be used directly in the vdso and the signal trampolines, so remove the duplicate definitions. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-07-10arm64: generate 64-bit syscall.tblArnd Bergmann
Change the asm/unistd.h header for arm64 to no longer include asm-generic/unistd.h itself, but instead generate both the asm/unistd.h contents and the list of entry points using the syscall.tbl scripts that we use on most other architectures. Once his is done for the remaining architectures, the generic unistd.h header can be removed and the generated tbl file put in its place. The Makefile changes are more complex than they should be, I need a little help to improve those. Ideally this should be done in an architecture-independent way as well. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-07-10arm64: convert unistd_32.h to syscall.tbl formatArnd Bergmann
This is a straight conversion from the old asm/unistd32.h into the format used by 32-bit arm and most other architectures, calling scripts to generate the asm/unistd32.h header and a new asm/syscalls32.h headers. I used a semi-automated text replacement method to do the conversion, and then used 'vimdiff' to synchronize the whitespace and the (unused) names of the non-compat syscalls with the arm version. There are two differences between the generated syscalls names and the old version: - the old asm/unistd32.h contained only a __NR_sync_file_range2 entry, while the arm32 version also defines __NR_arm_sync_file_range with the same number. I added this duplicate back in asm/unistd32.h. - __NR__sysctl was removed from the arm64 file a while ago, but all the tables still contain it. This should probably get removed everywhere but I added it here for consistency. On top of that, the arm64 version does not contain any references to the 32-bit OABI syscalls that are not supported by arm64. If we ever want to share the file between arm32 and arm64, it would not be hard to add support for both in one file. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-05-23mseal: wire up mseal syscallJeff Xu
Patch series "Introduce mseal", v10. This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. Two system calls are involved in sealing the map: mmap() and mseal(). The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. MM perf benchmarks ================== This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs’ sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed. To measure the performance impact of this loop, two tests are developed. [8] The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results. The tests have roughly below sequence: for (i = 0; i < 1000, i++) create 1000 mappings (1 page per VMA) start the sampling for (j = 0; j < 1000, j++) mprotect one mapping stop and save the sample delete 1000 mappings calculates all samples. Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook. Based on the latest upstream code: The first test (measuring time) syscall__ vmas t t_mseal delta_ns per_vma % munmap__ 1 909 944 35 35 104% munmap__ 2 1398 1502 104 52 107% munmap__ 4 2444 2594 149 37 106% munmap__ 8 4029 4323 293 37 107% munmap__ 16 6647 6935 288 18 104% munmap__ 32 11811 12398 587 18 105% mprotect 1 439 465 26 26 106% mprotect 2 1659 1745 86 43 105% mprotect 4 3747 3889 142 36 104% mprotect 8 6755 6969 215 27 103% mprotect 16 13748 14144 396 25 103% mprotect 32 27827 28969 1142 36 104% madvise_ 1 240 262 22 22 109% madvise_ 2 366 442 76 38 121% madvise_ 4 623 751 128 32 121% madvise_ 8 1110 1324 215 27 119% madvise_ 16 2127 2451 324 20 115% madvise_ 32 4109 4642 534 17 113% The second test (measuring cpu cycle) syscall__ vmas cpu cmseal delta_cpu per_vma % munmap__ 1 1790 1890 100 100 106% munmap__ 2 2819 3033 214 107 108% munmap__ 4 4959 5271 312 78 106% munmap__ 8 8262 8745 483 60 106% munmap__ 16 13099 14116 1017 64 108% munmap__ 32 23221 24785 1565 49 107% mprotect 1 906 967 62 62 107% mprotect 2 3019 3203 184 92 106% mprotect 4 6149 6569 420 105 107% mprotect 8 9978 10524 545 68 105% mprotect 16 20448 21427 979 61 105% mprotect 32 40972 42935 1963 61 105% madvise_ 1 434 497 63 63 115% madvise_ 2 752 899 147 74 120% madvise_ 4 1313 1513 200 50 115% madvise_ 8 2271 2627 356 44 116% madvise_ 16 4312 4883 571 36 113% madvise_ 32 8376 9319 943 29 111% Based on the result, for 6.8 kernel, sealing check adds 20-40 nano seconds, or around 50-100 CPU cycles, per VMA. In addition, I applied the sealing to 5.10 kernel: The first test (measuring time) syscall__ vmas t tmseal delta_ns per_vma % munmap__ 1 357 390 33 33 109% munmap__ 2 442 463 21 11 105% munmap__ 4 614 634 20 5 103% munmap__ 8 1017 1137 120 15 112% munmap__ 16 1889 2153 263 16 114% munmap__ 32 4109 4088 -21 -1 99% mprotect 1 235 227 -7 -7 97% mprotect 2 495 464 -30 -15 94% mprotect 4 741 764 24 6 103% mprotect 8 1434 1437 2 0 100% mprotect 16 2958 2991 33 2 101% mprotect 32 6431 6608 177 6 103% madvise_ 1 191 208 16 16 109% madvise_ 2 300 324 24 12 108% madvise_ 4 450 473 23 6 105% madvise_ 8 753 806 53 7 107% madvise_ 16 1467 1592 125 8 108% madvise_ 32 2795 3405 610 19 122% The second test (measuring cpu cycle) syscall__ nbr_vma cpu cmseal delta_cpu per_vma % munmap__ 1 684 715 31 31 105% munmap__ 2 861 898 38 19 104% munmap__ 4 1183 1235 51 13 104% munmap__ 8 1999 2045 46 6 102% munmap__ 16 3839 3816 -23 -1 99% munmap__ 32 7672 7887 216 7 103% mprotect 1 397 443 46 46 112% mprotect 2 738 788 50 25 107% mprotect 4 1221 1256 35 9 103% mprotect 8 2356 2429 72 9 103% mprotect 16 4961 4935 -26 -2 99% mprotect 32 9882 10172 291 9 103% madvise_ 1 351 380 29 29 108% madvise_ 2 565 615 49 25 109% madvise_ 4 872 933 61 15 107% madvise_ 8 1508 1640 132 16 109% madvise_ 16 3078 3323 245 15 108% madvise_ 32 5893 6704 811 25 114% For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30 CPU cycles, there is even decrease in some cases. It might be interesting to compare 5.10 and 6.8 kernel The first test (measuring time) syscall__ vmas t_5_10 t_6_8 delta_ns per_vma % munmap__ 1 357 909 552 552 254% munmap__ 2 442 1398 956 478 316% munmap__ 4 614 2444 1830 458 398% munmap__ 8 1017 4029 3012 377 396% munmap__ 16 1889 6647 4758 297 352% munmap__ 32 4109 11811 7702 241 287% mprotect 1 235 439 204 204 187% mprotect 2 495 1659 1164 582 335% mprotect 4 741 3747 3006 752 506% mprotect 8 1434 6755 5320 665 471% mprotect 16 2958 13748 10790 674 465% mprotect 32 6431 27827 21397 669 433% madvise_ 1 191 240 49 49 125% madvise_ 2 300 366 67 33 122% madvise_ 4 450 623 173 43 138% madvise_ 8 753 1110 357 45 147% madvise_ 16 1467 2127 660 41 145% madvise_ 32 2795 4109 1314 41 147% The second test (measuring cpu cycle) syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma % munmap__ 1 684 1790 1106 1106 262% munmap__ 2 861 2819 1958 979 327% munmap__ 4 1183 4959 3776 944 419% munmap__ 8 1999 8262 6263 783 413% munmap__ 16 3839 13099 9260 579 341% munmap__ 32 7672 23221 15549 486 303% mprotect 1 397 906 509 509 228% mprotect 2 738 3019 2281 1140 409% mprotect 4 1221 6149 4929 1232 504% mprotect 8 2356 9978 7622 953 423% mprotect 16 4961 20448 15487 968 412% mprotect 32 9882 40972 31091 972 415% madvise_ 1 351 434 82 82 123% madvise_ 2 565 752 186 93 133% madvise_ 4 872 1313 442 110 151% madvise_ 8 1508 2271 763 95 151% madvise_ 16 3078 4312 1234 77 140% madvise_ 32 5893 8376 2483 78 142% From 5.10 to 6.8 munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma. mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma. madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma. In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the increase from 5.10 to 6.8 is significantly larger, approximately ten times greater for munmap and mprotect. When I discuss the mm performance with Brian Makin, an engineer who worked on performance, it was brought to my attention that such performance benchmarks, which measuring millions of mm syscall in a tight loop, may not accurately reflect real-world scenarios, such as that of a database service. Also this is tested using a single HW and ChromeOS, the data from another HW or distribution might be different. It might be best to take this data with a grain of salt. This patch (of 5): Wire up mseal syscall for all architectures. Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> [Bug #2] Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-09Merge tag 'lsm-pr-20240105' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull security module updates from Paul Moore: - Add three new syscalls: lsm_list_modules(), lsm_get_self_attr(), and lsm_set_self_attr(). The first syscall simply lists the LSMs enabled, while the second and third get and set the current process' LSM attributes. Yes, these syscalls may provide similar functionality to what can be found under /proc or /sys, but they were designed to support multiple, simultaneaous (stacked) LSMs from the start as opposed to the current /proc based solutions which were created at a time when only one LSM was allowed to be active at a given time. We have spent considerable time discussing ways to extend the existing /proc interfaces to support multiple, simultaneaous LSMs and even our best ideas have been far too ugly to support as a kernel API; after +20 years in the kernel, I felt the LSM layer had established itself enough to justify a handful of syscalls. Support amongst the individual LSM developers has been nearly unanimous, with a single objection coming from Tetsuo (TOMOYO) as he is worried that the LSM_ID_XXX token concept will make it more difficult for out-of-tree LSMs to survive. Several members of the LSM community have demonstrated the ability for out-of-tree LSMs to continue to exist by picking high/unused LSM_ID values as well as pointing out that many kernel APIs rely on integer identifiers, e.g. syscalls (!), but unfortunately Tetsuo's objections remain. My personal opinion is that while I have no interest in penalizing out-of-tree LSMs, I'm not going to penalize in-tree development to support out-of-tree development, and I view this as a necessary step forward to support the push for expanded LSM stacking and reduce our reliance on /proc and /sys which has occassionally been problematic for some container users. Finally, we have included the linux-api folks on (all?) recent revisions of the patchset and addressed all of their concerns. - Add a new security_file_ioctl_compat() LSM hook to handle the 32-bit ioctls on 64-bit systems problem. This patch includes support for all of the existing LSMs which provide ioctl hooks, although it turns out only SELinux actually cares about the individual ioctls. It is worth noting that while Casey (Smack) and Tetsuo (TOMOYO) did not give explicit ACKs to this patch, they did both indicate they are okay with the changes. - Fix a potential memory leak in the CALIPSO code when IPv6 is disabled at boot. While it's good that we are fixing this, I doubt this is something users are seeing in the wild as you need to both disable IPv6 and then attempt to configure IPv6 labeled networking via NetLabel/CALIPSO; that just doesn't make much sense. Normally this would go through netdev, but Jakub asked me to take this patch and of all the trees I maintain, the LSM tree seemed like the best fit. - Update the LSM MAINTAINERS entry with additional information about our process docs, patchwork, bug reporting, etc. I also noticed that the Lockdown LSM is missing a dedicated MAINTAINERS entry so I've added that to the pull request. I've been working with one of the major Lockdown authors/contributors to see if they are willing to step up and assume a Lockdown maintainer role; hopefully that will happen soon, but in the meantime I'll continue to look after it. - Add a handful of mailmap entries for Serge Hallyn and myself. * tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (27 commits) lsm: new security_file_ioctl_compat() hook lsm: Add a __counted_by() annotation to lsm_ctx.ctx calipso: fix memory leak in netlbl_calipso_add_pass() selftests: remove the LSM_ID_IMA check in lsm/lsm_list_modules_test MAINTAINERS: add an entry for the lockdown LSM MAINTAINERS: update the LSM entry mailmap: add entries for Serge Hallyn's dead accounts mailmap: update/replace my old email addresses lsm: mark the lsm_id variables are marked as static lsm: convert security_setselfattr() to use memdup_user() lsm: align based on pointer length in lsm_fill_user_ctx() lsm: consolidate buffer size handling into lsm_fill_user_ctx() lsm: correct error codes in security_getselfattr() lsm: cleanup the size counters in security_getselfattr() lsm: don't yet account for IMA in LSM_CONFIG_COUNT calculation lsm: drop LSM_ID_IMA LSM: selftests for Linux Security Module syscalls SELinux: Add selfattr hooks AppArmor: Add selfattr hooks Smack: implement setselfattr and getselfattr hooks ...
2024-01-08arm64: Update __NR_compat_syscalls for statmount/listmountFlorian Fainelli
Commit d8b0f5465012 ("wire up syscalls for statmount/listmount") added two new system calls to arch/arm64/include/asm/unistd32.h but forgot to update the __NR_compat_syscalls number, thus causing the following build failures: arch/arm64/include/asm/unistd32.h:922:24: error: array index in initializer exceeds array bounds 922 | #define __NR_statmount 457 | ^~~ arch/arm64/kernel/sys32.c:130:34: note: in definition of macro '__SYSCALL' 130 | #define __SYSCALL(nr, sym) [nr] = __arm64_##sym, | ^~ Bump up the number by two to accomodate for the new system calls added. Fixes: d8b0f5465012 ("wire up syscalls for statmount/listmount") Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-11-12LSM: wireup Linux Security Module syscallsCasey Schaufler
Wireup lsm_get_self_attr, lsm_set_self_attr and lsm_list_modules system calls. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: linux-api@vger.kernel.org Reviewed-by: Mickaël Salaün <mic@digikod.net> [PM: forward ported beyond v6.6 due merge window changes] Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-09-21futex: Add sys_futex_requeue()peterz@infradead.org
Finish off the 'simple' futex2 syscall group by adding sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are too numerous to fit into a regular syscall. As such, use struct futex_waitv to pass the 'source' and 'destination' futexes to the syscall. This syscall implements what was previously known as FUTEX_CMP_REQUEUE and uses {val, uaddr, flags} for source and {uaddr, flags} for destination. This design explicitly allows requeueing between different types of futex by having a different flags word per uaddr. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105248.511860556@noisy.programming.kicks-ass.net
2023-09-21futex: Add sys_futex_wait()peterz@infradead.org
To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This syscall implements what was previously known as FUTEX_WAIT_BITSET except it uses 'unsigned long' for the value and bitmask arguments, takes timespec and clockid_t arguments for the absolute timeout and uses FUTEX2 flags. The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105248.164324363@noisy.programming.kicks-ass.net
2023-09-21futex: Add sys_futex_wake()peterz@infradead.org
To complement sys_futex_waitv() add sys_futex_wake(). This syscall implements what was previously known as FUTEX_WAKE_BITSET except it uses 'unsigned long' for the bitmask and takes FUTEX2 flags. The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105247.936205525@noisy.programming.kicks-ass.net
2023-07-27arch: Register fchmodat2, usually as syscall 452Palmer Dabbelt
This registers the new fchmodat2 syscall in most places as nuber 452, with alpha being the exception where it's 562. I found all these sites by grepping for fspick, which I assume has found me everything. Signed-off-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Message-Id: <a677d521f048e4ca439e7080a5328f21eb8e960e.1689092120.git.legion@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-09cachestat: wire up cachestat for other architecturesNhat Pham
cachestat is previously only wired in for x86 (and architectures using the generic unistd.h table): https://lore.kernel.org/lkml/20230503013608.2431726-1-nphamcs@gmail.com/ This patch wires cachestat in for all the other architectures. [nphamcs@gmail.com: wire up cachestat for arm64] Link: https://lkml.kernel.org/r/20230511092843.3896327-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230510195806.2902878-1-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-26fs: stat: compat: Add __ARCH_WANT_COMPAT_STATGuo Ren
RISC-V doesn't neeed compat_stat, so using __ARCH_WANT_COMPAT_STAT to exclude unnecessary SYSCALL functions. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Heiko Stuebner <heiko@sntech.de> Acked-by: Helge Deller <deller@gmx.de> # parisc Link: https://lore.kernel.org/r/20220405071314.3225832-6-guoren@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2022-01-15mm/mempolicy: wire up syscall set_mempolicy_home_nodeAneesh Kumar K.V
Link: https://lkml.kernel.org/r/20211202123810.267175-4-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-10-07futex,arm: Wire up sys_futex_waitv()André Almeida
Wire up syscall entry point for ARM architectures, for both 32 and 64-bit. Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210923171111.300673-19-andrealmeid@collabora.com
2021-09-03mm: wire up syscall process_mreleaseSuren Baghdasaryan
Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/20210809185259.405936-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-22arch: Wire up Landlock syscallsMickaël Salaün
Wire up the following system calls for all architectures: * landlock_create_ruleset(2) * landlock_add_rule(2) * landlock_restrict_self(2) Cc: Arnd Bergmann <arnd@arndb.de> Cc: James Morris <jmorris@namei.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com> Link: https://lore.kernel.org/r/20210422154123.13086-10-mic@digikod.net Signed-off-by: James Morris <jamorris@linux.microsoft.com>
2021-01-24fs: add mount_setattr()Christian Brauner
This implements the missing mount_setattr() syscall. While the new mount api allows to change the properties of a superblock there is currently no way to change the properties of a mount or a mount tree using file descriptors which the new mount api is based on. In addition the old mount api has the restriction that mount options cannot be applied recursively. This hasn't changed since changing mount options on a per-mount basis was implemented in [1] and has been a frequent request not just for convenience but also for security reasons. The legacy mount syscall is unable to accommodate this behavior without introducing a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost mount. Changing MS_REC to apply to the whole mount tree would mean introducing a significant uapi change and would likely cause significant regressions. The new mount_setattr() syscall allows to recursively clear and set mount options in one shot. Multiple calls to change mount options requesting the same changes are idempotent: int mount_setattr(int dfd, const char *path, unsigned flags, struct mount_attr *uattr, size_t usize); Flags to modify path resolution behavior are specified in the @flags argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW, and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to restrict path resolution as introduced with openat2() might be supported in the future. The mount_setattr() syscall can be expected to grow over time and is designed with extensibility in mind. It follows the extensible syscall pattern we have used with other syscalls such as openat2(), clone3(), sched_{set,get}attr(), and others. The set of mount options is passed in the uapi struct mount_attr which currently has the following layout: struct mount_attr { __u64 attr_set; __u64 attr_clr; __u64 propagation; __u64 userns_fd; }; The @attr_set and @attr_clr members are used to clear and set mount options. This way a user can e.g. request that a set of flags is to be raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in @attr_set while at the same time requesting that another set of flags is to be lowered such as removing noexec from a mount tree by specifying MOUNT_ATTR_NOEXEC in @attr_clr. Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0, not a bitmap, users wanting to transition to a different atime setting cannot simply specify the atime setting in @attr_set, but must also specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in @attr_clr. The @propagation field lets callers specify the propagation type of a mount tree. Propagation is a single property that has four different settings and as such is not really a flag argument but an enum. Specifically, it would be unclear what setting and clearing propagation settings in combination would amount to. The legacy mount() syscall thus forbids the combination of multiple propagation settings too. The goal is to keep the semantics of mount propagation somewhat simple as they are overly complex as it is. The @userns_fd field lets user specify a user namespace whose idmapping becomes the idmapping of the mount. This is implemented and explained in detail in the next patch. [1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount") Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-api@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-12-19epoll: wire up syscall epoll_pwait2Willem de Bruijn
Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/20201121144401.3727659-4-willemdebruijn.kernel@gmail.com Signed-off-by: Willem de Bruijn <willemb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18mm/madvise: introduce process_madvise() syscall: an external memory hinting APIMinchan Kim
There is usecase that System Management Software(SMS) want to give a memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the case of Android, it is the ActivityManagerService. The information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon(ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces a new syscall process_madvise(2). It uses pidfd of an external process to give the hint. It also supports vector address range because Android app has thousands of vmas due to zygote so it's totally waste of CPU and power if we should call the syscall one by one for each vma.(With testing 2000-vma syscall vs 1-vector syscall, it showed 15% performance improvement. I think it would be bigger in real practice because the testing ran very cache friendly environment). Another potential use case for the vector range is to amortize the cost ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could benefit users like TCP receive zerocopy and malloc implementations. In future, we could find more usecases for other advises so let's make it happens as API since we introduce a new syscall at this moment. With that, existing madvise(2) user could replace it with process_madvise(2) with their own pid if they want to have batch address ranges support feature. ince it could affect other process's address range, only privileged process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same UID) gives it the right to ptrace the process could use it successfully. The flag argument is reserved for future use if we need to extend the API. I think supporting all hints madvise has/will supported/support to process_madvise is rather risky. Because we are not sure all hints make sense from external process and implementation for the hint may rely on the caller being in the current context so it could be error-prone. Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. If someone want to add other hints, we could hear the usecase and review it for each hint. It's safer for maintenance rather than introducing a buggy syscall but hard to fix it later. So finally, the API is as follows, ssize_t process_madvise(int pidfd, const struct iovec *iovec, unsigned long vlen, int advice, unsigned int flags); DESCRIPTION The process_madvise() system call is used to give advice or directions to the kernel about the address ranges from external process as well as local process. It provides the advice to address ranges of process described by iovec and vlen. The goal of such advice is to improve system or application performance. The pidfd selects the process referred to by the PID file descriptor specified in pidfd. (See pidofd_open(2) for further information) The pointer iovec points to an array of iovec structures, defined in <sys/uio.h> as: struct iovec { void *iov_base; /* starting address */ size_t iov_len; /* number of bytes to be advised */ }; The iovec describes address ranges beginning at address(iov_base) and with size length of bytes(iov_len). The vlen represents the number of elements in iovec. The advice is indicated in the advice argument, which is one of the following at this moment if the target process specified by pidfd is external. MADV_COLD MADV_PAGEOUT Permission to provide a hint to external process is governed by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2). The process_madvise supports every advice madvise(2) has if target process is in same thread group with calling process so user could use process_madvise(2) to extend existing madvise(2) to support vector address ranges. RETURN VALUE On success, process_madvise() returns the number of bytes advised. This return value may be less than the total number of requested bytes, if an error occurred. The caller should check return value to determine whether a partial advice occurred. FAQ: Q.1 - Why does any external entity have better knowledge? Quote from Sandeep "For Android, every application (including the special SystemServer) are forked from Zygote. The reason of course is to share as many libraries and classes between the two as possible to benefit from the preloading during boot. After applications start, (almost) all of the APIs end up calling into this SystemServer process over IPC (binder) and back to the application. In a fully running system, the SystemServer monitors every single process periodically to calculate their PSS / RSS and also decides which process is "important" to the user for interactivity. So, because of how these processes start _and_ the fact that the SystemServer is looping to monitor each process, it does tend to *know* which address range of the application is not used / useful. Besides, we can never rely on applications to clean things up themselves. We've had the "hey app1, the system is low on memory, please trim your memory usage down" notifications for a long time[1]. They rely on applications honoring the broadcasts and very few do. So, if we want to avoid the inevitable killing of the application and restarting it, some way to be able to tell the OS about unimportant memory in these applications will be useful. - ssp Q.2 - How to guarantee the race(i.e., object validation) between when giving a hint from an external process and get the hint from the target process? process_madvise operates on the target process's address space as it exists at the instant that process_madvise is called. If the space target process can run between the time the process_madvise process inspects the target process address space and the time that process_madvise is actually called, process_madvise may operate on memory regions that the calling process does not expect. It's the responsibility of the process calling process_madvise to close this race condition. For example, the calling process can suspend the target process with ptrace, SIGSTOP, or the freezer cgroup so that it doesn't have an opportunity to change its own address space before process_madvise is called. Another option is to operate on memory regions that the caller knows a priori will be unchanged in the target process. Yet another option is to accept the race for certain process_madvise calls after reasoning that mistargeting will do no harm. The suggested API itself does not provide synchronization. It also apply other APIs like move_pages, process_vm_write. The race isn't really a problem though. Why is it so wrong to require that callers do their own synchronization in some manner? Nobody objects to write(2) merely because it's possible for two processes to open the same file and clobber each other's writes --- instead, we tell people to use flock or something. Think about mmap. It never guarantees newly allocated address space is still valid when the user tries to access it because other threads could unmap the memory right before. That's where we need synchronization by using other API or design from userside. It shouldn't be part of API itself. If someone needs more fine-grained synchronization rather than process level, there were two ideas suggested - cookie[2] and anon-fd[3]. Both are applicable via using last reserved argument of the API but I don't think it's necessary right now since we have already ways to prevent the race so don't want to add additional complexity with more fine-grained optimization model. To make the API extend, it reserved an unsigned long as last argument so we could support it in future if someone really needs it. Q.3 - Why doesn't ptrace work? Injecting an madvise in the target process using ptrace would not work for us because such injected madvise would have to be executed by the target process, which means that process would have to be runnable and that creates the risk of the abovementioned race and hinting a wrong VMA. Furthermore, we want to act the hint in caller's context, not the callee's, because the callee is usually limited in cpuset/cgroups or even freezed state so they can't act by themselves quick enough, which causes more thrashing/kill. It doesn't work if the target process are ptraced(e.g., strace, debugger, minidump) because a process can have at most one ptracer. [1] https://developer.android.com/topic/performance/memory" [2] process_getinfo for getting the cookie which is updated whenever vma of process address layout are changed - Daniel Colascione - https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224 [3] anonymous fd which is used for the object(i.e., address range) validation - Michal Hocko - https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/ [minchan@kernel.org: fix process_madvise build break for arm64] Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com [minchan@kernel.org: fix build error for mips of process_madvise] Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com [akpm@linux-foundation.org: fix patch ordering issue] [akpm@linux-foundation.org: fix arm64 whoops] [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian] [akpm@linux-foundation.org: fix i386 build] [sfr@canb.auug.org.au: fix syscall numbering] Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au [sfr@canb.auug.org.au: madvise.c needs compat.h] Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au [minchan@kernel.org: fix mips build] Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com [yuehaibing@huawei.com: remove duplicate header which is included twice] Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com [minchan@kernel.org: do not use helper functions for process_madvise] Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com [akpm@linux-foundation.org: pidfd_get_pid() gained an argument] [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"] Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <christian@brauner.io> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-14vfs: add faccessat2 syscallMiklos Szeredi
POSIX defines faccessat() as having a fourth "flags" argument, while the linux syscall doesn't have it. Glibc tries to emulate AT_EACCESS and AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken. Add a new faccessat(2) syscall with the added flags argument and implement both flags. The value of AT_EACCESS is defined in glibc headers to be the same as AT_REMOVEDIR. Use this value for the kernel interface as well, together with the explanatory comment. Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can be useful and is trivial to implement. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-03-19arm64: compat: Fix syscall number of compat_clock_getresVincenzo Frascino
The syscall number of compat_clock_getres was erroneously set to 247 (__NR_io_cancel!) instead of 264. This causes the vDSO fallback of clock_getres() to land on the wrong syscall for compat tasks. Fix the numbering. Cc: <stable@vger.kernel.org> Fixes: 53c489e1dfeb6 ("arm64: compat: Add missing syscall numbers") Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2020-01-29Merge tag 'threads-v5.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread management updates from Christian Brauner: "Sargun Dhillon over the last cycle has worked on the pidfd_getfd() syscall. This syscall allows for the retrieval of file descriptors of a process based on its pidfd. A task needs to have ptrace_may_access() permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and Andy) on the target. One of the main use-cases is in combination with seccomp's user notification feature. As a reminder, seccomp's user notification feature was made available in v5.0. It allows a task to retrieve a file descriptor for its seccomp filter. The file descriptor is usually handed of to a more privileged supervising process. The supervisor can then listen for syscall events caught by the seccomp filter of the supervisee and perform actions in lieu of the supervisee, usually emulating syscalls. pidfd_getfd() is needed to expand its uses. There are currently two major users that wait on pidfd_getfd() and one future user: - Netflix, Sargun said, is working on a service mesh where users should be able to connect to a dns-based VIP. When a user connects to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be redirected to an envoy process. This service mesh uses seccomp user notifications and pidfd to intercept all connect calls and instead of connecting them to 1.2.3.4:80 connects them to e.g. 127.0.0.1:8080. - LXD uses the seccomp notifier heavily to intercept and emulate mknod() and mount() syscalls for unprivileged containers/processes. With pidfd_getfd() more uses-cases e.g. bridging socket connections will be possible. - The patchset has also seen some interest from the browser corner. Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a broker process. In the future glibc will start blocking all signals during dlopen() rendering this type of sandbox impossible. Hence, in the future Firefox will switch to a seccomp-user-nofication based sandbox which also makes use of file descriptor retrieval. The thread for this can be found at https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html With pidfd_getfd() it is e.g. possible to bridge socket connections for the supervisee (binding to a privileged port) and taking actions on file descriptors on behalf of the supervisee in general. Sargun's first version was using an ioctl on pidfds but various people pushed for it to be a proper syscall which he duely implemented as well over various review cycles. Selftests are of course included. I've also added instructions how to deal with merge conflicts below. There's also a small fix coming from the kernel mentee project to correctly annotate struct sighand_struct with __rcu to fix various sparse warnings. We've received a few more such fixes and even though they are mostly trivial I've decided to postpone them until after -rc1 since they came in rather late and I don't want to risk introducing build warnings. Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is needed to avoid allocation recursions triggerable by storage drivers that have userspace parts that run in the IO path (e.g. dm-multipath, iscsi, etc). These allocation recursions deadlock the device. The new prctl() allows such privileged userspace components to avoid allocation recursions by setting the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE flags. The patch carries the necessary acks from the relevant maintainers and is routed here as part of prctl() thread-management." * tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim sched.h: Annotate sighand_struct with __rcu test: Add test for pidfd getfd arch: wire up pidfd_getfd syscall pid: Implement pidfd_getfd syscall vfs, fdtable: Add fget_task helper
2020-01-29Merge branch 'work.openat2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull openat2 support from Al Viro: "This is the openat2() series from Aleksa Sarai. I'm afraid that the rest of namei stuff will have to wait - it got zero review the last time I'd posted #work.namei, and there had been a leak in the posted series I'd caught only last weekend. I was going to repost it on Monday, but the window opened and the odds of getting any review during that... Oh, well. Anyway, openat2 part should be ready; that _did_ get sane amount of review and public testing, so here it comes" From Aleksa's description of the series: "For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Furthermore, the need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications. This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum project[5]) with a few additions and changes made based on the previous discussion within [6] as well as others I felt were useful. In line with the conclusions of the original discussion of AT_NO_JUMPS, the flag has been split up into separate flags. However, instead of being an openat(2) flag it is provided through a new syscall openat2(2) which provides several other improvements to the openat(2) interface (see the patch description for more details). The following new LOOKUP_* flags are added: LOOKUP_NO_XDEV: Blocks all mountpoint crossings (upwards, downwards, or through absolute links). Absolute pathnames alone in openat(2) do not trigger this. Magic-link traversal which implies a vfsmount jump is also blocked (though magic-link jumps on the same vfsmount are permitted). LOOKUP_NO_MAGICLINKS: Blocks resolution through /proc/$pid/fd-style links. This is done by blocking the usage of nd_jump_link() during resolution in a filesystem. The term "magic-links" is used to match with the only reference to these links in Documentation/, but I'm happy to change the name. It should be noted that this is different to the scope of ~LOOKUP_FOLLOW in that it applies to all path components. However, you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it will *not* fail (assuming that no parent component was a magic-link), and you will have an fd for the magic-link. In order to correctly detect magic-links, the introduction of a new LOOKUP_MAGICLINK_JUMPED state flag was required. LOOKUP_BENEATH: Disallows escapes to outside the starting dirfd's tree, using techniques such as ".." or absolute links. Absolute paths in openat(2) are also disallowed. Conceptually this flag is to ensure you "stay below" a certain point in the filesystem tree -- but this requires some additional to protect against various races that would allow escape using "..". Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it can trivially beam you around the filesystem (breaking the protection). In future, there might be similar safety checks done as in LOOKUP_IN_ROOT, but that requires more discussion. In addition, two new flags are added that expand on the above ideas: LOOKUP_NO_SYMLINKS: Does what it says on the tin. No symlink resolution is allowed at all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an fd for the symlink as long as no parent path had a symlink component. LOOKUP_IN_ROOT: This is an extension of LOOKUP_BENEATH that, rather than blocking attempts to move past the root, forces all such movements to be scoped to the starting point. This provides chroot(2)-like protection but without the cost of a chroot(2) for each filesystem operation, as well as being safe against race attacks that chroot(2) is not. If a race is detected (as with LOOKUP_BENEATH) then an error is generated, and similar to LOOKUP_BENEATH it is not permitted to cross magic-links with LOOKUP_IN_ROOT. The primary need for this is from container runtimes, which currently need to do symlink scoping in userspace[7] when opening paths in a potentially malicious container. There is a long list of CVEs that could have bene mitigated by having RESOLVE_THIS_ROOT (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a few). In order to make all of the above more usable, I'm working on libpathrs[8] which is a C-friendly library for safe path resolution. It features a userspace-emulated backend if the kernel doesn't support openat2(2). Hopefully we can get userspace to switch to using it, and thus get openat2(2) support for free once it's ready. Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes though stale NFS handles)" * 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Documentation: path-lookup: include new LOOKUP flags selftests: add openat2(2) selftests open: introduce openat2(2) syscall namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution namei: LOOKUP_IN_ROOT: chroot-like scoped resolution namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution namei: LOOKUP_NO_XDEV: block mountpoint crossing namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution namei: LOOKUP_NO_SYMLINKS: block symlink resolution namei: allow set_root() to produce errors namei: allow nd_jump_link() to produce errors nsfs: clean-up ns_get_path() signature to return int namei: only return -ECHILD from follow_dotdot_rcu()
2020-01-18open: introduce openat2(2) syscallAleksa Sarai
/* Background. */ For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Userspace also has a hard time figuring out whether a particular flag is supported on a particular kernel. While it is now possible with contemporary kernels (thanks to [3]), older kernels will expose unknown flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during openat(2) time matches modern syscall designs and is far more fool-proof. In addition, the newly-added path resolution restriction LOOKUP flags (which we would like to expose to user-space) don't feel related to the pre-existing O_* flag set -- they affect all components of path lookup. We'd therefore like to add a new flag argument. Adding a new syscall allows us to finally fix the flag-ignoring problem, and we can make it extensible enough so that we will hopefully never need an openat3(2). /* Syscall Prototype. */ /* * open_how is an extensible structure (similar in interface to * clone3(2) or sched_setattr(2)). The size parameter must be set to * sizeof(struct open_how), to allow for future extensions. All future * extensions will be appended to open_how, with their zero value * acting as a no-op default. */ struct open_how { /* ... */ }; int openat2(int dfd, const char *pathname, struct open_how *how, size_t size); /* Description. */ The initial version of 'struct open_how' contains the following fields: flags Used to specify openat(2)-style flags. However, any unknown flag bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR) will result in -EINVAL. In addition, this field is 64-bits wide to allow for more O_ flags than currently permitted with openat(2). mode The file mode for O_CREAT or O_TMPFILE. Must be set to zero if flags does not contain O_CREAT or O_TMPFILE. resolve Restrict path resolution (in contrast to O_* flags they affect all path components). The current set of flags are as follows (at the moment, all of the RESOLVE_ flags are implemented as just passing the corresponding LOOKUP_ flag). RESOLVE_NO_XDEV => LOOKUP_NO_XDEV RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS RESOLVE_BENEATH => LOOKUP_BENEATH RESOLVE_IN_ROOT => LOOKUP_IN_ROOT open_how does not contain an embedded size field, because it is of little benefit (userspace can figure out the kernel open_how size at runtime fairly easily without it). It also only contains u64s (even though ->mode arguably should be a u16) to avoid having padding fields which are never used in the future. Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE is no longer permitted for openat(2). As far as I can tell, this has always been a bug and appears to not be used by userspace (and I've not seen any problems on my machines by disallowing it). If it turns out this breaks something, we can special-case it and only permit it for openat(2) but not openat2(2). After input from Florian Weimer, the new open_how and flag definitions are inside a separate header from uapi/linux/fcntl.h, to avoid problems that glibc has with importing that header. /* Testing. */ In a follow-up patch there are over 200 selftests which ensure that this syscall has the correct semantics and will correctly handle several attack scenarios. In addition, I've written a userspace library[4] which provides convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care must be taken when using RESOLVE_IN_ROOT'd file descriptors with other syscalls). During the development of this patch, I've run numerous verification tests using libpathrs (showing that the API is reasonably usable by userspace). /* Future Work. */ Additional RESOLVE_ flags have been suggested during the review period. These can be easily implemented separately (such as blocking auto-mount during resolution). Furthermore, there are some other proposed changes to the openat(2) interface (the most obvious example is magic-link hardening[5]) which would be a good opportunity to add a way for userspace to restrict how O_PATH file descriptors can be re-opened. Another possible avenue of future work would be some kind of CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace which openat2(2) flags and fields are supported by the current kernel (to avoid userspace having to go through several guesses to figure it out). [1]: https://lwn.net/Articles/588444/ [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com [3]: commit 629e014bb834 ("fs: completely ignore unknown open flags") [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/ [6]: https://youtu.be/ggD-eb3yPVs Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-13arch: wire up pidfd_getfd syscallSargun Dhillon
This wires up the pidfd_getfd syscall for all architectures. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20200107175927.4558-4-sargun@sargun.me Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-01-07arm64: Move __ARCH_WANT_SYS_CLONE3 definition to uapi headersAmanieu d'Antras
Previously this was only defined in the internal headers which resulted in __NR_clone3 not being defined in the user headers. Signed-off-by: Amanieu d'Antras <amanieu@gmail.com> Cc: linux-arm-kernel@lists.infradead.org Cc: <stable@vger.kernel.org> # 5.3.x Reviewed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20200102172413.654385-2-amanieu@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-07-11Merge tag 'clone3-v5.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull clone3 system call from Christian Brauner: "This adds the clone3 syscall which is an extensible successor to clone after we snagged the last flag with CLONE_PIDFD during the 5.2 merge window for clone(). It cleanly supports all of the flags from clone() and thus all legacy workloads. There are few user visible differences between clone3 and clone. First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse this flag. Second, the CSIGNAL flag is deprecated and will cause EINVAL to be reported. It is superseeded by a dedicated "exit_signal" argument in struct clone_args thus freeing up even more flags. And third, clone3 gives CLONE_PIDFD a dedicated return argument in struct clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr argument. The clone3 uapi is designed to be easy to handle on 32- and 64 bit: /* uapi */ struct clone_args { __aligned_u64 flags; __aligned_u64 pidfd; __aligned_u64 child_tid; __aligned_u64 parent_tid; __aligned_u64 exit_signal; __aligned_u64 stack; __aligned_u64 stack_size; __aligned_u64 tls; }; and a separate kernel struct is used that uses proper kernel typing: /* kernel internal */ struct kernel_clone_args { u64 flags; int __user *pidfd; int __user *child_tid; int __user *parent_tid; int exit_signal; unsigned long stack; unsigned long stack_size; unsigned long tls; }; The system call comes with a size argument which enables the kernel to detect what version of clone_args userspace is passing in. clone3 validates that any additional bytes a given kernel does not know about are set to zero and that the size never exceeds a page. A nice feature is that this patchset allowed us to cleanup and simplify various core kernel codepaths in kernel/fork.c by making the internal _do_fork() function take struct kernel_clone_args even for legacy clone(). This patch also unblocks the time namespace patchset which wants to introduce a new CLONE_TIMENS flag. Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and xtensa. These were the architectures that did not require special massaging. Other architectures treat fork-like system calls individually and after some back and forth neither Arnd nor I felt confident that we dared to add clone3 unconditionally to all architectures. We agreed to leave this up to individual architecture maintainers. This is why there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3 which any architecture can set once it has implemented support for clone3. The patch also adds a cond_syscall(clone3) for architectures such as nios2 or h8300 that generate their syscall table by simply including asm-generic/unistd.h. The hope is to get rid of __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon" * tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: arch: handle arches who do not yet define clone3 arch: wire-up clone3() syscall fork: add clone3
2019-07-10Merge tag 'pidfd-updates-v5.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd updates from Christian Brauner: "This adds two main features. - First, it adds polling support for pidfds. This allows process managers to know when a (non-parent) process dies in a race-free way. The notification mechanism used follows the same logic that is currently used when the parent of a task is notified of a child's death. With this patchset it is possible to put pidfds in an {e}poll loop and get reliable notifications for process (i.e. thread-group) exit. - The second feature compliments the first one by making it possible to retrieve pollable pidfds for processes that were not created using CLONE_PIDFD. A lot of processes get created with traditional PID-based calls such as fork() or clone() (without CLONE_PIDFD). For these processes a caller can currently not create a pollable pidfd. This is a problem for Android's low memory killer (LMK) and service managers such as systemd. Both patchsets are accompanied by selftests. It's perhaps worth noting that the work done so far and the work done in this branch for pidfd_open() and polling support do already see some adoption: - Android is in the process of backporting this work to all their LTS kernels [1] - Service managers make use of pidfd_send_signal but will need to wait until we enable waiting on pidfds for full adoption. - And projects I maintain make use of both pidfd_send_signal and CLONE_PIDFD [2] and will use polling support and pidfd_open() too" [1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22 https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22 https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22 [2] https://github.com/lxc/lxc/blob/aab6e3eb73c343231cdde775db938994fc6f2803/src/lxc/start.c#L1753 * tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: tests: add pidfd_open() tests arch: wire-up pidfd_open() pid: add pidfd_open() pidfd: add polling selftests pidfd: add polling support
2019-06-28arch: wire-up pidfd_open()Christian Brauner
This wires up the pidfd_open() syscall into all arches at once. Signed-off-by: Christian Brauner <christian@brauner.io> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-api@vger.kernel.org Cc: linux-alpha@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-ia64@vger.kernel.org Cc: linux-m68k@lists.linux-m68k.org Cc: linux-mips@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: linux-xtensa@linux-xtensa.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org
2019-06-22arm64: compat: Add missing syscall numbersVincenzo Frascino
vDSO requires gettimeofday() and clock_gettime() syscalls to implement the fallback mechanism. Add the missing syscall numbers to unistd.h for arm64. Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Shijith Thotton <sthotton@marvell.com> Tested-by: Andre Przywara <andre.przywara@arm.com> Cc: linux-arch@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mips@vger.kernel.org Cc: linux-kselftest@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@armlinux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paul.burton@mips.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Huw Davies <huw@codeweavers.com> Link: https://lkml.kernel.org/r/20190621095252.32307-7-vincenzo.frascino@arm.com
2019-06-21arch: handle arches who do not yet define clone3Christian Brauner
This cleanly handles arches who do not yet define clone3. clone3() was initially placed under __ARCH_WANT_SYS_CLONE under the assumption that this would cleanly handle all architectures. It does not. Architectures such as nios2 or h8300 simply take the asm-generic syscall definitions and generate their syscall table from it. Since they don't define __ARCH_WANT_SYS_CLONE the build would fail complaining about sys_clone3 missing. The reason this doesn't happen for legacy clone is that nios2 and h8300 provide assembly stubs for sys_clone. This seems to be done for architectural reasons. The build failures for nios2 and h8300 were caught int -next luckily. The solution is to define __ARCH_WANT_SYS_CLONE3 that architectures can add. Additionally, we need a cond_syscall(clone3) for architectures such as nios2 or h8300 that generate their syscall table in the way I explained above. Fixes: 8f3220a80654 ("arch: wire-up clone3() syscall") Signed-off-by: Christian Brauner <christian@brauner.io> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Adrian Reber <adrian@lisas.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 234Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not see http www gnu org licenses extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 503 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Enrico Weigelt <info@metux.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190602204653.811534538@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09arch: wire-up clone3() syscallChristian Brauner
Wire up the clone3() call on all arches that don't require hand-rolled assembly. Some of the arches look like they need special assembly massaging and it is probably smarter if the appropriate arch maintainers would do the actual wiring. Arches that are wired-up are: - x86{_32,64} - arm{64} - xtensa Signed-off-by: Christian Brauner <christian@brauner.io> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Kees Cook <keescook@chromium.org> Cc: David Howells <dhowells@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Adrian Reber <adrian@lisas.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Cc: linux-api@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org
2019-05-16uapi: Wire up the mount API syscalls on non-x86 arches [ver #2]David Howells
Wire up the mount API syscalls on non-x86 arches. Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-15arch: add pidfd and io_uring syscalls everywhereArnd Bergmann
Add the io_uring and pidfd_send_signal system calls to all architectures. These system calls are designed to handle both native and compat tasks, so all entries are the same across architectures, only arm-compat and the generic tale still use an old format. Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> (s390) Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-02-07y2038: add 64-bit time_t syscalls to all 32-bit architecturesArnd Bergmann
This adds 21 new system calls on each ABI that has 32-bit time_t today. All of these have the exact same semantics as their existing counterparts, and the new ones all have macro names that end in 'time64' for clarification. This gets us to the point of being able to safely use a C library that has 64-bit time_t in user space. There are still a couple of loose ends to tie up in various areas of the code, but this is the big one, and should be entirely uncontroversial at this point. In particular, there are four system calls (getitimer, setitimer, waitid, and getrusage) that don't have a 64-bit counterpart yet, but these can all be safely implemented in the C library by wrapping around the existing system calls because the 32-bit time_t they pass only counts elapsed time, not time since the epoch. They will be dealt with later. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
2019-01-25ARM: add kexec_file_load system call numberArnd Bergmann
A couple of architectures including arm64 already implement the kexec_file_load system call, on many others we have assigned a system call number for it, but not implemented it yet. Adding the number in arch/arm/ lets us use the system call on arm64 systems in compat mode, and also reduces the number of differences between architectures. If we want to implement kexec_file_load on ARM in the future, the number assignment means that kexec tools can already be built with the now current set of kernel headers. Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-01-25ARM: add migrate_pages() system callArnd Bergmann
The migrate_pages system call has an assigned number on all architectures except ARM. When it got added initially in commit d80ade7b3231 ("ARM: Fix warning: #warning syscall migrate_pages not implemented"), it was intentionally left out based on the observation that there are no 32-bit ARM NUMA systems. However, there are now arm64 NUMA machines that can in theory run 32-bit kernels (actually enabling NUMA there would require additional work) as well as 32-bit user space on 64-bit kernels, so that argument is no longer very strong. Assigning the number lets us use the system call on 64-bit kernels as well as providing a more consistent set of syscalls across architectures. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
2019-01-04arm64: compat: Hook up io_pgetevents() for 32-bit tasksWill Deacon
Commit 73aeb2cbcdc9 ("ARM: 8787/1: wire up io_pgetevents syscall") hooked up the io_pgetevents() system call for 32-bit ARM, so we can do the same for the compat wrapper on arm64. Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-01-04arm64: compat: Avoid sending SIGILL for unallocated syscall numbersWill Deacon
The ARM Linux kernel handles the EABI syscall numbers as follows: 0 - NR_SYSCALLS-1 : Invoke syscall via syscall table NR_SYSCALLS - 0xeffff : -ENOSYS (to be allocated in future) 0xf0000 - 0xf07ff : Private syscall or -ENOSYS if not allocated > 0xf07ff : SIGILL Our compat code gets this wrong and ends up sending SIGILL in response to all syscalls greater than NR_SYSCALLS which have a value greater than 0x7ff in the bottom 16 bits. Fix this by defining the end of the ARM private syscall region and checking the syscall number against that directly. Update the comment while we're at it. Cc: <stable@vger.kernel.org> Cc: Dave Martin <Dave.Martin@arm.com> Reported-by: Pi-Hsun Shih <pihsun@chromium.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-08-29y2038: utimes: Rework #ifdef guards for compat syscallsArnd Bergmann
After changing over to 64-bit time_t syscalls, many architectures will want compat_sys_utimensat() but not respective handlers for utime(), utimes() and futimesat(). This adds a new __ARCH_WANT_SYS_UTIME32 to complement __ARCH_WANT_SYS_UTIME. For now, all 64-bit architectures that support CONFIG_COMPAT set it, but future 64-bit architectures will not (tile would not have needed it either, but got removed). As older 32-bit architectures get converted to using CONFIG_64BIT_TIME, they will have to use __ARCH_WANT_SYS_UTIME32 instead of __ARCH_WANT_SYS_UTIME. Architectures using the generic syscall ABI don't need either of them as they never had a utime syscall. Since the compat_utimbuf structure is now required outside of CONFIG_COMPAT, I'm moving it into compat_time.h. Signed-off-by: Arnd Bergmann <arnd@arndb.de> --- changed from last version: - renamed __ARCH_WANT_COMPAT_SYS_UTIME to __ARCH_WANT_SYS_UTIME32
2018-08-29asm-generic: Remove unneeded __ARCH_WANT_SYS_LLSEEK macroArnd Bergmann
The sys_llseek sytem call is needed on all 32-bit architectures and none of the 64-bit ones, so we can remove the __ARCH_WANT_SYS_LLSEEK guard and simplify the include/asm-generic/unistd.h header further. Since 32-bit tasks can run either natively or in compat mode on 64-bit architectures, we have to check for both !CONFIG_64BIT and CONFIG_COMPAT. There are a few 64-bit architectures that also reference sys_llseek in their 64-bit ABI (e.g. sparc), but I verified that those all select CONFIG_COMPAT, so the #if check is still correct here. It's a bit odd to include it in the syscall table though, as it's the same as sys_lseek() on 64-bit, but with strange calling conventions. Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-07-11arm64: rseq: Implement backend rseq calls and select HAVE_RSEQWill Deacon
Implement calls to rseq_signal_deliver, rseq_handle_notify_resume and rseq_syscall so that we can select HAVE_RSEQ on arm64. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-04-17Remove compat_sys_getdents64()Al Viro
Unlike normal compat syscall variants, it is needed only for biarch architectures that have different alignement requirements for u64 in 32bit and 64bit ABI *and* have __put_user() that won't handle a store of 64bit value at 32bit-aligned address. We used to have one such (ia64), but its biarch support has been gone since 2010 (after being broken in 2008, which went unnoticed since nobody had been using it). It had escaped removal at the same time only because back in 2004 a patch that switched several syscalls on amd64 from private wrappers to generic compat ones had switched to use of compat_sys_getdents64(), which hadn't needed (or used) a compat wrapper on amd64. Let's bury it - it's at least 7 years overdue. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-21arm64: compat: Update compat syscallsWill Deacon
Hook up three pkey syscalls (which we don't implement) and the new statx syscall, as has been done for arch/arm/. Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-06-01arm64: unistd32.h: wire up missing syscalls for compat tasksWill Deacon
We're missing entries for mlock2, copy_file_range, preadv2 and pwritev2 in our compat syscall table, so hook them up. Only the last two need compat wrappers. Signed-off-by: Will Deacon <will.deacon@arm.com>
2015-10-14arm64: compat: wire up new syscallsWill Deacon
Commit 208473c1f3ac ("ARM: wire up new syscalls") hooked up the new userfaultfd and membarrier syscalls for ARM, so do the same for our compat syscall table in arm64. Signed-off-by: Will Deacon <will.deacon@arm.com>
2015-02-11Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: "arm64 updates for 3.20: - reimplementation of the virtual remapping of UEFI Runtime Services in a way that is stable across kexec - emulation of the "setend" instruction for 32-bit tasks (user endianness switching trapped in the kernel, SCTLR_EL1.E0E bit set accordingly) - compat_sys_call_table implemented in C (from asm) and made it a constant array together with sys_call_table - export CPU cache information via /sys (like other architectures) - DMA API implementation clean-up in preparation for IOMMU support - macros clean-up for KVM - dropped some unnecessary cache+tlb maintenance - CONFIG_ARM64_CPU_SUSPEND clean-up - defconfig update (CPU_IDLE) The EFI changes going via the arm64 tree have been acked by Matt Fleming. There is also a patch adding sys_*stat64 prototypes to include/linux/syscalls.h, acked by Andrew Morton" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (47 commits) arm64: compat: Remove incorrect comment in compat_siginfo arm64: Fix section mismatch on alloc_init_p[mu]d() arm64: Avoid breakage caused by .altmacro in fpsimd save/restore macros arm64: mm: use *_sect to check for section maps arm64: drop unnecessary cache+tlb maintenance arm64:mm: free the useless initial page table arm64: Enable CPU_IDLE in defconfig arm64: kernel: remove ARM64_CPU_SUSPEND config option arm64: make sys_call_table const arm64: Remove asm/syscalls.h arm64: Implement the compat_sys_call_table in C syscalls: Declare sys_*stat64 prototypes if __ARCH_WANT_(COMPAT_)STAT64 compat: Declare compat_sys_sigpending and compat_sys_sigprocmask prototypes arm64: uapi: expose our struct ucontext to the uapi headers smp, ARM64: Kill SMP single function call interrupt arm64: Emulate SETEND for AArch32 tasks arm64: Consolidate hotplug notifier for instruction emulation arm64: Track system support for mixed endian EL0 arm64: implement generic IOMMU configuration arm64: Combine coherent and non-coherent swiotlb dma_ops ...
2015-01-27arm64: Implement the compat_sys_call_table in CCatalin Marinas
Unlike the sys_call_table[], the compat one was implemented in sys32.S making it impossible to notice discrepancies between the number of compat syscalls and the __NR_compat_syscalls macro, the latter having to be defined in asm/unistd.h as including asm/unistd32.h would cause conflicts on __NR_* definitions. With this patch, incorrect __NR_compat_syscalls values will result in a build-time error. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com>