path: root/kernel
Age | Commit message | Author
2016-11-27 | bpf: add owner_prog_type and accounted mem to array map's fdinfo | Daniel Borkmann
Allow for checking the owner_prog_type of a program array map. In some cases bpf(2) can return -EINVAL /after/ the verifier passed and did all the rewrites of the bpf program. The reason that lets us fail at this late stage is that program array maps are incompatible. Allow users to inspect this earlier after they got the map fd through BPF_OBJ_GET command. tc will get support for this. Also, display how much we charged the map with regards to RLIMIT_MEMLOCK. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 | bpf: drop unnecessary context cast from BPF_PROG_RUN | Daniel Borkmann
For a long time now, bpf_func has not been only about struct sk_buff * as input. Make it generic as void *, so that callers don't need to cast for it each time they call BPF_PROG_RUN(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 | module: extend 'rodata=off' boot cmdline parameter to module mappings | AKASHI Takahiro
The current "rodata=off" parameter disables read-only kernel mappings under CONFIG_DEBUG_RODATA: commit d2aa1acad22f ("mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings") This patch is a logical extension to module mappings, i.e. read-only mappings at module loading can be disabled even if CONFIG_DEBUG_SET_MODULE_RONX is enabled (mainly for debug use). Please note, however, that it only affects RO/RW permissions, keeping NX set. This is the first step to make CONFIG_DEBUG_SET_MODULE_RONX mandatory (always-on) in the future as CONFIG_DEBUG_RODATA on x86 and arm64. Suggested-by: and Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Link: http://lkml.kernel.org/r/20161114061505.15238-1-takahiro.akashi@linaro.org Signed-off-by: Jessica Yu <jeyu@redhat.com>
2016-11-26 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller
udplite conflict is resolved by taking what 'net-next' did which removed the backlog receive method assignment, since it is no longer necessary. Two entries were added to the non-priv ethtool operations switch statement, one in 'net' and one in 'net-next', so simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-26 | module: Fix a comment above strong_try_module_get() | Miroslav Benes
The comment above strong_try_module_get() function is not true anymore. Return values changed with commit c9a3ba55bb5d ("module: wait for dependent modules doing init."). Signed-off-by: Miroslav Benes <mbenes@suse.cz> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1611161635330.12580@pobox.suse.cz [jeyu@redhat.com: style fixes to make checkpatch.pl happy] Signed-off-by: Jessica Yu <jeyu@redhat.com>
2016-11-26 | module: When modifying a module's text ignore modules which are going away too | Aaron Tomlin
By default, during the access permission modification of a module's core and init pages, we only ignore modules that are malformed. However, for a module which is going away, it does not make sense to change its text to RO since the module should be RW before deallocation. This patch makes set_all_modules_text_ro() skip modules which are going away too. Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Link: http://lkml.kernel.org/r/1477560966-781-1-git-send-email-atomlin@redhat.com [jeyu@redhat.com: add comment as suggested by Steven Rostedt] Signed-off-by: Jessica Yu <jeyu@redhat.com>
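The skip rule in this commit can be modelled in user space. The following is a minimal sketch, not the kernel's code: `struct module_model` and `set_all_text_ro()` are hypothetical stand-ins, and only the state names mirror the kernel's module states.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's module states. */
enum module_state {
    MODULE_STATE_LIVE,
    MODULE_STATE_COMING,
    MODULE_STATE_GOING,
    MODULE_STATE_UNFORMED,
};

/* Hypothetical, reduced model of struct module. */
struct module_model {
    enum module_state state;
    bool text_ro;
};

/*
 * Mark the text of every module read-only, except modules that are not
 * fully formed yet or that are already going away. Returns the number of
 * modules whose flag was set.
 */
static size_t set_all_text_ro(struct module_model *mods, size_t n)
{
    size_t changed = 0;

    for (size_t i = 0; i < n; i++) {
        if (mods[i].state == MODULE_STATE_UNFORMED ||
            mods[i].state == MODULE_STATE_GOING)
            continue;
        mods[i].text_ro = true;
        changed++;
    }
    return changed;
}
```

A module in the GOING state keeps its text writable here, which matches the commit's reasoning that such a module should be RW before deallocation.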
2016-11-26 | module: Ensure a module's state is set accordingly during module coming cleanup code | Aaron Tomlin
In load_module(), in the event of an error, e.g. unknown module parameter(s) specified, we go to perform some module coming clean up operations. At this point the module is still in a "formed" state when it is actually going away. This patch updates the module's state accordingly to ensure anyone on the module_notify_list waiting for a module going away notification will be notified accordingly. Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Link: http://lkml.kernel.org/r/1476980293-19062-2-git-send-email-atomlin@redhat.com Signed-off-by: Jessica Yu <jeyu@redhat.com>
2016-11-26 | taint/module: Clean up global and module taint flags handling | Petr Mladek
The commit 66cc69e34e86a231 ("Fix: module signature vs tracepoints: add new TAINT_UNSIGNED_MODULE") updated module_taint_flags() to potentially print one more character. But it did not increase the size of the corresponding buffers in m_show() and print_modules(). We have recently made the same mistake when adding a taint flag for livepatching, see https://lkml.kernel.org/r/cfba2c823bb984690b73572aaae1db596b54a082.1472137475.git.jpoimboe@redhat.com Also struct module uses an incompatible type for mod->taints flags. It survived from the commit 2bc2d61a9638dab670d ("[PATCH] list module taint flags in Oops/panic"). At that time, "int" was used for the global taint flags. But only the global taint flags were later changed to "unsigned long" by the commit 25ddbb18aae33ad2 ("Make the taint flags reliable"). This patch defines TAINT_FLAGS_COUNT that can be used to create arrays and buffers of the right size. Note that we could not use enum because the taint flag indexes are used also in assembly code. Then it reworks the table that describes the taint flags. The TAINT_* numbers can be used as the index. Instead, we add information if the taint flag is also shown per-module. Finally, it uses "unsigned long", bit operations, and the updated taint_flags table also for mod->taints. It is not optimal because only a few taint flags can be printed by module_taint_flags(). But better be on the safe side. IMHO, it is not worth the optimization and this is a good compromise. Signed-off-by: Petr Mladek <pmladek@suse.com> Link: http://lkml.kernel.org/r/1474458442-21581-1-git-send-email-pmladek@suse.com [jeyu@redhat.com: fix broken lkml link in changelog] Signed-off-by: Jessica Yu <jeyu@redhat.com>
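The table-driven approach in this commit can be illustrated with a small user-space sketch. It assumes only three taint bits and a hypothetical helper `module_taint_string()`; the field names and the flag characters are illustrative and are not claimed to match the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative subset; the kernel defines more taint bits than these. */
#define TAINT_PROPRIETARY_MODULE 0
#define TAINT_FORCED_MODULE      1
#define TAINT_CPU_OUT_OF_SPEC    2
#define TAINT_FLAGS_COUNT        3

struct taint_flag {
    char c_true;   /* character shown when the bit is set */
    char c_false;  /* character shown when the bit is clear */
    bool module;   /* is this flag also reported per module? */
};

/* The TAINT_* numbers double as indexes into the table. */
static const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
    [TAINT_PROPRIETARY_MODULE] = { 'P', 'G', true },
    [TAINT_FORCED_MODULE]      = { 'F', ' ', true },
    [TAINT_CPU_OUT_OF_SPEC]    = { 'S', ' ', false },
};

/*
 * Render the per-module taint characters for the bits set in 'taints'.
 * 'buf' must hold at least TAINT_FLAGS_COUNT + 1 bytes, so the buffer
 * size follows the table size automatically. Returns the string length.
 */
static size_t module_taint_string(unsigned long taints, char *buf)
{
    size_t len = 0;

    for (int i = 0; i < TAINT_FLAGS_COUNT; i++) {
        if (taint_flags[i].module && (taints & (1UL << i)))
            buf[len++] = taint_flags[i].c_true;
    }
    buf[len] = '\0';
    return len;
}
```

Sizing callers' buffers as `char buf[TAINT_FLAGS_COUNT + 1]` is the point of the constant: adding a flag to the table cannot silently overflow a hand-sized buffer.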
2016-11-25 | bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands | Daniel Mack
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and BPF_PROG_DETACH which allow attaching and detaching eBPF programs to a target. On the API level, the target could be anything that has an fd in userspace, hence the name of the field in union bpf_attr is called 'target_fd'. When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is expected to be a valid file descriptor of a cgroup v2 directory which has the bpf controller enabled. These are the only use-cases implemented by this patch at this point, but more can be added. If a program of the given type already exists in the given cgroup, the program is swapped atomically, so userspace does not have to drop an existing program first before installing a new one, which would otherwise leave a gap in which no program is attached. For more information on the propagation logic to subcgroups, please refer to the bpf cgroup controller implementation. The API is guarded by CAP_NET_ADMIN. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 | cgroup: add support for eBPF programs | Daniel Mack
This patch adds two sets of eBPF program pointers to struct cgroup. One set holds the programs that are directly pinned to a cgroup, the other holds the programs that are effective for it. To illustrate the logic behind that, assume the following example cgroup hierarchy.
  A - B - C
        \ D - E
If only B has a program attached, it will be effective for B, C, D and E. If D then attaches a program itself, that will be effective for both D and E, and the program in B will only affect B and C. Only one program of a given type is effective for a cgroup. Attaching and detaching programs will be done through the bpf(2) syscall. For now, ingress and egress inet socket filtering are the only supported use-cases. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
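The inheritance rule in this commit's example hierarchy can be sketched as a lookup of the nearest ancestor with a program attached. This is a simplified model under stated assumptions: `struct cgroup_model` and `effective_prog()` are hypothetical names, a program is represented by a plain string, and the walk is done on demand, whereas the commit describes keeping a separate set of effective pointers in each cgroup.

```c
#include <stddef.h>

/* Hypothetical, reduced model of a cgroup with one program slot. */
struct cgroup_model {
    struct cgroup_model *parent;  /* NULL for the root */
    const char *attached;         /* program pinned directly here, or NULL */
};

/*
 * The effective program is the one attached to the cgroup itself or,
 * failing that, to its nearest ancestor. NULL means no program applies.
 */
static const char *effective_prog(const struct cgroup_model *cg)
{
    for (; cg != NULL; cg = cg->parent) {
        if (cg->attached != NULL)
            return cg->attached;
    }
    return NULL;
}
```

With B and D as siblings' parents as in the example (C and D are children of B, E is a child of D), attaching to B alone makes B's program effective for B, C, D and E; attaching to D afterwards changes the result only for D and E.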
2016-11-24 | cpufreq: schedutil: Rectify comment in sugov_irq_work() function | Viresh Kumar
This patch rectifies a comment present in sugov_irq_work() function to follow proper grammar. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-11-24 | sched: Extend scheduler's asym packing | Tim Chen
We generalize the scheduler's asym packing to provide an ordering of the cpu beyond just the cpu number. This allows the use of the ASYM_PACKING scheduler machinery to move loads to preferred CPU in a sched domain. The preference is defined with the cpu priority given by arch_asym_cpu_priority(cpu). We also record the most preferred cpu in a sched group when we build the cpu's capacity for fast lookup of preferred cpu during load balancing. Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: linux-pm@vger.kernel.org Cc: jolsa@redhat.com Cc: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: bp@suse.de Link: http://lkml.kernel.org/r/0e73ae12737dfaafa46c07066cc7c5d3f1675e46.1479844244.git.tim.c.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
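The preference ordering in this commit can be sketched as follows. The function name `arch_asym_cpu_priority()` comes from the commit message; the priority table, the comparison helper `sched_asym_prefer()` and `group_preferred_cpu()` are assumptions made for this illustration only.

```c
#include <stddef.h>

/* Hypothetical per-CPU priorities; a higher value means more preferred. */
static const int cpu_priority[] = { 10, 40, 20, 30 };

/* Stand-in for the architecture hook named in the commit message. */
static int arch_asym_cpu_priority(int cpu)
{
    return cpu_priority[cpu];
}

/* Nonzero if CPU 'a' is preferred over CPU 'b'. */
static int sched_asym_prefer(int a, int b)
{
    return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
}

/*
 * Record the most preferred CPU of a group once, so that load balancing
 * can look it up instead of rescanning the group. 'n' must be at least 1.
 */
static int group_preferred_cpu(const int *cpus, size_t n)
{
    int best = cpus[0];

    for (size_t i = 1; i < n; i++) {
        if (sched_asym_prefer(cpus[i], best))
            best = cpus[i];
    }
    return best;
}
```

Ordering by an architecture-supplied priority instead of by CPU number is what lets the same machinery serve hardware where the favoured core is not the lowest-numbered one.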
2016-11-24 | sched/autogroup: Fix 64-bit kernel nice level adjustment | Mike Galbraith
Michael Kerrisk reported:
> Regarding the previous paragraph... My tests indicate
> that writing *any* value to the autogroup [nice priority level]
> file causes the task group to get a lower priority.
Because autogroup didn't call the then meaningless scale_load()... Autogroup nice level adjustment has been broken ever since load resolution was increased for 64-bit kernels. Use scale_load() to scale group weight.
Michael Kerrisk tested this patch to fix the problem:
> Applied and tested against 4.9-rc6 on an Intel u7 (4 cores).
> Test setup:
>
> Terminal window 1: running 40 CPU burner jobs
> Terminal window 2: running 40 CPU burner jobs
> Terminal window 3: running 1 CPU burner job
>
> Demonstrated that:
> * Writing "0" to the autogroup file for TW1 now causes no change
>   to the rate at which the process on the terminal consume CPU.
> * Writing -20 to the autogroup file for TW1 caused those processes
>   to get the lion's share of CPU while TW2 TW3 get a tiny amount.
> * Writing -20 to the autogroup files for TW1 and TW3 allowed the
>   process on TW3 to get as much CPU as it was getting as when
>   the autogroup nice values for both terminals were 0.
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-man <linux-man@vger.kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1479897217.4306.6.camel@gmx.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
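The arithmetic behind this bug can be shown with a short sketch. It assumes that 64-bit kernels keep load weights with 10 extra bits of resolution and uses the nice-to-weight values 1024 (nice 0) and 88761 (nice -20); `share_permille()` is a hypothetical helper that only approximates how two competing groups divide CPU time, not the scheduler's actual computation.

```c
/* Assumed extra resolution of load weights on 64-bit kernels. */
#define SCHED_FIXEDPOINT_SHIFT 10

/* Weights for nice 0 and nice -20, before scaling. */
#define NICE_0_WEIGHT        1024ULL
#define NICE_MINUS_20_WEIGHT 88761ULL

/* Convert a user-visible weight to the higher internal resolution. */
static unsigned long long scale_load(unsigned long long w)
{
    return w << SCHED_FIXEDPOINT_SHIFT;
}

/*
 * Approximate CPU share, in parts per thousand, of a group with weight
 * 'w' competing against one peer group with weight 'peer'.
 */
static unsigned long long share_permille(unsigned long long w,
                                         unsigned long long peer)
{
    return 1000ULL * w / (w + peer);
}
```

Against a peer holding the default scaled weight `scale_load(NICE_0_WEIGHT)`, an unscaled nice -20 weight is still far smaller than the peer's, so the group ends up with well under half of the CPU; this is consistent with the report that writing any value lowered the group's priority. Passing the weight through `scale_load()` first restores the expected ordering.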
2016-11-23 | ring-buffer: Force rb_end_commit() and rb_set_commit_to_write() inline | Steven Rostedt (Red Hat)
Both rb_end_commit() and rb_set_commit_to_write() are in the fast path of the ring buffer recording. Make sure they are always inlined. Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 | ring-buffer: Force rb_update_write_stamp() to be inlined | Steven Rostedt (Red Hat)
The function rb_update_write_stamp() is in the hotpath of the ring buffer recording. Make sure that it is inlined as well. There are not many places that call it. Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 | ring-buffer: Force inline of hotpath helper functions | Steven Rostedt (Red Hat)
There are several small helper functions in ring_buffer.c that are used in the hot path. For some reason, even though they are marked inline, gcc tends not to enforce it. Make sure these functions are always inlined. Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
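For readers unfamiliar with forced inlining, the difference between a plain `inline` hint and a forced inline can be sketched with the GCC/Clang function attribute. The macro name `ALWAYS_INLINE`, the struct and the helper below are hypothetical and are not taken from ring_buffer.c.

```c
/*
 * 'inline' alone is only a hint that the compiler may ignore. The
 * always_inline attribute (GCC and Clang) makes inlining mandatory.
 */
#define ALWAYS_INLINE inline __attribute__((always_inline))

/* Hypothetical event header, loosely modelled on a length-prefixed record. */
struct event_model {
    unsigned int type_len;   /* payload length in 4-byte words, or 0 */
    unsigned int array[4];   /* array[0] holds the length when type_len is 0 */
};

/* Small hot-path helper: cheap enough that a call would dominate its cost. */
static ALWAYS_INLINE unsigned int event_length(const struct event_model *e)
{
    return e->type_len ? e->type_len * 4 : e->array[0];
}
```

The trade-off is code size: every call site gets its own copy of the body, which is why this treatment is reserved for small helpers on hot paths.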
2016-11-23 | tracing: Make __buffer_unlock_commit() always_inline | Steven Rostedt (Red Hat)
The function __buffer_unlock_commit() is called in a few places outside of trace.c. But for the most part, it should really be inlined, as it is in the hot path of the trace_events. For the callers outside of trace.c, create a new function trace_buffer_unlock_commit_nostack(), as the reason it was used was to avoid the stack tracing that trace_buffer_unlock_commit() could do. Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 | tracing: Make tracepoint_printk a static_key | Steven Rostedt (Red Hat)
Currently, when tracepoint_printk is set (enabled by the "tp_printk" kernel command line), it causes trace events to print via printk(). This is a very dangerous operation, but is useful for debugging. The issue is, it's seldom used, but it is always checked even if it's not enabled by the kernel command line. Instead of having this feature called by a branch against a variable, turn that variable into a static key, and this will remove the test and jump. To simplify things, the functions output_printk() and trace_event_buffer_commit() were moved from trace_events.c to trace.c. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 | ring-buffer: Always inline rb_event_data() | Steven Rostedt (Red Hat)
The rb_event_data() is the fast path of getting the ring buffer data from an event. Externally, ring_buffer_event_data() is used to access this function. But unfortunately, rb_event_data() is not inlined, and calling ring_buffer_event_data() causes that function to be called again. Force rb_event_data() to be inlined to lower the number of operations needed when calling ring_buffer_event_data(). Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 | ring-buffer: Make rb_reserve_next_event() always inlined | Steven Rostedt (Red Hat)
The function rb_reserve_next_event() is called by two functions: ring_buffer_lock_reserve() and ring_buffer_write(). This is in a very hot path of the tracing code, and it is best that these are not function calls. The two callers are basically wrappers for rb_reserve_next_event(). Removing the function calls can save execution time in the hotpath of tracing. Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 | tracing: Create an always_inlined __trace_buffer_lock_reserve() | Steven Rostedt (Red Hat)
As Andi Kleen pointed out in the Link below, the trace events have quite a bit of code execution. A lot of that happens to be calling functions, where some of them should simply be inlined. One of these functions happens to be trace_buffer_lock_reserve() which is also a global, but it is used throughout the file it is defined in. Create a __trace_buffer_lock_reserve() that is always inlined that the file can benefit from. Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23 | Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull perf fixes from Ingo Molnar: "Six fixes for bugs that were found via fuzzing, and a trivial hw-enablement patch for AMD Family-17h CPU PMUs"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Allow only a single PMU/box within an events group
  perf/x86/intel: Cure bogus unwind from PEBS entries
  perf/x86: Restore TASK_SIZE check on frame pointer
  perf/core: Fix address filter parser
  perf/x86: Add perf support for AMD family-17h processors
  perf/x86/uncore: Fix crash by removing bogus event_list[] handling for SNB client uncore IMC
  perf/core: Do not set cpuctx->cgrp for unscheduled cgroups
2016-11-23 | sched/fair: Clean up the tunable parameter definitions | Ingo Molnar
No change in functionality:
- align the default values vertically to make them easier to scan
- standardize the 'default:' lines
- fix minor whitespace typos
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-23 | sched/dl: Fix comment in pick_next_task_dl() | T.Zhou
Fix cut & paste oversight: s/pull_rt_task/pull_dl_task Signed-off-by: T.Zhou <t1zhou@163.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: juri.lelli@gmail.com Link: http://lkml.kernel.org/r/20161123004832.GA2983@geo Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-23 | Merge branch 'linus' into sched/core, to pick up fixes | Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-23 | Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu | Ingo Molnar
Pull RCU updates from Paul E. McKenney:
- Documentation updates, yet again just simple changes.
- Miscellaneous fixes, including a change to call_rcu()'s rcu_head alignment check.
- Security-motivated list consistency checks, which are disabled by default behind DEBUG_LIST.
- Torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22 | tracing: Add error checks to creation of event files | Steven Rostedt (Red Hat)
The creation of the set_event_pid file was assigned to a variable "entry" but that variable was never used. Ideally, it should be used to check if the file was created and warn if it was not. The files header_page and header_event should also be checked, with a warning if they fail to be created. The "enable" file was moved up, as it is a more crucial file to have and a hard failure (return -ENOMEM) should be returned if it is not created. Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-22 | tracing: Add hook to function tracing for other subsystems to use | Chunyan Zhang
Currently, function traces can only be exported to the ring buffer. This adds a trace_export concept which can process traces and export them to a registered destination as an addition to the current one that outputs to Ftrace - i.e. ring buffer. In this way, if we want function traces to be sent to other destinations rather than only to the ring buffer, we just need to register a new trace_export and implement its own .write() function for writing traces to storage. With this patch, only function tracing (trace type is TRACE_FN) is supported. Link: http://lkml.kernel.org/r/1479715043-6534-2-git-send-email-zhang.chunyan@linaro.org Signed-off-by: Chunyan Zhang <zhang.chunyan@linaro.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
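The registration pattern in this commit can be sketched as a list of exporters, each supplying a write callback. This is a single-threaded user-space model: the struct layout, `register_export()`, `unregister_export()`, `export_trace()` and the sample `count_bytes()` exporter are all hypothetical, and the locking a kernel implementation would need is left out.

```c
#include <stddef.h>

/* Hypothetical exporter: a destination that receives raw trace records. */
struct trace_export_model {
    struct trace_export_model *next;
    void (*write)(const void *buf, unsigned int len);
};

static struct trace_export_model *export_list;

/* Add an exporter to the front of the list. */
static void register_export(struct trace_export_model *e)
{
    e->next = export_list;
    export_list = e;
}

/* Remove an exporter; returns 0 on success, -1 if it was not registered. */
static int unregister_export(struct trace_export_model *e)
{
    for (struct trace_export_model **p = &export_list; *p; p = &(*p)->next) {
        if (*p == e) {
            *p = e->next;
            e->next = NULL;
            return 0;
        }
    }
    return -1;
}

/* Hand one record to every registered exporter; returns how many saw it. */
static unsigned int export_trace(const void *buf, unsigned int len)
{
    unsigned int n = 0;

    for (struct trace_export_model *e = export_list; e; e = e->next) {
        e->write(buf, len);
        n++;
    }
    return n;
}

/* Sample exporter that just counts the bytes it was given. */
static unsigned int bytes_seen;

static void count_bytes(const void *buf, unsigned int len)
{
    (void)buf;
    bytes_seen += len;
}
```

A new destination only has to provide its own write function and register itself; the tracing side does not need to know where the data ends up.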
2016-11-22 | sched/nohz: Convert to hotplug state machine | Sebastian Andrzej Siewior
Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: rt@linuxtronix.de Link: http://lkml.kernel.org/r/20161117183541.8588-14-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-22 | exec: Ensure mm->user_ns contains the execed files | Eric W. Biederman
When the user namespace support was merged the need to prevent ptrace from revealing the contents of an unreadable executable was overlooked. Correct this oversight by ensuring that the executed file or files are in mm->user_ns, by adjusting mm->user_ns. Use the new function privileged_wrt_inode_uidgid to see if the executable is a member of the user namespace, and as such if having CAP_SYS_PTRACE in the user namespace should allow tracing the executable. If not update mm->user_ns to the parent user namespace until an appropriate parent is found. Cc: stable@vger.kernel.org Reported-by: Jann Horn <jann@thejh.net> Fixes: 9e4a36ece652 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
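The walk toward the parent namespace described in this commit can be modelled as below. The sketch assumes that a namespace maps a single contiguous range of ids and that the root namespace maps every id used here; `struct userns_model`, `ns_maps_id()`, `privileged_wrt_ids()` and `pick_mm_user_ns()` are hypothetical names, loosely inspired by the privileged_wrt_inode_uidgid function named in the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user namespace mapping one contiguous id range. */
struct userns_model {
    struct userns_model *parent;  /* NULL for the initial namespace */
    unsigned int first_id;
    unsigned int count;
};

static bool ns_maps_id(const struct userns_model *ns, unsigned int id)
{
    return id >= ns->first_id && id - ns->first_id < ns->count;
}

/* A file belongs to a namespace if both its owner and group are mapped. */
static bool privileged_wrt_ids(const struct userns_model *ns,
                               unsigned int uid, unsigned int gid)
{
    return ns_maps_id(ns, uid) && ns_maps_id(ns, gid);
}

/*
 * Starting from the namespace of the process calling exec, move to the
 * parent until the executed file's ids are mapped, or the root is reached.
 */
static const struct userns_model *
pick_mm_user_ns(const struct userns_model *ns,
                unsigned int uid, unsigned int gid)
{
    while (ns->parent != NULL && !privileged_wrt_ids(ns, uid, gid))
        ns = ns->parent;
    return ns;
}
```

The effect is that a capability held only inside a nested namespace no longer counts toward tracing an executable whose owner lives outside that namespace.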
2016-11-22 | ptrace: Don't allow accessing an undumpable mm | Eric W. Biederman
It is the reasonable expectation that if an executable file is not readable there will be no way for a user without special privileges to read the file. This is enforced in ptrace_attach but if ptrace is already attached before exec there is no enforcement for read-only executables. As the only way to read such an mm is through access_process_vm, spin a variant called ptrace_access_vm that will fail if the target process is not being ptraced by the current process, or the current process did not have sufficient privileges when ptracing began to read the target process's mm. In the ptrace implementations replace access_process_vm by ptrace_access_vm. There remain several ptrace sites that still use access_process_vm as they are reading the target executable's instructions (for kernel consumption) or register stacks. As such it does not appear necessary to add a permission check to those calls. This bug has always existed in Linux. Fixes: v1.0 Cc: stable@vger.kernel.org Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller
All conflicts were simple overlapping changes except perhaps for the Thunder driver. That driver has a change_mtu method explicitly for sending a message to the hardware. If that fails it returns an error. Normally a driver doesn't need an ndo_change_mtu method because those are usually just range changes, which are now handled generically. But since this extra operation is needed in the Thunder driver, it has to stay. However, if the message send fails we have to restore the original MTU before the change because the entire call chain expects that if an error is thrown by ndo_change_mtu then the MTU did not change. Therefore code is added to nicvf_change_mtu to remember the original MTU, and to restore it upon nicvf_update_hw_max_frs() failure. Signed-off-by: David S. Miller <davem@davemloft.net>
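The save-and-restore logic described for the Thunder driver can be sketched as follows. This is a user-space model under stated assumptions: `struct netdev_model`, `update_hw_max_frs()`, `change_mtu()` and the failure knob are hypothetical, and the error codes are chosen for illustration rather than taken from the driver.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced model of a network device. */
struct netdev_model {
    int mtu;
};

/* Test knob standing in for a failing message to the hardware. */
static bool hw_update_should_fail;

static int update_hw_max_frs(const struct netdev_model *dev, int mtu)
{
    (void)dev;
    (void)mtu;
    return hw_update_should_fail ? -EIO : 0;
}

/*
 * Callers expect that an error from the MTU change leaves the MTU as it
 * was, so remember the original value and put it back on failure.
 */
static int change_mtu(struct netdev_model *dev, int new_mtu)
{
    int orig_mtu = dev->mtu;

    dev->mtu = new_mtu;
    if (update_hw_max_frs(dev, new_mtu)) {
        dev->mtu = orig_mtu;
        return -EINVAL;
    }
    return 0;
}
```

Without the restore step, a failed hardware update would leave the software state advertising an MTU that the hardware never accepted.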
2016-11-22 | ptrace: Capture the ptracer's creds not PT_PTRACE_CAP | Eric W. Biederman
When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was overlooked. This can result in incorrect behavior when an application like strace traces an exec of a setuid executable. Further, PT_PTRACE_CAP does not have enough information for making good security decisions as it does not report which user namespace the capability is in. This has already allowed one mistake through insufficient granularity. I found this issue when I was testing another corner case of exec and discovered that I could not get strace to set PT_PTRACE_CAP even when running strace as root with a full set of caps. This change fixes the above issue with strace allowing stracing as root a setuid executable without disabling setuid. More fundamentally this change allows what is allowable at all times, by using the correct information in its decision. Cc: stable@vger.kernel.org Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 | mm: Add a user_ns owner to mm_struct and fix ptrace permission checks | Eric W. Biederman
During exec dumpable is cleared if the file that is being executed is not readable by the user executing the file. A bug in ptrace_may_access allows reading the file if the executable happens to enter into a subordinate user namespace (aka clone(CLONE_NEWUSER), unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER)). This problem is fixed with only necessary userspace breakage by adding a user namespace owner to mm_struct, captured at the time of exec, so it is clear in which user namespace CAP_SYS_PTRACE must be present in to be able to safely give read permission to the executable. The function ptrace_may_access is modified to verify that the ptracer has CAP_SYS_PTRACE in task->mm->user_ns instead of task->cred->user_ns. This ensures that if the task changes its cred into a subordinate user namespace it does not become ptraceable. The function ptrace_attach is modified to only set PT_PTRACE_CAP when CAP_SYS_PTRACE is held over task->mm->user_ns. The intent of PT_PTRACE_CAP is to be a flag to note that whatever permission changes the task might go through the tracer has sufficient permissions for it not to be an issue. task->cred->user_ns is always the same as or descendant of mm->user_ns, which guarantees that having CAP_SYS_PTRACE over mm->user_ns is the worst case for the task's credentials. To prevent regressions mm->dumpable and mm->user_ns are not considered when a task has no mm, as simply failing ptrace_may_attach causes regressions in privileged applications attempting to read things such as /proc/<pid>/stat. Cc: stable@vger.kernel.org Acked-by: Kees Cook <keescook@chromium.org> Tested-by: Cyrill Gorcunov <gorcunov@openvz.org> Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-11-22 | locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted | Pan Xinhui
An over-committed guest with more vCPUs than pCPUs suffers heavy overhead in the two spin_on_owner functions. This is caused by the lock holder preemption issue. Break out of the loop if the vCPU is preempted: if vcpu_is_preempted(cpu) is true.
test-case: perf record -a perf bench sched messaging -g 400 -p && perf report
before patch:
  20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
   8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
   4.12% sched-messaging [kernel.vmlinux] [k] system_call
   3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
   2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
   2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
   2.00% sched-messaging [kernel.vmlinux] [k] osq_lock
after patch:
   9.99% sched-messaging [kernel.vmlinux] [k] mutex_unlock
   5.28% sched-messaging [unknown] [H] 0xc0000000000768e0
   4.27% sched-messaging [kernel.vmlinux] [k] __copy_tofrom_user_power7
   3.77% sched-messaging [kernel.vmlinux] [k] copypage_power7
   3.24% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
   3.02% sched-messaging [kernel.vmlinux] [k] system_call
   2.69% sched-messaging [kernel.vmlinux] [k] wait_consider_task
Tested-by: Juergen Gross <jgross@suse.com> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: David.Laight@ACULAB.COM Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: benh@kernel.crashing.org Cc: boqun.feng@gmail.com Cc: bsingharora@gmail.com Cc: dave@stgolabs.net Cc: kernellwp@gmail.com Cc: konrad.wilk@oracle.com Cc: linuxppc-dev@lists.ozlabs.org Cc: mpe@ellerman.id.au Cc: paulmck@linux.vnet.ibm.com Cc: paulus@samba.org Cc: rkrcmar@redhat.com Cc: virtualization@lists.linux-foundation.org Cc: will.deacon@arm.com Cc: xen-devel-request@lists.xenproject.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-4-git-send-email-xinhui.pan@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
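The early exit from the spin loop can be sketched in user space. `vcpu_is_preempted()` is the interface named in the commit message, but here it is a stub backed by a hypothetical array; `struct owner_model`, `spin_on_owner()` and the bounded iteration count are simplifications so that the example terminates in a single thread.

```c
#include <stdbool.h>

/* Hypothetical view of the lock owner. */
struct owner_model {
    int cpu;      /* vCPU the owner last ran on */
    bool on_cpu;  /* is the owner currently scheduled in the guest? */
};

/* Stub for the hypervisor query; indexed by vCPU number. */
static bool preempted[4];

static bool vcpu_is_preempted(int cpu)
{
    return preempted[cpu];
}

/*
 * Spin while the owner is making progress. Returns true if spinning is
 * still worthwhile after 'max_iters' polls, false if the caller should
 * stop spinning and block, because the owner is not running in the guest
 * or its vCPU has been preempted by the host.
 */
static bool spin_on_owner(const struct owner_model *owner, int max_iters)
{
    for (int i = 0; i < max_iters; i++) {
        if (!owner->on_cpu || vcpu_is_preempted(owner->cpu))
            return false;
    }
    return true;
}
```

Spinning on an owner whose vCPU is not running cannot succeed until the host schedules that vCPU again, so giving up early returns the physical CPU to work that can make progress.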
2016-11-22 | locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock() | Pan Xinhui
An over-committed guest with more vCPUs than pCPUs has a heavy overload in osq_lock(). This is because if vCPU-A holds the osq lock and yields out, vCPU-B ends up waiting for per_cpu node->locked to be set. IOW, vCPU-B waits for vCPU-A to run and unlock the osq lock. Use the new vcpu_is_preempted(cpu) interface to detect if a vCPU is currently running or not, and break out of the spin-loop if it has been preempted.
test case: $ perf record -a perf bench sched messaging -g 400 -p && perf report
before patch:
  18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
  12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
   5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
   3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
   3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
   3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
   2.49% sched-messaging [kernel.vmlinux] [k] system_call
after patch:
  20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
   8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
   4.12% sched-messaging [kernel.vmlinux] [k] system_call
   3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
   2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
   2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
   2.00% sched-messaging [kernel.vmlinux] [k] osq_lock
Suggested-by: Boqun Feng <boqun.feng@gmail.com> Tested-by: Juergen Gross <jgross@suse.com> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: David.Laight@ACULAB.COM Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: benh@kernel.crashing.org Cc: bsingharora@gmail.com Cc: dave@stgolabs.net Cc: kernellwp@gmail.com Cc: konrad.wilk@oracle.com Cc: linuxppc-dev@lists.ozlabs.org Cc: mpe@ellerman.id.au Cc: paulmck@linux.vnet.ibm.com Cc: paulus@samba.org Cc: rkrcmar@redhat.com Cc: virtualization@lists.linux-foundation.org Cc: will.deacon@arm.com Cc: xen-devel-request@lists.xenproject.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1478077718-37424-3-git-send-email-xinhui.pan@linux.vnet.ibm.com [ Translated to English. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22 | Merge branch 'linus' into locking/core, to pick up fixes | Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22 | sched/autogroup: Do not use autogroup->tg in zombie threads | Oleg Nesterov
Exactly because for_each_thread() in autogroup_move_group() can't see it and update its ->sched_task_group before _put() and possibly free(). So the exiting task needs another sched_move_task() before exit_notify() and we need to re-introduce the PF_EXITING (or similar) check removed by the previous change for another reason. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hartsjc@redhat.com Cc: vbendel@redhat.com Cc: vlovejoy@redhat.com Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task()Oleg Nesterov
The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove it, but see the next patch.

However the comment is correct in that autogroup_move_group() must always change task_group() for every thread so the sysctl_ check is very wrong; we can race with cgroups and even sys_setsid() is not safe because a task running with task_group() == ag->tg must participate in refcounting:

	#include <assert.h>
	#include <fcntl.h>
	#include <signal.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);

		assert(sctl > 0);
		if (fork()) {
			wait(NULL); // destroy the child's ag/tg
			pause();
		}

		assert(pwrite(sctl, "1\n", 2, 0) == 2);
		assert(setsid() > 0);
		if (fork())
			pause();

		kill(getppid(), SIGKILL);
		sleep(1);

		// The child has gone, the grandchild runs with kref == 1
		assert(pwrite(sctl, "0\n", 2, 0) == 2);
		assert(setsid() > 0);

		// runs with the freed ag/tg
		for (;;)
			sleep(1);

		return 0;
	}

crashes the kernel. It doesn't really need sleep(1), it doesn't matter if autogroup_move_group() actually frees the task_group or this happens later.

Reported-by: Vern Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hartsjc@redhat.com
Cc: vbendel@redhat.com
Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22genirq/msi: Drop artificial PCI dependencyMarc Zyngier
The generic MSI layer doesn't have any PCI ties anymore, and the build hack should have been removed some time ago. Fixes: d9109698be6e ("genirq: Introduce msi_domain_alloc/free_irqs()") Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Link: http://lkml.kernel.org/r/1479806476-20801-1-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds
Pull sparc fixes from David Miller: 1) With modern networking cards we can run out of 32-bit DMA space, so support 64-bit DMA addressing when possible on sparc64. From Dave Tushar. 2) Some signal frame validation checks are inverted on sparc32, fix from Andreas Larsson. 3) Lockdep tables can get too large in some circumstances on sparc64, add a way to adjust the size a bit. From Babu Moger. 4) Fix NUMA node probing on some sun4v systems, from Thomas Tai. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: drop duplicate header scatterlist.h lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc sunbmac: Fix compiler warning sunqe: Fix compiler warnings sparc64: Enable 64-bit DMA sparc64: Enable sun4v dma ops to use IOMMU v2 APIs sparc64: Bind PCIe devices to use IOMMU v2 service sparc64: Initialize iommu_map_table and iommu_pool sparc64: Add ATU (new IOMMU) support sparc64: Add FORCE_MAX_ZONEORDER and default to 13 sparc64: fix compile warning section mismatch in find_node() sparc32: Fix inverted invalid_frame_pointer checks on sigreturns sparc64: Fix find_node warning if numa node cannot be found
2016-11-21PM / sleep / ACPI: Use the ACPI_FADT_LOW_POWER_S0 flagRafael J. Wysocki
Modify the ACPI system sleep support setup code to select suspend-to-idle as the default system sleep state if the ACPI_FADT_LOW_POWER_S0 flag is set in the FADT and the default sleep state was not selected from the kernel command line. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Mario Limonciello <mario.limonciello@dell.com>
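The selection rule described above is small enough to write down as a decision function. This is only a sketch of the rule as the commit message states it: the enum, the function name and the boolean parameters are hypothetical and do not correspond to the kernel's actual symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two suspend flavours discussed here. */
enum sketch_sleep { SKETCH_SLEEP_S2IDLE, SKETCH_SLEEP_DEEP };

/*
 * Returns the default sleep state after the ACPI setup code has run.
 *   current            - the default chosen before ACPI setup
 *   chosen_on_cmdline  - true if the user picked a default on the command line
 *   fadt_low_power_s0  - true if ACPI_FADT_LOW_POWER_S0 is set in the FADT
 */
enum sketch_sleep sketch_acpi_default_sleep(enum sketch_sleep current,
                                            bool chosen_on_cmdline,
                                            bool fadt_low_power_s0)
{
    assert(current == SKETCH_SLEEP_S2IDLE || current == SKETCH_SLEEP_DEEP);

    /* The flag only matters when the user did not express a preference. */
    if (!chosen_on_cmdline && fadt_low_power_s0)
        return SKETCH_SLEEP_S2IDLE;
    return current;
}
```

The point of the design is that a command-line choice always wins; the firmware flag merely changes what "no choice" means.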
2016-11-21PM / sleep: System sleep state selection interface reworkRafael J. Wysocki
There are systems in which the platform doesn't support any special sleep states, so suspend-to-idle (PM_SUSPEND_FREEZE) is the only available system sleep state. However, some user space frameworks only use the "mem" and (sometimes) "standby" sleep state labels, so the users of those systems need to modify user space in order to be able to use system suspend at all and that may be a pain in practice.

Commit 0399d4db3edf (PM / sleep: Introduce command line argument for sleep state enumeration) attempted to address this problem by adding a command line argument to change the meaning of the "mem" string in /sys/power/state to make it trigger suspend-to-idle (instead of suspend-to-RAM).

However, there also are systems in which the platform does support special sleep states, but suspend-to-idle is the preferred one anyway (it even may save more energy than the platform-provided sleep states in some cases) and the above commit doesn't help in those cases.

For this reason, rework the system sleep state selection interface again (but preserve backwards compatibility). Namely, add a new sysfs file, /sys/power/mem_sleep, that will control the system suspend mode triggered by writing "mem" to /sys/power/state (in analogy with what /sys/power/disk does for hibernation). Make it select suspend-to-RAM ("deep" sleep) by default (if supported) and fall back to suspend-to-idle ("s2idle") otherwise and add a new command line argument, mem_sleep_default, allowing that default to be overridden if need be.

At the same time, drop the relative_sleep_states command line argument that doesn't make sense any more.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Mario Limonciello <mario.limonciello@dell.com>
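From user space, the new file would typically be read to find out which mode "mem" currently maps to. The sketch below assumes that /sys/power/mem_sleep marks the selected mode with square brackets, the way /sys/power/disk does (for example "s2idle [deep]"); that format, and the helper name sketch_selected_sleep_mode(), are assumptions for illustration rather than something the commit message spells out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) mode out of a line such as
 * "s2idle [deep]" into out. Returns 0 on success, -1 if no bracketed
 * entry is found or it does not fit into the buffer.
 */
int sketch_selected_sleep_mode(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL && outlen > 0);

    const char *open = strchr(line, '[');
    if (open == NULL)
        return -1;

    const char *close = strchr(open + 1, ']');
    if (close == NULL)
        return -1;

    size_t len = (size_t)(close - open - 1);
    if (len == 0 || len >= outlen)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}
```

A caller would read one line from /sys/power/mem_sleep with fgets() and pass it in; writing a mode name back to the same file is how the selection would be changed.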
2016-11-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller:

 1) Clear congestion control state when changing algorithms on an existing socket, from Florian Westphal.

 2) Fix register bit values in altr_tse_pcs portion of stmmac driver, from Jia Jie Ho.

 3) Fix PTP handling in stmmac driver for GMAC4, from Giuseppe CAVALLARO.

 4) Fix udplite multicast delivery handling, it ignores the udp_table parameter passed into the lookups, from Pablo Neira Ayuso.

 5) Synchronize the space estimated by rtnl_vfinfo_size and the space actually used by rtnl_fill_vfinfo. From Sabrina Dubroca.

 6) Fix memory leak in fib_info when splitting nodes, from Alexander Duyck.

 7) If a driver does a napi_hash_del() explicitly and not via netif_napi_del(), it must perform RCU synchronization as needed. Fix this in virtio-net and bnxt drivers, from Eric Dumazet.

 8) Likewise, it is not necessary to invoke napi_hash_del() if we are also doing netif_napi_del() in the same code path. Remove such calls from be2net and cxgb4 drivers, also from Eric Dumazet.

 9) Don't allocate an ID in peernet2id_alloc() if the netns is dead, from WANG Cong.

10) Fix OF node and device struct leaks in of_mdio, from Johan Hovold.

11) We cannot cache routes in ip6_tunnel when using inherited traffic classes, from Paolo Abeni.

12) Fix several crashes and leaks in cpsw driver, from Johan Hovold.

13) Splice operations cannot use freezable blocking calls in AF_UNIX, from WANG Cong.

14) Link dump filtering by master device and kind support added an error in loop index updates during the dump if we actually do filter, fix from Zhang Shengju.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits) tcp: zero ca_priv area when switching cc algorithms net: l2tp: Treat NET_XMIT_CN as success in l2tp_eth_dev_xmit ethernet: stmmac: make DWMAC_STM32 depend on it's associated SoC tipc: eliminate obsolete socket locking policy description rtnl: fix the loop index update error in rtnl_dump_ifinfo() l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind() net: macb: add check for dma mapping error in start_xmit() rtnetlink: fix FDB size computation netns: fix get_net_ns_by_fd(int pid) typo af_unix: conditionally use freezable blocking calls in read net: ethernet: ti: cpsw: fix fixed-link phy probe deferral net: ethernet: ti: cpsw: add missing sanity check net: ethernet: ti: cpsw: fix secondary-emac probe error path net: ethernet: ti: cpsw: fix of_node and phydev leaks net: ethernet: ti: cpsw: fix deferred probe net: ethernet: ti: cpsw: fix mdio device reference leak net: ethernet: ti: cpsw: fix bad register access in probe error path net: sky2: Fix shutdown crash cfg80211: limit scan results cache size net sched filters: pass netlink message flags in event notification ...
2016-11-21bpf, mlx5: fix mlx5e_create_rq taking reference on progDaniel Borkmann
In mlx5e_create_rq(), when creating a new queue, we call bpf_prog_add() but without checking the return value. bpf_prog_add() can fail since 92117d8443bc ("bpf: fix refcnt overflow"), so we really must check it. Take the reference right when we assign it to the rq from priv->xdp_prog, and just drop the reference on error path. Destruction in mlx5e_destroy_rq() looks good, though. Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
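The pattern described above (take the reference at the point of assignment, check that taking it succeeded, and drop it again if later setup fails) can be sketched in plain C. Everything here is hypothetical: the sketch_ types, the small SKETCH_MAX_REFS cap and the fail_later switch only model the shape of the fix; in particular the sketch returns an int where the real bpf_prog_add() has its own return convention.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_REFS 4   /* hypothetical, deliberately tiny refcount cap */

struct sketch_prog { int refcnt; };
struct sketch_rq   { struct sketch_prog *xdp_prog; };

/* Take n references; fails (returns -1) instead of overflowing the cap. */
int sketch_prog_add(struct sketch_prog *prog, int n)
{
    if (prog->refcnt + n > SKETCH_MAX_REFS)
        return -1;
    prog->refcnt += n;
    return 0;
}

void sketch_prog_put(struct sketch_prog *prog)
{
    prog->refcnt--;
}

/*
 * Create a queue that may carry a program. fail_later simulates an
 * unrelated setup step failing after the reference has been taken.
 */
int sketch_create_rq(struct sketch_rq *rq, struct sketch_prog *prog, int fail_later)
{
    assert(rq != NULL);
    rq->xdp_prog = NULL;

    if (prog != NULL) {
        if (sketch_prog_add(prog, 1) != 0)
            return -1;              /* the return value must be checked */
        rq->xdp_prog = prog;        /* reference taken right where it is assigned */
    }

    if (fail_later) {
        if (rq->xdp_prog != NULL) { /* error path: drop what we took */
            sketch_prog_put(rq->xdp_prog);
            rq->xdp_prog = NULL;
        }
        return -1;
    }
    return 0;
}
```

Keeping the take and the assignment adjacent means the error path has exactly one thing to undo, which is the property the commit message is after.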
2016-11-21perf/core: Fix address filter parserAlexander Shishkin
The token table passed into match_token() must be null-terminated, which it currently is not in perf's address filter string parser, as caught by Vince's perf_fuzzer and KASAN. It doesn't blow up otherwise because of the alignment padding of the table to the next element in the .rodata, which is luck. Fix this by adding a null-terminator to the token table. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: stable@vger.kernel.org # v4.7+ Fixes: 375637bc524 ("perf/core: Introduce address range filtering") Link: http://lkml.kernel.org/r/877f81f264.fsf@ashishki-desk.ger.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
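Why the terminator matters is easiest to see in a matcher that walks the table until it meets a NULL pattern. The following is a simplified user-space sketch, not the kernel's match_token(): struct sketch_token, the token names and the exact-match comparison are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token ids; SKETCH_TOK_ERR is what the terminator carries. */
enum { SKETCH_TOK_ERR = -1, SKETCH_TOK_FILTER = 1, SKETCH_TOK_START, SKETCH_TOK_STOP };

struct sketch_token {
    int id;
    const char *pattern;   /* NULL marks the end of the table */
};

const struct sketch_token sketch_tokens[] = {
    { SKETCH_TOK_FILTER, "filter" },
    { SKETCH_TOK_START,  "start"  },
    { SKETCH_TOK_STOP,   "stop"   },
    { SKETCH_TOK_ERR,    NULL     },  /* without this entry the loop below runs off the array */
};

/* Return the id of the first entry whose pattern equals s, else the terminator's id. */
int sketch_match(const char *s, const struct sketch_token *table)
{
    assert(s != NULL && table != NULL);

    const struct sketch_token *t;
    for (t = table; t->pattern != NULL; t++) {
        if (strcmp(s, t->pattern) == 0)
            return t->id;
    }
    return t->id;   /* t now points at the terminator entry */
}
```

The loop has no length argument, so the only thing that stops it on an unknown string is the NULL entry; a table without one reads whatever follows it in memory.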
2016-11-21sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_QWaiman Long
Currently the wake_q data structure is defined by the WAKE_Q() macro. This macro, however, looks like a function doing something as "wake" is a verb. Even checkpatch.pl was confused as it reported warnings like

  WARNING: Missing a blank line after declarations
  #548: FILE: kernel/futex.c:3665:
  +	int ret;
  +	WAKE_Q(wake_q);

This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies what the macro is doing and eliminates the checkpatch.pl warnings.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
[ Resolved conflict and added missing rename. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
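A declaration macro of this kind expands to a variable definition with an initializer, which is why a DEFINE_ prefix reads more naturally than a verb. The sketch below shows the general shape with hypothetical sketch_ names; it is not the kernel's wake_q implementation.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_wake_node {
    struct sketch_wake_node *next;
};

struct sketch_wake_q {
    struct sketch_wake_node *first;    /* head of the singly linked list */
    struct sketch_wake_node **lastp;   /* where the next node gets linked in */
};

/* Declares AND initializes an empty queue named `name`. */
#define SKETCH_DEFINE_WAKE_Q(name) \
    struct sketch_wake_q name = { NULL, &name.first }

/* Append node to the tail of q. */
void sketch_wake_q_add(struct sketch_wake_q *q, struct sketch_wake_node *node)
{
    assert(q != NULL && node != NULL);
    node->next = NULL;
    *q->lastp = node;
    q->lastp = &node->next;
}

/* Count the queued nodes. */
size_t sketch_wake_q_len(const struct sketch_wake_q *q)
{
    size_t n = 0;
    for (const struct sketch_wake_node *it = q->first; it != NULL; it = it->next)
        n++;
    return n;
}
```

Because the macro is a declaration, it belongs with the other local variable declarations at the top of a block, which is exactly the placement checkpatch.pl complained about when the name looked like a function call.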
2016-11-20audit: fix formatting of AUDIT_CONFIG_CHANGE eventsSteve Grubb
The AUDIT_CONFIG_CHANGE events sometimes use an op= field. The current code logs the value of the field with quotes. This field is documented to not be encoded, so it should not have quotes. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> [PM: reformatted commit description to make checkpatch.pl happy] Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-11-20audit: skip sessionid sentinel value when auto-incrementingRichard Guy Briggs
The value (unsigned int)-1 is used as a sentinel to indicate the sessionID is unset. Skip this value when the session_id value wraps. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
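The wrap-around handling can be sketched as a counter that steps over the sentinel. This is a simplified, single-threaded illustration: the function and macro names are hypothetical, and a plain unsigned int stands in for whatever atomic counter the real code uses.

```c
#include <assert.h>
#include <stddef.h>

/* Sentinel meaning "session ID not set", as described in the commit message. */
#define SKETCH_SESSIONID_UNSET ((unsigned int)-1)

/*
 * Advance the counter and return the new session ID, never returning
 * the sentinel. Unsigned arithmetic wraps, so the value after the
 * sentinel is 0.
 */
unsigned int sketch_next_sessionid(unsigned int *counter)
{
    assert(counter != NULL);

    unsigned int id = ++*counter;
    if (id == SKETCH_SESSIONID_UNSET)
        id = ++*counter;    /* skip the sentinel; wraps around to 0 */
    return id;
}
```

Starting two steps below the sentinel, successive calls yield (unsigned int)-2, then 0, then 1: the sentinel itself is never handed out.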
2016-11-18lockdep: Limit static allocations if PROVE_LOCKING_SMALL is definedBabu Moger
Reduce the size of the data structures for lockdep entries by half if PROVE_LOCKING_SMALL is defined. This is used only for sparc. Signed-off-by: Babu Moger <babu.moger@oracle.com> Acked-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: David S. Miller <davem@davemloft.net>