summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-05-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
The MSCC bug fix in 'net' had to be slightly adjusted because the register accesses are done slightly differently in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24Merge tag 'sched-urgent-2020-05-24' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "A set of fixes for the scheduler: - Fix handling of throttled parents in enqueue_task_fair() completely. The recent fix overlooked a corner case where the first iteration terminates due to an entity already being on the runqueue which makes the list management incomplete and later triggers the assertion which checks for completeness. - Fix a similar problem in unthrottle_cfs_rq(). - Show the correct uclamp values in procfs which prints the effective value twice instead of requested and effective" * tag 'sched-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list sched/debug: Fix requested task uclamp values shown in procfs sched/fair: Fix enqueue_task_fair() warning some more
2020-05-22printk: handle blank console arguments passed in.Shreyas Joshi
If uboot passes a blank string to console_setup then it results in a trashed memory. Ultimately, the kernel crashes during freeing up the memory. This fix checks if there is a blank parameter being passed to console_setup from uboot. In case it detects that the console parameter is blank then it doesn't setup the serial device and it gracefully exits. Link: https://lore.kernel.org/r/20200522065306.83-1-shreyas.joshi@biamp.com Signed-off-by: Shreyas Joshi <shreyas.joshi@biamp.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> [pmladek@suse.com: Better format the commit message and code, remove unnecessary brackets.] Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-05-21bpf: Verifier track null pointer branch_taken with JNE and JEQJohn Fastabend
Currently, when considering the branches that may be taken for a jump instruction if the register being compared is a pointer the verifier assumes both branches may be taken. But, if the jump instruction is comparing if a pointer is NULL we have this information in the verifier encoded in the reg->type so we can do better in these cases. Specifically, these two common cases can be handled. * If the instruction is BPF_JEQ and we are comparing against a zero value. This test is 'if ptr == 0 goto +X' then using the type information in reg->type we can decide if the ptr is not null. This allows us to avoid pushing both branches onto the stack and instead only use the != 0 case. For example PTR_TO_SOCK and PTR_TO_SOCK_OR_NULL encode the null pointer. Note if the type is PTR_TO_SOCK_OR_NULL we can not learn anything. And also if the value is non-zero we learn nothing because it could be any arbitrary value a different pointer for example * If the instruction is BPF_JNE and ware comparing against a zero value then a similar analysis as above can be done. The test in asm looks like 'if ptr != 0 goto +X'. Again using the type information if the non null type is set (from above PTR_TO_SOCK) we know the jump is taken. In this patch we extend is_branch_taken() to consider this extra information and to return only the branch that will be taken. This resolves a verifier issue reported with C code like the following. See progs/test_sk_lookup_kern.c in selftests. sk = bpf_sk_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0); bpf_printk("sk=%d\n", sk ? 1 : 0); if (sk) bpf_sk_release(sk); return sk ? TC_ACT_OK : TC_ACT_UNSPEC; In the above the bpf_printk() will resolve the pointer from PTR_TO_SOCK_OR_NULL to PTR_TO_SOCK. Then the second test guarding the release will cause the verifier to walk both paths resulting in the an unreleased sock reference. See verifier/ref_tracking.c in selftests for an assembly version of the above. After the above additional logic is added the C code above passes as expected. Reported-by: Andrey Ignatov <rdna@fb.com> Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/159009164651.6313.380418298578070501.stgit@john-Precision-5820-Tower
2020-05-21xsk: Move xskmap.c to net/xdp/Björn Töpel
The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related code. Also, move AF_XDP struct definitions, and function declarations only used by AF_XDP internals into net/xdp/xsk.h. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200520192103.355233-3-bjorn.topel@gmail.com
2020-05-21Merge branch 'nfsd-5.8' of git://linux-nfs.org/~cel/cel-2.6 into ↵J. Bruce Fields
for-5.8-incoming Highlights of this series: * Remove serialization of sending RPC/RDMA Replies * Convert the TCP socket send path to use xdr_buf::bvecs (pre-requisite for RPC-on-TLS) * Fix svcrdma backchannel sendto return code * Convert a number of dprintk call sites to use tracepoints * Fix the "suggest braces around empty body in an 'else' statement" warning
2020-05-21kernel/printk: add kmsg SEEK_CUR handlingBruno Meneguele
Userspace libraries, e.g. glibc's dprintf(), perform a SEEK_CUR operation over any file descriptor requested to make sure the current position isn't pointing to junk due to previous manipulation of that same fd. And whenever that fd doesn't have support for such operation, the userspace code expects -ESPIPE to be returned. However, when the fd in question references the /dev/kmsg interface, the current kernel code state returns -EINVAL instead, causing an unexpected behavior in userspace: in the case of glibc, when -ESPIPE is returned it gets ignored and the call completes successfully, while returning -EINVAL forces dprintf to fail without performing any action over that fd: if (_IO_SEEKOFF (fp, (off64_t)0, _IO_seek_cur, _IOS_INPUT|_IOS_OUTPUT) == _IO_pos_BAD && errno != ESPIPE) return NULL; With this patch we make sure to return the correct value when SEEK_CUR is requested over kmsg and also add some kernel doc information to formalize this behavior. Link: https://lore.kernel.org/r/20200317103344.574277-1-bmeneg@redhat.com Cc: linux-kernel@vger.kernel.org Cc: rostedt@goodmis.org, Cc: David.Laight@ACULAB.COM Signed-off-by: Bruno Meneguele <bmeneg@redhat.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-05-21printk: Fix a typo in comment "interator"->"iterator"Ethon Paul
There is a typo in comment, fix it. Signed-off-by: Ethon Paul <ethp@qq.com> Cc: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-05-21irqdomain: Allow software nodes for IRQ domain creationAndy Shevchenko
In some cases we need to have an IRQ domain created out of software node. One of such cases is DesignWare GPIO driver when it's instantiated from half-baked ACPI table (alas, we can't fix it for devices which are few years on market) and thus using software nodes to quirk this. But the driver is using IRQ domains based on per GPIO port firmware nodes, which are in the above case software ones. This brings a warning message to be printed [ 73.957183] irq: Invalid fwnode type for irqdomain and creates an anonymous IRQ domain without a debugfs entry. Allowing software nodes to be valid for IRQ domains rids us of the warning and debugs gets correctly populated. % ls -1 /sys/kernel/debug/irq/domains/ ... intel-quark-dw-apb-gpio:portA Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> [maz: refactored commit message] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200520164927.39090-3-andriy.shevchenko@linux.intel.com
2020-05-21irqdomain: Get rid of special treatment for ACPI in __irq_domain_add()Andy Shevchenko
Now that __irq_domain_add() is able to better deals with generic fwnodes, there is no need to special-case ACPI anymore. Get rid of the special treatment for ACPI. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200520164927.39090-2-andriy.shevchenko@linux.intel.com
2020-05-21irqdomain: Make __irq_domain_add() less OF-dependentAndy Shevchenko
__irq_domain_add() relies in some places on the fact that the fwnode can be only of type OF. This prevents refactoring of the code to support other types of fwnode. Make it less OF-dependent by switching it to use the fwnode directly where it makes sense. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20200520164927.39090-1-andriy.shevchenko@linux.intel.com
2020-05-20bpf: Prevent mmap()'ing read-only maps as writableAndrii Nakryiko
As discussed in [0], it's dangerous to allow mapping BPF map, that's meant to be frozen and is read-only on BPF program side, because that allows user-space to actually store a writable view to the page even after it is frozen. This is exacerbated by BPF verifier making a strong assumption that contents of such frozen map will remain unchanged. To prevent this, disallow mapping BPF_F_RDONLY_PROG mmap()'able BPF maps as writable, ever. [0] https://lore.kernel.org/bpf/CAEf4BzYGWYhXdp6BJ7_=9OQPJxQpgug080MMjdSB72i9R+5c6g@mail.gmail.com/ Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/bpf/20200519053824.1089415-1-andriin@fb.com
2020-05-20audit: add subj creds to NETFILTER_CFG record toRichard Guy Briggs
Some table unregister actions seem to be initiated by the kernel to garbage collect unused tables that are not initiated by any userspace actions. It was found to be necessary to add the subject credentials to cover this case to reveal the source of these actions. A sample record: The uid, auid, tty, ses and exe fields have not been included since they are in the SYSCALL record and contain nothing useful in the non-user context. Here are two sample orphaned records: type=NETFILTER_CFG msg=audit(2020-05-20 12:14:36.505:5) : table=filter family=ipv4 entries=0 op=register pid=1 subj=kernel comm=swapper/0 type=NETFILTER_CFG msg=audit(2020-05-20 12:15:27.701:301) : table=nat family=bridge entries=0 op=unregister pid=30 subj=system_u:system_r:kernel_t:s0 comm=kworker/u4:1 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-05-20exec: Teach prepare_exec_creds how exec treats uids & gidsEric W. Biederman
It is almost possible to use the result of prepare_exec_creds with no modifications during exec. Update prepare_exec_creds to initialize the suid and the fsuid to the euid, and the sgid and the fsgid to the egid. This is all that is needed to handle the common case of exec when nothing special like a setuid exec is happening. That this preserves the existing behavior of exec can be verified by examing bprm_fill_uid and cap_bprm_set_creds. This change makes it clear that the later parts of exec that update bprm->cred are just need to handle special cases such as setuid exec and change of domains. Link: https://lkml.kernel.org/r/871rng22dm.fsf_-_@x220.int.ebiederm.org Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-20fs: rename pipe_buf ->steal to ->try_stealChristoph Hellwig
And replace the arcane return value convention with a simple bool where true means success and false means failure. [AV: braino fix folded in] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20fs: make the pipe_buf_operations ->confirm operation optionalChristoph Hellwig
Just return 0 for success if it is not present. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20fs: make the pipe_buf_operations ->steal operation optionalChristoph Hellwig
Just return 1 for failure if it is not present. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20trace: remove tracing_pipe_buf_opsChristoph Hellwig
tracing_pipe_buf_ops has identical ops to default_pipe_buf_ops, so use that instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20Merge branch 'topic/uaccess-ppc' into nextMichael Ellerman
Merge our uaccess-ppc topic branch. It is based on the uaccess topic branch that we're sharing with Viro. This includes the addition of user_[read|write]_access_begin(), as well as some powerpc specific changes to our uaccess routines that would conflict badly if merged separately.
2020-05-20Merge tag 'noinstr-x86-kvm-2020-05-16' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD
2020-05-19tracing/probe: reverse arguments to list_addJulia Lawall
Elsewhere in the file, the function trace_kprobe_has_same_kprobe uses a trace_probe_event.probes object as the second argument of list_for_each_entry, ie as a list head, while the list_for_each_entry iterates over the list fields of the trace_probe structures, making them the list elements. So, exchange the arguments on the list_add call to put the list head in the second argument. Since both list_head structures were just initialized, this problem did not cause any loss of information. Link: https://lkml.kernel.org/r/1588879808-24488-1-git-send-email-Julia.Lawall@inria.fr Fixes: 60d53e2c3b75 ("tracing/probe: Split trace_event related data from trace_probe") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-19ftrace: show debugging information when panic_on_warn setCheng Jian
When an anomaly is detected in the function call modification code, ftrace_bug() is called to disable function tracing as well as give some warn and information that may help debug the problem. But currently, we call FTRACE_WARN_ON_ONCE() first in ftrace_bug(), so when panic_on_warn is set, we can't see the debugging information here. Call FTRACE_WARN_ON_ONCE() at the end of ftrace_bug() to ensure that the debugging information is displayed first. after this patch, the dmesg looks like: ------------[ ftrace bug ]------------ ftrace failed to modify [<ffff800010081004>] bcm2835_handle_irq+0x4/0x58 actual: 1f:20:03:d5 Setting ftrace call site to call ftrace function ftrace record flags: 80000001 (1) expected tramp: ffff80001009d6f0 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1635 at kernel/trace/ftrace.c:2078 ftrace_bug+0x204/0x238 Kernel panic - not syncing: panic_on_warn set ... CPU: 2 PID: 1635 Comm: sh Not tainted 5.7.0-rc5-00033-gb922183867f5 #14 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x1b0 show_stack+0x20/0x30 dump_stack+0xc0/0x10c panic+0x16c/0x368 __warn+0x120/0x160 report_bug+0xc8/0x160 bug_handler+0x28/0x98 brk_handler+0x70/0xd0 do_debug_exception+0xcc/0x1ac el1_sync_handler+0xe4/0x120 el1_sync+0x7c/0x100 ftrace_bug+0x204/0x238 Link: https://lkml.kernel.org/r/20200515100828.7091-1-cj.chengjian@huawei.com Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-19locking/lockdep: Replace zero-length array with flexible-arrayGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200507185804.GA15036@embeddedor
2020-05-19perf/core: Replace zero-length array with flexible-arrayGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200511201227.GA14041@embeddedor
2020-05-19sched: Defend cfs and rt bandwidth quota against overflowHuaixin Chang
When users write some huge number into cpu.cfs_quota_us or cpu.rt_runtime_us, overflow might happen during to_ratio() shifts of schedulable checks. to_ratio() could be altered to avoid unnecessary internal overflow, but min_cfs_quota_period is less than 1 << BW_SHIFT, so a cutoff would still be needed. Set a cap MAX_BW for cfs_quota_us and rt_runtime_us to prevent overflow. Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Link: https://lkml.kernel.org/r/20200425105248.60093-1-changhuaixin@linux.alibaba.com
2020-05-19sched/cpuacct: Fix charge cpuacct.usage_sysMuchun Song
The user_mode(task_pt_regs(tsk)) always return true for user thread, and false for kernel thread. So it means that the cpuacct.usage_sys is the time that kernel thread uses not the time that thread uses in the kernel mode. We can try get_irq_regs() first, if it is NULL, then we can fall back to task_pt_regs(). Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200420070453.76815-1-songmuchun@bytedance.com
2020-05-19sched/fair: Replace zero-length array with flexible-arrayGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200507192141.GA16183@embeddedor
2020-05-19sched/pelt: Sync util/runnable_sum with PELT window when propagatingVincent Guittot
update_tg_cfs_*() propagate the impact of the attach/detach of an entity down into the cfs_rq hierarchy and must keep the sync with the current pelt window. Even if we can't sync child cfs_rq and its group se, we can sync the group se and its parent cfs_rq with current position in the PELT window. In fact, we must keep them sync in order to stay also synced with others entities and group entities that are already attached to the cfs_rq. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200506155301.14288-1-vincent.guittot@linaro.org
2020-05-19sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()Muchun Song
The cpuacct_charge() and cpuacct_account_field() are called with rq->lock held, and this means preemption(and IRQs) are indeed disabled, so it is safe to use __this_cpu_*() to allow for better code-generation. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200507031039.32615-1-songmuchun@bytedance.com
2020-05-19sched/fair: Optimize enqueue_task_fair()Vincent Guittot
enqueue_task_fair jumps to enqueue_throttle label when cfs_rq_of(se) is throttled which means that se can't be NULL in such case and we can move the label after the if (!se) statement. Futhermore, the latter can be removed because se is always NULL when reaching this point. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200513135502.4672-1-vincent.guittot@linaro.org
2020-05-19Merge branch 'sched/urgent'Peter Zijlstra
2020-05-19sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq listVincent Guittot
Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair() are quite close and follow the same sequence for enqueuing an entity in the cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as enqueue_task_fair(). This fixes a problem already faced with the latter and add an optimization in the last for_each_sched_entity loop. Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning) Reported-by Tao Zhou <zohooouoto@zoho.com.cn> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Ben Segall <bsegall@google.com> Link: https://lkml.kernel.org/r/20200513135528.4742-1-vincent.guittot@linaro.org
2020-05-19sched/debug: Fix requested task uclamp values shown in procfsPavankumar Kondeti
The intention of commit 96e74ebf8d59 ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs") was to print requested and effective task uclamp values. The requested values printed are read from p->uclamp, which holds the last effective values. Fix this by printing the values from p->uclamp_req. Fixes: 96e74ebf8d59 ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs") Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/1589115401-26391-1-git-send-email-pkondeti@codeaurora.org
2020-05-19sched/fair: Fix enqueue_task_fair() warning some morePhil Auld
sched/fair: Fix enqueue_task_fair warning some more The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning) did not fully resolve the issues with the rq->tmp_alone_branch != &rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where the first for_each_sched_entity loop exits due to on_rq, having incompletely updated the list. In this case the second for_each_sched_entity loop can further modify se. The later code to fix up the list management fails to do what is needed because se does not point to the sched_entity which broke out of the first loop. The list is not fixed up because the throttled parent was already added back to the list by a task enqueue in a parallel child hierarchy. Address this by calling list_add_leaf_cfs_rq if there are throttled parents while doing the second for_each_sched_entity loop. Fixes: fe61468b2cb ("sched/fair: Fix enqueue_task_fair warning") Suggested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20200512135222.GC2201@lorien.usersys.redhat.com
2020-05-19bpf: Add get{peer, sock}name attach types for sock_addrDaniel Borkmann
As stated in 983695fa6765 ("bpf: fix unconnected udp hooks"), the objective for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be transparent to applications. In Cilium we make use of these hooks [0] in order to enable E-W load balancing for existing Kubernetes service types for all Cilium managed nodes in the cluster. Those backends can be local or remote. The main advantage of this approach is that it operates as close as possible to the socket, and therefore allows to avoid packet-based NAT given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses. This also allows to expose NodePort services on loopback addresses in the host namespace, for example. As another advantage, this also efficiently blocks bind requests for applications in the host namespace for exposed ports. However, one missing item is that we also need to perform reverse xlation for inet{,6}_getname() hooks such that we can return the service IP/port tuple back to the application instead of the remote peer address. The vast majority of applications does not bother about getpeername(), but in a few occasions we've seen breakage when validating the peer's address since it returns unexpectedly the backend tuple instead of the service one. Therefore, this trivial patch allows to customise and adds a getpeername() as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order to address this situation. Simple example: # ./cilium/cilium service list ID Frontend Service Type Backend 1 1.2.3.4:80 ClusterIP 1 => 10.0.0.10:80 Before; curl's verbose output example, no getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] After; with getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed peer to the context similar as in inet{,6}_getname() fashion, but API-wise this is suboptimal as it always enforces programs having to test for ctx->peer which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split. Similarly, the checked return code is on tnum_range(1, 1), but if a use case comes up in future, it can easily be changed to return an error code instead. Helper and ctx member access is the same as with connect/sendmsg/etc hooks. [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-19PM: hibernate: Split off snapshot dev optionDomenico Andreoli
Make it possible to reduce the attack surface in case the snapshot device is not to be used from userspace. Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-19PM: hibernate: Incorporate concurrency handlingDomenico Andreoli
Hibernation concurrency handling is currently delegated to user.c, where it's also used for regulating the access to the snapshot device. In the prospective of making user.c a separate configuration option, such mutual exclusion is brought into hibernate.c and made available through accessor helpers hereby introduced. Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-19pipe: Add notification lossage handlingDavid Howells
Add handling for loss of notifications by having read() insert a loss-notification message after it has read the pipe buffer that was last in the ring when the loss occurred. Lossage can come about either by running out of notification descriptors or by running out of space in the pipe ring. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19pipe: Allow buffers to be marked read-whole-or-error for notificationsDavid Howells
Allow a buffer to be marked such that read() must return the entire buffer in one go or return ENOBUFS. Multiple buffers can be amalgamated into a single read, but a short read will occur if the next "whole" buffer won't fit. This is useful for watch queue notifications to make sure we don't split a notification across multiple reads, especially given that we need to fabricate an overrun record under some circumstances - and that isn't in the buffers. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19pipe: Add general notification queue supportDavid Howells
Make it possible to have a general notification queue built on top of a standard pipe. Notifications are 'spliced' into the pipe and then read out. splice(), vmsplice() and sendfile() are forbidden on pipes used for notifications as post_one_notification() cannot take pipe->mutex. This means that notifications could be posted in between individual pipe buffers, making iov_iter_revert() difficult to effect. The way the notification queue is used is: (1) An application opens a pipe with a special flag and indicates the number of messages it wishes to be able to queue at once (this can only be set once): pipe2(fds, O_NOTIFICATION_PIPE); ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); (2) The application then uses poll() and read() as normal to extract data from the pipe. read() will return multiple notifications if the buffer is big enough, but it will not split a notification across buffers - rather it will return a short read or EMSGSIZE. Notification messages include a length in the header so that the caller can split them up. Each message has a header that describes it: struct watch_notification { __u32 type:24; __u32 subtype:8; __u32 info; }; The type indicates the source (eg. mount tree changes, superblock events, keyring changes, block layer events) and the subtype indicates the event type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field indicates a number of things, including the entry length, an ID assigned to a watchpoint contributing to this buffer and type-specific flags. Supplementary data, such as the key ID that generated an event, can be attached in additional slots. The maximum message size is 127 bytes. Messages may not be padded or aligned, so there is no guarantee, for example, that the notification type will be on a 4-byte bounary. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19kprobes: Prevent probes in .noinstr.text sectionThomas Gleixner
Instrumentation is forbidden in the .noinstr.text section. Make kprobes respect this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lkml.kernel.org/r/20200505134100.179862032@linutronix.de
2020-05-19rcu: Provide __rcu_is_watching()Thomas Gleixner
Same as rcu_is_watching() but without the preempt_disable/enable() pair inside the function. It is merked noinstr so it ends up in the non-instrumentable text section. This is useful for non-preemptible code especially in the low level entry section. Using rcu_is_watching() there results in a call to the preempt_schedule_notrace() thunk which triggers noinstr section warnings in objtool. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200512213810.518709291@linutronix.de
2020-05-19rcu: Provide rcu_irq_exit_preempt()Thomas Gleixner
Interrupts and exceptions invoke rcu_irq_enter() on entry and need to invoke rcu_irq_exit() before they either return to the interrupted code or invoke the scheduler due to preemption. The general assumption is that RCU idle code has to have preemption disabled so that a return from interrupt cannot schedule. So the return from interrupt code invokes rcu_irq_exit() and preempt_schedule_irq(). If there is any imbalance in the rcu_irq/nmi* invocations or RCU idle code had preemption enabled then this goes unnoticed until the CPU goes idle or some other RCU check is executed. Provide rcu_irq_exit_preempt() which can be invoked from the interrupt/exception return code in case that preemption is enabled. It invokes rcu_irq_exit() and contains a few sanity checks in case that CONFIG_PROVE_RCU is enabled to catch such issues directly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134904.364456424@linutronix.de
2020-05-19rcu: Make RCU IRQ enter/exit functions rely on in_nmi()Paul E. McKenney
The rcu_nmi_enter_common() and rcu_nmi_exit_common() functions take an "irq" parameter that indicates whether these functions have been invoked from an irq handler (irq==true) or an NMI handler (irq==false). However, recent changes have applied notrace to a few critical functions such that rcu_nmi_enter_common() and rcu_nmi_exit_common() many now rely on in_nmi(). Note that in_nmi() works no differently than before, but rather that tracing is now prohibited in code regions where in_nmi() would incorrectly report NMI state. Therefore remove the "irq" parameter and inline rcu_nmi_enter_common() and rcu_nmi_exit_common() into rcu_nmi_enter() and rcu_nmi_exit(), respectively. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.617130349@linutronix.de
2020-05-19rcu/tree: Mark the idle relevant functions noinstrThomas Gleixner
These functions are invoked from context tracking and other places in the low level entry code. Move them into the .noinstr.text section to exclude them from instrumentation. Mark the places which are safe to invoke traceable functions with instrumentation_begin/end() so objtool won't complain. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lkml.kernel.org/r/20200505134100.575356107@linutronix.de
2020-05-19sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exceptionPeter Zijlstra
SuperH is the last remaining user of arch_ftrace_nmi_{enter,exit}(), remove it from the generic code and into the SuperH code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20200505134101.248881738@linutronix.de
2020-05-19lockdep: Always inline lockdep_{off,on}()Peter Zijlstra
These functions are called {early,late} in nmi_{enter,exit} and should not be traced or probed. They are also puny, so 'inline' them. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.048523500@linutronix.de
2020-05-19printk: Disallow instrumenting print_nmi_enter()Peter Zijlstra
It happens early in nmi_enter(), no tracing, probing or other funnies allowed. Specifically as nmi_enter() will be used in do_debug(), which would cause recursive exceptions when kprobed. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134101.139720912@linutronix.de
2020-05-19printk: Prepare for nested printk_nmi_enter()Petr Mladek
There is plenty of space in the printk_context variable. Reserve one byte there for the NMI context to be on the safe side. It should never overflow. The BUG_ON(in_nmi() == NMI_MASK) in nmi_enter() will trigger much earlier. Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Link: https://lkml.kernel.org/r/20200505134100.681374113@linutronix.de
2020-05-19Merge tag 'noinstr-lds-2020-05-19' into core/rcuThomas Gleixner
Get the noinstr section and annotation markers to base the RCU parts on.