summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2021-11-18locking/rwsem: Disable preemption for spinning regionYanfei Xu
[ Upstream commit 7cdacc5f52d68a9370f182c844b5b3e6cc975cc1 ] The spinning region rwsem_spin_on_owner() should not be preempted, however the rwsem_down_write_slowpath() invokes it and don't disable preemption. Fix it by adding a pair of preempt_disable/enable(). Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> [peterz: Fix CONFIG_RWSEM_SPIN_ON_OWNER=n build] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20211013134154.1085649-3-yanfei.xu@windriver.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18tracing: Disable "other" permission bits in the tracefs filesSteven Rostedt (VMware)
[ Upstream commit 21ccc9cd72116289469e5519b6159c675a2fa58f ] When building the files in the tracefs file system, do not by default set any permissions for OTH (other). This will make it easier for admins who want to define a group for accessing tracefs and not having to first disable all the permission bits for "other" in the file system. As tracing can leak sensitive information, it should never by default allowing all users access. An admin can still set the permission bits for others to have access, which may be useful for creating a honeypot and seeing who takes advantage of it and roots the machine. Link: https://lkml.kernel.org/r/20210818153038.864149276@goodmis.org Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loopPaul E. McKenney
[ Upstream commit 0db7c32ad3160ae06f497d48a74bd46a2a35e6bf ] Early in debugging, it made some sense to differentiate the first iteration from subsequent iterations, but now this just causes confusion. This commit therefore moves the "set_tasks_gp_state(rtp, RTGS_WAIT_CBS)" statement to the beginning of the "for" loop in rcu_tasks_kthread(). Reported-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18locking/lockdep: Avoid RCU-induced noinstr failPeter Zijlstra
[ Upstream commit ce0b9c805dd66d5e49fd53ec5415ae398f4c56e6 ] vmlinux.o: warning: objtool: look_up_lock_class()+0xc7: call to rcu_read_lock_any_held() leaves .noinstr.text section Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210624095148.311980536@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18rcutorture: Avoid problematic critical section nesting on PREEMPT_RTScott Wood
[ Upstream commit 71921a9606ddbcc1d98c00eca7ae82c373d1fecd ] rcutorture is generating some nesting scenarios that are not compatible on PREEMPT_RT. For example: preempt_disable(); rcu_read_lock_bh(); preempt_enable(); rcu_read_unlock_bh(); The problem here is that on PREEMPT_RT the bottom halves have to be disabled and enabled in preemptible context. Reorder locking: start with BH locking and continue with then with disabling preemption or interrupts. In the unlocking do it reverse by first enabling interrupts and preemption and BH at the very end. Ensure that on PREEMPT_RT BH locking remains unchanged if in non-preemptible context. Link: https://lkml.kernel.org/r/20190911165729.11178-6-swood@redhat.com Link: https://lkml.kernel.org/r/20210819182035.GF4126399@paulmck-ThinkPad-P17-Gen-1 Signed-off-by: Scott Wood <swood@redhat.com> [bigeasy: Drop ATOM_BH, make it only about changing BH in atomic context. Allow enabling RCU in IRQ-off section. Reword commit message.] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-18ring-buffer: Protect ring_buffer_reset() from reentrancySteven Rostedt (VMware)
commit 51d157946666382e779f94c39891e8e9a020da78 upstream. The resetting of the entire ring buffer use to simply go through and reset each individual CPU buffer that had its own protection and synchronization. But this was very slow, due to performing a synchronization for each CPU. The code was reshuffled to do one disabling of all CPU buffers, followed by a single RCU synchronization, and then the resetting of each of the CPU buffers. But unfortunately, the mutex that prevented multiple occurrences of resetting the buffer was not moved to the upper function, and there is nothing to protect from it. Take the ring buffer mutex around the global reset. Cc: stable@vger.kernel.org Fixes: b23d7a5f4a07a ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU") Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18signal: Add SA_IMMUTABLE to ensure forced siganls do not get changedEric W. Biederman
commit 00b06da29cf9dc633cdba87acd3f57f4df3fd5c7 upstream. As Andy pointed out that there are races between force_sig_info_to_task and sigaction[1] when force_sig_info_task. As Kees discovered[2] ptrace is also able to change these signals. In the case of seeccomp killing a process with a signal it is a security violation to allow the signal to be caught or manipulated. Solve this problem by introducing a new flag SA_IMMUTABLE that prevents sigaction and ptrace from modifying these forced signals. This flag is carefully made kernel internal so that no new ABI is introduced. Longer term I think this can be solved by guaranteeing short circuit delivery of signals in this case. Unfortunately reliable and guaranteed short circuit delivery of these signals is still a ways off from being implemented, tested, and merged. So I have implemented a much simpler alternative for now. [1] https://lkml.kernel.org/r/b5d52d25-7bde-4030-a7b1-7c6f8ab90660@www.fastmail.com [2] https://lkml.kernel.org/r/202110281136.5CE65399A7@keescook Cc: stable@vger.kernel.org Fixes: 307d522f5eb8 ("signal/seccomp: Refactor seccomp signal and coredump generation") Tested-by: Andrea Righi <andrea.righi@canonical.com> Tested-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18signal: Remove the bogus sigkill_pending in ptrace_stopEric W. Biederman
commit 7d613f9f72ec8f90ddefcae038fdae5adb8404b3 upstream. The existence of sigkill_pending is a little silly as it is functionally a duplicate of fatal_signal_pending that is used in exactly one place. Checking for pending fatal signals and returning early in ptrace_stop is actively harmful. It casues the ptrace_stop called by ptrace_signal to return early before setting current->exit_code. Later when ptrace_signal reads the signal number from current->exit_code is undefined, making it unpredictable what will happen. Instead rely on the fact that schedule will not sleep if there is a pending signal that can awaken a task. Removing the explict sigkill_pending test fixes fixes ptrace_signal when ptrace_stop does not stop because current->exit_code is always set to to signr. Cc: stable@vger.kernel.org Fixes: 3d749b9e676b ("ptrace: simplify ptrace_stop()->sigkill_pending() path") Fixes: 1a669c2f16d4 ("Add arch_ptrace_stop") Link: https://lkml.kernel.org/r/87pmsyx29t.fsf@disp2133 Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-29Merge tag 'trace-v5.15-rc6-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing comment fixes from Steven Rostedt: - Some bots have informed me that some of the ftrace functions kernel-doc has formatting issues. - Also, fix my snake instinct. * tag 'trace-v5.15-rc6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix misspelling of "missing" ftrace: Fix kernel-doc formatting issues
2021-10-29tracing: Fix misspelling of "missing"Steven Rostedt (VMware)
My snake instinct was on and I wrote "misssing" instead of "missing". Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-29ftrace: Fix kernel-doc formatting issuesSteven Rostedt (VMware)
Some functions had kernel-doc that used a comma instead of a hash to separate the function name from the one line description. Also, the "ftrace_is_dead()" had an incomplete description. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-28Merge tag 'net-5.15-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from WiFi (mac80211), and BPF. Current release - regressions: - skb_expand_head: adjust skb->truesize to fix socket memory accounting - mptcp: fix corrupt receiver key in MPC + data + checksum Previous releases - regressions: - multicast: calculate csum of looped-back and forwarded packets - cgroup: fix memory leak caused by missing cgroup_bpf_offline - cfg80211: fix management registrations locking, prevent list corruption - cfg80211: correct false positive in bridge/4addr mode check - tcp_bpf: fix race in the tcp_bpf_send_verdict resulting in reusing previous verdict Previous releases - always broken: - sctp: enhancements for the verification tag, prevent attackers from killing SCTP sessions - tipc: fix size validations for the MSG_CRYPTO type - mac80211: mesh: fix HE operation element length check, prevent out of bound access - tls: fix sign of socket errors, prevent positive error codes being reported from read()/write() - cfg80211: scan: extend RCU protection in cfg80211_add_nontrans_list() - implement ->sock_is_readable() for UDP and AF_UNIX, fix poll() for sockets in a BPF sockmap - bpf: fix potential race in tail call compatibility check resulting in two operations which would make the map incompatible succeeding - bpf: prevent increasing bpf_jit_limit above max - bpf: fix error usage of map_fd and fdget() in generic batch update - phy: ethtool: lock the phy for consistency of results - prevent infinite while loop in skb_tx_hash() when Tx races with driver reconfiguring the queue <> traffic class mapping - usbnet: fixes for bad HW conjured by syzbot - xen: stop tx queues during live migration, prevent UAF - net-sysfs: initialize uid and gid before calling net_ns_get_ownership - mlxsw: prevent Rx stalls under memory pressure" * tag 'net-5.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits) Revert "net: hns3: fix pause config problem after autoneg disabled" mptcp: fix corrupt receiver key in MPC + data + checksum riscv, bpf: Fix potential NULL dereference octeontx2-af: Fix possible null pointer dereference. octeontx2-af: Display all enabled PF VF rsrc_alloc entries. octeontx2-af: Check whether ipolicers exists net: ethernet: microchip: lan743x: Fix skb allocation failure net/tls: Fix flipped sign in async_wait.err assignment net/tls: Fix flipped sign in tls_err_abort() calls net/smc: Correct spelling mistake to TCPF_SYN_RECV net/smc: Fix smc_link->llc_testlink_time overflow nfp: bpf: relax prog rejection for mtu check through max_pkt_offset vmxnet3: do not stop tx queues after netif_device_detach() r8169: Add device 10ec:8162 to driver r8169 ptp: Document the PTP_CLK_MAGIC ioctl number usbnet: fix error return code in usbnet_probe() net: hns3: adjust string spaces of some parameters of tx bd info in debugfs net: hns3: expand buffer len for some debugfs command net: hns3: add more string spaces for dumping packets number of queue info in debugfs net: hns3: fix data endian problem of some functions of debugfs ...
2021-10-28Merge tag 'trace-v5.15-rc6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Do not WARN when attaching event probe to non-existent event If the user tries to attach an event probe (eprobe) to an event that does not exist, it will trigger a warning. There's an error check that only expects memory issues otherwise it is considered a bug. But changes in the code to move around the locking made it that it can error out if the user attempts to attach to an event that does not exist, returning an -ENODEV. As this path can be caused by user space putting in a bad value, do not trigger a WARN" * tag 'trace-v5.15-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Do not warn when connecting eprobe to non existing event
2021-10-27tracing: Do not warn when connecting eprobe to non existing eventSteven Rostedt (VMware)
When the syscall trace points are not configured in, the kselftests for ftrace will try to attach an event probe (eprobe) to one of the system call trace points. This triggered a WARNING, because the failure only expects to see memory issues. But this is not the only failure. The user may attempt to attach to a non existent event, and the kernel must not warn about it. Link: https://lkml.kernel.org/r/20211027120854.0680aa0f@gandalf.local.home Fixes: 7491e2c442781 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-26Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf 2021-10-26 We've added 12 non-merge commits during the last 7 day(s) which contain a total of 23 files changed, 118 insertions(+), 98 deletions(-). The main changes are: 1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen. 2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang. 3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai. 4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer. 5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun. 6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian. 7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix potential race in tail call compatibility check bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET selftests/bpf: Use recv_timeout() instead of retries net: Implement ->sock_is_readable() for UDP and AF_UNIX skmsg: Extract and reuse sk_msg_is_readable() net: Rename ->stream_memory_read to ->sock_is_readable tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function cgroup: Fix memory leak caused by missing cgroup_bpf_offline bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch() bpf: Prevent increasing bpf_jit_limit above max bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT bpf: Define bpf_jit_alloc_exec_limit for riscv JIT ==================== Link: https://lore.kernel.org/r/20211026201920.11296-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26bpf: Fix potential race in tail call compatibility checkToke Høiland-Jørgensen
Lorenzo noticed that the code testing for program type compatibility of tail call maps is potentially racy in that two threads could encounter a map with an unset type simultaneously and both return true even though they are inserting incompatible programs. The race window is quite small, but artificially enlarging it by adding a usleep_range() inside the check in bpf_prog_array_compatible() makes it trivial to trigger from userspace with a program that does, essentially: map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0); pid = fork(); if (pid) { key = 0; value = xdp_fd; } else { key = 1; value = tc_fd; } err = bpf_map_update_elem(map_fd, &key, &value, 0); While the race window is small, it has potentially serious ramifications in that triggering it would allow a BPF program to tail call to a program of a different type. So let's get rid of it by protecting the update with a spinlock. The commit in the Fixes tag is the last commit that touches the code in question. v2: - Use a spinlock instead of an atomic variable and cmpxchg() (Alexei) v3: - Put lock and the members it protects into an embedded 'owner' struct (Daniel) Fixes: 3324b584b6f6 ("ebpf: misc core cleanup") Reported-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211026110019.363464-1-toke@redhat.com
2021-10-24Merge tag 'sched_urgent_for_v5.15_rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: "Reset clang's Shadow Call Stack on hotplug to prevent it from overflowing" * tag 'sched_urgent_for_v5.15_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/scs: Reset the shadow stack when idle_task_exit
2021-10-22cgroup: Fix memory leak caused by missing cgroup_bpf_offlineQuanyang Wang
When enabling CONFIG_CGROUP_BPF, kmemleak can be observed by running the command as below: $mount -t cgroup -o none,name=foo cgroup cgroup/ $umount cgroup/ unreferenced object 0xc3585c40 (size 64): comm "mount", pid 425, jiffies 4294959825 (age 31.990s) hex dump (first 32 bytes): 01 00 00 80 84 8c 28 c0 00 00 00 00 00 00 00 00 ......(......... 00 00 00 00 00 00 00 00 6c 43 a0 c3 00 00 00 00 ........lC...... backtrace: [<e95a2f9e>] cgroup_bpf_inherit+0x44/0x24c [<1f03679c>] cgroup_setup_root+0x174/0x37c [<ed4b0ac5>] cgroup1_get_tree+0x2c0/0x4a0 [<f85b12fd>] vfs_get_tree+0x24/0x108 [<f55aec5c>] path_mount+0x384/0x988 [<e2d5e9cd>] do_mount+0x64/0x9c [<208c9cfe>] sys_mount+0xfc/0x1f4 [<06dd06e0>] ret_fast_syscall+0x0/0x48 [<a8308cb3>] 0xbeb4daa8 This is because that since the commit 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path") root_cgrp->bpf.refcnt.data is allocated by the function percpu_ref_init in cgroup_bpf_inherit which is called by cgroup_setup_root when mounting, but not freed along with root_cgrp when umounting. Adding cgroup_bpf_offline which calls percpu_ref_kill to cgroup_kill_sb can free root_cgrp->bpf.refcnt.data in umount path. This patch also fixes the commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"). A cgroup_bpf_offline is needed to do a cleanup that frees the resources which are allocated by cgroup_bpf_inherit in cgroup_setup_root. And inside cgroup_bpf_offline, cgroup_get() is at the beginning and cgroup_put is at the end of cgroup_bpf_release which is called by cgroup_bpf_offline. So cgroup_bpf_offline can keep the balance of cgroup's refcount. Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path") Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20211018075623.26884-1-quanyang.wang@windriver.com
2021-10-22bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()Xu Kuohai
1. The ufd in generic_map_update_batch() should be read from batch.map_fd; 2. A call to fdget() should be followed by a symmetric call to fdput(). Fixes: aa2e93b8e58e ("bpf: Add generic support for update and delete batch ops") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211019032934.1210517-1-xukuohai@huawei.com
2021-10-22bpf: Prevent increasing bpf_jit_limit above maxLorenz Bauer
Restrict bpf_jit_limit to the maximum supported by the arch's JIT. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20211014142554.53120-4-lmb@cloudflare.com
2021-10-21Merge branch 'ucount-fixes-for-v5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ucounts fixes from Eric Biederman: "There has been one very hard to track down bug in the ucount code that we have been tracking since roughly v5.14 was released. Alex managed to find a reliable reproducer a few days ago and then I was able to instrument the code and figure out what the issue was. It turns out the sigqueue_alloc single atomic operation optimization did not play nicely with ucounts multiple level rlimits. It turned out that either sigqueue_alloc or sigqueue_free could be operating on multiple levels and trigger the conditions for the optimization on more than one level at the same time. To deal with that situation I have introduced inc_rlimit_get_ucounts and dec_rlimit_put_ucounts that just focuses on the optimization and the rlimit and ucount changes. While looking into the big bug I found I couple of other little issues so I am including those fixes here as well. When I have time I would very much like to dig into process ownership of the shared signal queue and see if we could pick a single owner for the entire queue so that all of the rlimits can count to that owner. That should entirely remove the need to call get_ucounts and put_ucounts in sigqueue_alloc and sigqueue_free. It is difficult because Linux unlike POSIX supports setuid that works on a single thread" * 'ucount-fixes-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyring ucounts: Proper error handling in set_cred_ucounts ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucoutns in commit_creds ucounts: Fix signal ucount refcounting
2021-10-20Merge tag 'dma-mapping-5.15-2' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds
Pull dma-mapping fixes from Christoph Hellwig: - fix more dma-debug fallout (Gerald Schaefer, Hamza Mahfooz) - fix a kerneldoc warning (Logan Gunthorpe) * tag 'dma-mapping-5.15-2' of git://git.infradead.org/users/hch/dma-mapping: dma-debug: teach add_dma_entry() about DMA_ATTR_SKIP_CPU_SYNC dma-debug: fix sg checks in debug_dma_map_sg() dma-mapping: fix the kerneldoc for dma_map_sgtable()
2021-10-20Merge tag 'audit-pr-20211019' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fix from Paul Moore: "One small audit patch to add a pointer NULL check" * tag 'audit-pr-20211019' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: fix possible null-pointer dereference in audit_filter_rules
2021-10-20Merge tag 'trace-v5.15-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Recursion fix for tracing. While cleaning up some of the tracing recursion protection logic, I discovered a scenario that the current design would miss, and would allow an infinite recursion. Removing an optimization trick that opened the hole fixes the issue and cleans up the code as well" * tag 'trace-v5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Have all levels of checks prevent recursion
2021-10-20ucounts: Move get_ucounts from cred_alloc_blank to key_change_session_keyringEric W. Biederman
Setting cred->ucounts in cred_alloc_blank does not make sense. The uid and user_ns are deliberately not set in cred_alloc_blank but instead the setting is delayed until key_change_session_keyring. So move dealing with ucounts into key_change_session_keyring as well. Unfortunately that movement of get_ucounts adds a new failure mode to key_change_session_keyring. I do not see anything stopping the parent process from calling setuid and changing the relevant part of it's cred while keyctl_session_to_parent is running making it fundamentally necessary to call get_ucounts in key_change_session_keyring. Which means that the new failure mode cannot be avoided. A failure of key_change_session_keyring results in a single threaded parent keeping it's existing credentials. Which results in the parent process not being able to access the session keyring and whichever keys are in the new keyring. Further get_ucounts is only expected to fail if the number of bits in the refernece count for the structure is too few. Since the code has no other way to report the failure of get_ucounts and because such failures are not expected to be common add a WARN_ONCE to report this problem to userspace. Between the WARN_ONCE and the parent process not having access to the keys in the new session keyring I expect any failure of get_ucounts will be noticed and reported and we can find another way to handle this condition. (Possibly by just making ucounts->count an atomic_long_t). Cc: stable@vger.kernel.org Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Link: https://lkml.kernel.org/r/7k0ias0uf.fsf_-_@disp2133 Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-19ucounts: Proper error handling in set_cred_ucountsEric W. Biederman
Instead of leaking the ucounts in new if alloc_ucounts fails, store the result of alloc_ucounts into a temporary variable, which is later assigned to new->ucounts. Cc: stable@vger.kernel.org Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Link: https://lkml.kernel.org/r/87pms2s0v8.fsf_-_@disp2133 Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-19ucounts: Pair inc_rlimit_ucounts with dec_rlimit_ucoutns in commit_credsEric W. Biederman
The purpose of inc_rlimit_ucounts and dec_rlimit_ucounts in commit_creds is to change which rlimit counter is used to track a process when the credentials changes. Use the same test for both to guarantee the tracking is correct. Cc: stable@vger.kernel.org Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Link: https://lkml.kernel.org/r/87v91us0w4.fsf_-_@disp2133 Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Alexey Gladkov <legion@kernel.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-19sched/scs: Reset the shadow stack when idle_task_exitWoody Lin
Commit f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") removed the init_idle() call from idle_thread_get(). This was the sole call-path on hotplug that resets the Shadow Call Stack (scs) Stack Pointer (sp). Not resetting the scs-sp leads to scs overflow after enough hotplug cycles. Therefore add an explicit scs_task_reset() to the hotplug code to make sure the scs-sp does get reset on hotplug. Fixes: f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled") Signed-off-by: Woody Lin <woodylin@google.com> [peterz: Changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20211012083521.973587-1-woodylin@google.com
2021-10-18audit: fix possible null-pointer dereference in audit_filter_rulesGaosheng Cui
Fix possible null-pointer dereference in audit_filter_rules. audit_filter_rules() error: we previously assumed 'ctx' could be null Cc: stable@vger.kernel.org Fixes: bf361231c295 ("audit: add saddr_fam filter field") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-10-18tracing: Have all levels of checks prevent recursionSteven Rostedt (VMware)
While writing an email explaining the "bit = 0" logic for a discussion on making ftrace_test_recursion_trylock() disable preemption, I discovered a path that makes the "not do the logic if bit is zero" unsafe. The recursion logic is done in hot paths like the function tracer. Thus, any code executed causes noticeable overhead. Thus, tricks are done to try to limit the amount of code executed. This included the recursion testing logic. Having recursion testing is important, as there are many paths that can end up in an infinite recursion cycle when tracing every function in the kernel. Thus protection is needed to prevent that from happening. Because it is OK to recurse due to different running context levels (e.g. an interrupt preempts a trace, and then a trace occurs in the interrupt handler), a set of bits are used to know which context one is in (normal, softirq, irq and NMI). If a recursion occurs in the same level, it is prevented*. Then there are infrastructure levels of recursion as well. When more than one callback is attached to the same function to trace, it calls a loop function to iterate over all the callbacks. Both the callbacks and the loop function have recursion protection. The callbacks use the "ftrace_test_recursion_trylock()" which has a "function" set of context bits to test, and the loop function calls the internal trace_test_and_set_recursion() directly, with an "internal" set of bits. If an architecture does not implement all the features supported by ftrace then the callbacks are never called directly, and the loop function is called instead, which will implement the features of ftrace. Since both the loop function and the callbacks do recursion protection, it was seemed unnecessary to do it in both locations. Thus, a trick was made to have the internal set of recursion bits at a more significant bit location than the function bits. Then, if any of the higher bits were set, the logic of the function bits could be skipped, as any new recursion would first have to go through the loop function. This is true for architectures that do not support all the ftrace features, because all functions being traced must first go through the loop function before going to the callbacks. But this is not true for architectures that support all the ftrace features. That's because the loop function could be called due to two callbacks attached to the same function, but then a recursion function inside the callback could be called that does not share any other callback, and it will be called directly. i.e. traced_function_1: [ more than one callback tracing it ] call loop_func loop_func: trace_recursion set internal bit call callback callback: trace_recursion [ skipped because internal bit is set, return 0 ] call traced_function_2 traced_function_2: [ only traced by above callback ] call callback callback: trace_recursion [ skipped because internal bit is set, return 0 ] call traced_function_2 [ wash, rinse, repeat, BOOM! out of shampoo! ] Thus, the "bit == 0 skip" trick is not safe, unless the loop function is call for all functions. Since we want to encourage architectures to implement all ftrace features, having them slow down due to this extra logic may encourage the maintainers to update to the latest ftrace features. And because this logic is only safe for them, remove it completely. [*] There is on layer of recursion that is allowed, and that is to allow for the transition between interrupt context (normal -> softirq -> irq -> NMI), because a trace may occur before the context update is visible to the trace recursion logic. Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d994@linux.alibaba.com/ Link: https://lkml.kernel.org/r/20211018154412.09fcad3c@gandalf.local.home Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@hansenpartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Jisheng Zhang <jszhang@kernel.org> Cc: =?utf-8?b?546L6LSH?= <yun.wang@linux.alibaba.com> Cc: Guo Ren <guoren@kernel.org> Cc: stable@vger.kernel.org Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-18ucounts: Fix signal ucount refcountingEric W. Biederman
In commit fda31c50292a ("signal: avoid double atomic counter increments for user accounting") Linus made a clever optimization to how rlimits and the struct user_struct. Unfortunately that optimization does not work in the obvious way when moved to nested rlimits. The problem is that the last decrement of the per user namespace per user sigpending counter might also be the last decrement of the sigpending counter in the parent user namespace as well. Which means that simply freeing the leaf ucount in __free_sigqueue is not enough. Maintain the optimization and handle the tricky cases by introducing inc_rlimit_get_ucounts and dec_rlimit_put_ucounts. By moving the entire optimization into functions that perform all of the work it becomes possible to ensure that every level is handled properly. The new function inc_rlimit_get_ucounts returns 0 on failure to increment the ucount. This is different than inc_rlimit_ucounts which increments the ucounts and returns LONG_MAX if the ucount counter has exceeded it's maximum or it wrapped (to indicate the counter needs to decremented). I wish we had a single user to account all pending signals to across all of the threads of a process so this complexity was not necessary Cc: stable@vger.kernel.org Fixes: d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts") v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133 Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133 Reviewed-by: Alexey Gladkov <legion@kernel.org> Tested-by: Rune Kleveland <rune.kleveland@infomedia.dk> Tested-by: Yu Zhao <yuzhao@google.com> Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-10-18dma-debug: teach add_dma_entry() about DMA_ATTR_SKIP_CPU_SYNCHamza Mahfooz
Mapping something twice should be possible as long as, DMA_ATTR_SKIP_CPU_SYNC is passed to the strictly speaking second relevant mapping operation (that attempts to map the same thing). So, don't issue a warning if the specified condition is met in add_dma_entry(). Signed-off-by: Hamza Mahfooz <someguy@effective-light.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-16Merge tag 'trace-v5.15-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Tracing fixes for 5.15: - Fix defined but not use warning/error for osnoise function - Fix memory leak in event probe - Fix memblock leak in bootconfig - Fix the API of event probes to be like kprobes - Added test to check removal of event probe API - Fix recordmcount.pl for nds32 failed build * tag 'trace-v5.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: nds32/ftrace: Fix Error: invalid operands (*UND* and *UND* sections) for `^' selftests/ftrace: Update test for more eprobe removal process tracing: Fix event probe removal from dynamic events tracing: Fix missing * in comment block bootconfig: init: Fix memblock leak in xbc_make_cmdline() tracing: Fix memory leak in eprobe_register() tracing: Fix missing osnoise tracer on max_latency
2021-10-13tracing: Fix event probe removal from dynamic eventsSteven Rostedt (VMware)
When an event probe is to be removed via the API that created it via the dynamic events, an -ENOENT error is returned. This is because the removal of the event probe does not expect to see the event system and name that the event probe is attached to, even though that's part of the API to create it. As the removal of probes is to use the same API as they are created. In fact, the removal is not consistent with the kprobes and uprobes removal. Fix that by allowing various ways to remove the eprobe. The eprobe is created with: e:[GROUP/]NAME SYSTEM/EVENT [OPTIONS] Have it get removed by echoing in the following into dynamic_events: # Remove all eprobes with NAME echo '-:NAME' >> dynamic_events # Remove a specific eprobe echo '-:GROUP/NAME' >> dynamic_events echo '-:GROUP/NAME SYSTEM/EVENT' >> dynamic_events echo '-:NAME SYSTEM/EVENT' >> dynamic_events echo '-:GROUP/NAME SYSTEM/EVENT OPTIONS' >> dynamic_events echo '-:NAME SYSTEM/EVENT OPTIONS' >> dynamic_events Link: https://lkml.kernel.org/r/20211012081925.0e19cc4f@gandalf.local.home Link: https://lkml.kernel.org/r/20211013205533.630722129@goodmis.org Suggested-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 7491e2c442781 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-13Merge tag 'modules-for-v5.15-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules fix from Jessica Yu: - Build fix for cfi_init() when CONFIG_MODULE_UNLOAD=n * tag 'modules-for-v5.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: module: fix clang CFI with MODULE_UNLOAD=n
2021-10-11Merge branch 'for-5.15-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "All documentation / comment updates" * 'for-5.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroupv2, docs: fix misinformation in "device controller" section cgroup/cpuset: Change references of cpuset_mutex to cpuset_rwsem docs/cgroup: remove some duplicate words
2021-10-11Merge branch 'for-5.15-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "One patch to add a missing __printf annotation and the other to enable deferred printing for debug dumps to avoid deadlocks when triggered from some contexts (e.g. console drivers)" * 'for-5.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix state-dump console deadlock workqueue: annotate alloc_workqueue() as printf
2021-10-11workqueue: fix state-dump console deadlockJohan Hovold
Console drivers often queue work while holding locks also taken in their console write paths, something which can lead to deadlocks on SMP when dumping workqueue state (e.g. sysrq-t or on suspend failures). For serial console drivers this could look like: CPU0 CPU1 ---- ---- show_workqueue_state(); lock(&pool->lock); <IRQ> lock(&port->lock); schedule_work(); lock(&pool->lock); printk(); lock(console_owner); lock(&port->lock); where workqueues are, for example, used to push data to the line discipline, process break signals and handle modem-status changes. Line disciplines and serdev drivers can also queue work on write-wakeup notifications, etc. Reworking every console driver to avoid queuing work while holding locks also taken in their write paths would complicate drivers and is neither desirable or feasible. Instead use the deferred-printk mechanism to avoid printing while holding pool locks when dumping workqueue state. Note that there are a few WARN_ON() assertions in the workqueue code which could potentially also trigger a deadlock. Hopefully the ongoing printk rework will provide a general solution for this eventually. This was originally reported after a lockdep splat when executing sysrq-t with the imx serial driver. Fixes: 3494fc30846d ("workqueue: dump workqueues on sysrq-t") Cc: stable@vger.kernel.org # 4.0 Reported-by: Fabio Estevam <festevam@denx.de> Tested-by: Fabio Estevam <festevam@denx.de> Signed-off-by: Johan Hovold <johan@kernel.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-10-11dma-debug: fix sg checks in debug_dma_map_sg()Gerald Schaefer
The following warning occurred sporadically on s390: DMA-API: nvme 0006:00:00.0: device driver maps memory from kernel text or rodata [addr=0000000048cc5e2f] [len=131072] WARNING: CPU: 4 PID: 825 at kernel/dma/debug.c:1083 check_for_illegal_area+0xa8/0x138 It is a false-positive warning, due to broken logic in debug_dma_map_sg(). check_for_illegal_area() checks for overlay of sg elements with kernel text or rodata. It is called with sg_dma_len(s) instead of s->length as parameter. After the call to ->map_sg(), sg_dma_len() will contain the length of possibly combined sg elements in the DMA address space, and not the individual sg element length, which would be s->length. The check will then use the physical start address of an sg element, and add the DMA length for the overlap check, which could result in the false warning, because the DMA length can be larger than the actual single sg element length. In addition, the call to check_for_illegal_area() happens in the iteration over mapped_ents, which will not include all individual sg elements if any of them were combined in ->map_sg(). Fix this by using s->length instead of sg_dma_len(s). Also put the call to check_for_illegal_area() in a separate loop, iterating over all the individual sg elements ("nents" instead of "mapped_ents"). While at it, as suggested by Robin Murphy, also move check_for_stack() inside the new loop, as it is similarly concerned with validating the individual sg elements. Link: https://lore.kernel.org/lkml/20210705185252.4074653-1-gerald.schaefer@linux.ibm.com Fixes: 884d05970bfb ("dma-debug: use sg_dma_len accessor") Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-11dma-mapping: fix the kerneldoc for dma_map_sgtable()Logan Gunthorpe
htmldocs began producing the following warnings: kernel/dma/mapping.c:256: WARNING: Definition list ends without a blank line; unexpected unindent. kernel/dma/mapping.c:257: WARNING: Bullet list ends without a blank line; unexpected unindent. Reformatting the list without hyphens fixes the warnings and produces both a readable text and HTML output. Fixes: fffe3cc8c219 ("dma-mapping: allow map_sg() ops to return negative error code") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-10tracing: Fix missing * in comment blockColin Ian King
There is a missing * in a comment block, add it in. Link: https://lkml.kernel.org/r/20211006172830.1025336-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-10tracing: Fix memory leak in eprobe_register()Vamshi K Sthambamkadi
kmemleak report: unreferenced object 0xffff900a70ec7ec0 (size 32): comm "ftracetest", pid 2770, jiffies 4295042510 (age 311.464s) hex dump (first 32 bytes): c8 31 23 45 0a 90 ff ff 40 85 c7 6e 0a 90 ff ff .1#E....@..n.... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<000000009d3751fd>] kmem_cache_alloc_trace+0x2a2/0x440 [<0000000088b8124b>] eprobe_register+0x1e3/0x350 [<000000002a9a0517>] __ftrace_event_enable_disable+0x7c/0x240 [<0000000019109321>] event_enable_write+0x93/0xe0 [<000000007d85b320>] vfs_write+0xb9/0x260 [<00000000b94c5e41>] ksys_write+0x67/0xe0 [<000000005a08c81d>] __x64_sys_write+0x1a/0x20 [<00000000240bf576>] do_syscall_64+0x3b/0xc0 [<0000000043d5d9f6>] entry_SYSCALL_64_after_hwframe+0x44/0xae unreferenced object 0xffff900a56bbf280 (size 128): comm "ftracetest", pid 2770, jiffies 4295042510 (age 311.464s) hex dump (first 32 bytes): ff ff ff ff ff ff ff ff 00 00 00 00 01 00 00 00 ................ 80 69 3b b2 ff ff ff ff 20 69 3b b2 ff ff ff ff .i;..... i;..... backtrace: [<000000009d3751fd>] kmem_cache_alloc_trace+0x2a2/0x440 [<00000000c4e90fad>] eprobe_register+0x1fc/0x350 [<000000002a9a0517>] __ftrace_event_enable_disable+0x7c/0x240 [<0000000019109321>] event_enable_write+0x93/0xe0 [<000000007d85b320>] vfs_write+0xb9/0x260 [<00000000b94c5e41>] ksys_write+0x67/0xe0 [<000000005a08c81d>] __x64_sys_write+0x1a/0x20 [<00000000240bf576>] do_syscall_64+0x3b/0xc0 [<0000000043d5d9f6>] entry_SYSCALL_64_after_hwframe+0x44/0xae In new_eprobe_trigger(), allocated edata and trigger variables are never freed. To fix, free memory in disable_eprobe(). Link: https://lkml.kernel.org/r/20211008071802.GA2098@cosmos Fixes: 7491e2c442781 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-07Merge tag 'net-5.15-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from xfrm, bpf, netfilter, and wireless. Current release - regressions: - xfrm: fix XFRM_MSG_MAPPING ABI breakage caused by inserting a new value in the middle of an enum - unix: fix an issue in unix_shutdown causing the other end read/write failures - phy: mdio: fix memory leak Current release - new code bugs: - mlx5e: improve MQPRIO resiliency against bad configs Previous releases - regressions: - bpf: fix integer overflow leading to OOB access in map element pre-allocation - stmmac: dwmac-rk: fix ethernet on rk3399 based devices - netfilter: conntrack: fix boot failure with nf_conntrack.enable_hooks=1 - brcmfmac: revert using ISO3166 country code and 0 rev as fallback - i40e: fix freeing of uninitialized misc IRQ vector - iavf: fix double unlock of crit_lock Previous releases - always broken: - bpf, arm: fix register clobbering in div/mod implementation - netfilter: nf_tables: correct issues in netlink rule change event notifications - dsa: tag_dsa: fix mask for trunked packets - usb: r8152: don't resubmit rx immediately to avoid soft lockup on device unplug - i40e: fix endless loop under rtnl if FW fails to correctly respond to capability query - mlx5e: fix rx checksum offload coexistence with ipsec offload - mlx5: force round second at 1PPS out start time and allow it only in supported clock modes - phy: pcs: xpcs: fix incorrect CL37 AN sequence, EEE disable sequence Misc: - xfrm: slightly rejig the new policy uAPI to make it less cryptic" * tag 'net-5.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits) net: prefer socket bound to interface when not in VRF iavf: fix double unlock of crit_lock i40e: Fix freeing of uninitialized misc IRQ vector i40e: fix endless loop under rtnl dt-bindings: net: dsa: marvell: fix compatible in example ionic: move filter sync_needed bit set gve: report 64bit tx_bytes counter from gve_handle_report_stats() gve: fix gve_get_stats() rtnetlink: fix if_nlmsg_stats_size() under estimation gve: Properly handle errors in gve_assign_qpl gve: Avoid freeing NULL pointer gve: Correct available tx qpl check unix: Fix an issue in unix_shutdown causing the other end read/write failures net: stmmac: trigger PCS EEE to turn off on link down net: pcs: xpcs: fix incorrect steps on disable EEE netlink: annotate data races around nlk->bound net: pcs: xpcs: fix incorrect CL37 AN sequence net: sfp: Fix typo in state machine debug string net/sched: sch_taprio: properly cancel timer from taprio_destroy() net: bridge: fix under estimation in br_get_linkxstats_size() ...
2021-10-07Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf 2021-10-07 We've added 7 non-merge commits during the last 8 day(s) which contain a total of 8 files changed, 38 insertions(+), 21 deletions(-). The main changes are: 1) Fix ARM BPF JIT to preserve caller-saved regs for DIV/MOD JIT-internal helper call, from Johan Almbladh. 2) Fix integer overflow in BPF stack map element size calculation when used with preallocation, from Tatsuhiko Yasumatsu. 3) Fix an AF_UNIX regression due to added BPF sockmap support related to shutdown handling, from Jiang Wang. 4) Fix a segfault in libbpf when generating light skeletons from objects without BTF, from Kumar Kartikeya Dwivedi. 5) Fix a libbpf memory leak in strset to free the actual struct strset itself, from Andrii Nakryiko. 6) Dual-license bpf_insn.h similarly as we did for libbpf and bpftool, with ACKs from all contributors, from Luca Boccassi. ==================== Link: https://lore.kernel.org/r/20211007135010.21143-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-07tracing: Fix missing osnoise tracer on max_latencyJackie Liu
The compiler warns when the data are actually unused: kernel/trace/trace.c:1712:13: error: ‘trace_create_maxlat_file’ defined but not used [-Werror=unused-function] 1712 | static void trace_create_maxlat_file(struct trace_array *tr, | ^~~~~~~~~~~~~~~~~~~~~~~~ [Why] CONFIG_HWLAT_TRACER=n, CONFIG_TRACER_MAX_TRACE=n, CONFIG_OSNOISE_TRACER=y gcc report warns. [How] Now trace_create_maxlat_file will only take effect when CONFIG_HWLAT_TRACER=y or CONFIG_TRACER_MAX_TRACE=y. In fact, after adding osnoise trace, it also needs to take effect. Link: https://lore.kernel.org/all/c1d9e328-ad7c-920b-6c24-9e1598a6421c@infradead.org/ Link: https://lkml.kernel.org/r/20210922025122.3268022-1-liu.yun@linux.dev Fixes: bce29ac9ce0b ("trace: Add osnoise tracer") Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-03Merge tag 'sched_urgent_for_v5.15_rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Tell the compiler to always inline is_percpu_thread() - Make sure tunable_scaling buffer is null-terminated after an update in sysfs - Fix LTP named regression due to cgroup list ordering * tag 'sched_urgent_for_v5.15_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Always inline is_percpu_thread() sched/fair: Null terminate buffer when updating tunable_scaling sched/fair: Add ancestors of unthrottled undecayed cfs_rq
2021-10-03Merge tag 'perf_urgent_for_v5.15_rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Make sure the destroy callback is reset when a event initialization fails - Update the event constraints for Icelake - Make sure the active time of an event is updated even for inactive events * tag 'perf_urgent_for_v5.15_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: fix userpage->time_enabled of inactive events perf/x86/intel: Update event constraints for ICX perf/x86: Reset destroy callback on event init failure
2021-10-01sched/fair: Null terminate buffer when updating tunable_scalingMel Gorman
This patch null-terminates the temporary buffer in sched_scaling_write() so kstrtouint() does not return failure and checks the value is valid. Before: $ cat /sys/kernel/debug/sched/tunable_scaling 1 $ echo 0 > /sys/kernel/debug/sched/tunable_scaling -bash: echo: write error: Invalid argument $ cat /sys/kernel/debug/sched/tunable_scaling 1 After: $ cat /sys/kernel/debug/sched/tunable_scaling 1 $ echo 0 > /sys/kernel/debug/sched/tunable_scaling $ cat /sys/kernel/debug/sched/tunable_scaling 0 $ echo 3 > /sys/kernel/debug/sched/tunable_scaling -bash: echo: write error: Invalid argument Fixes: 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210927114635.GH3959@techsingularity.net
2021-10-01sched/fair: Add ancestors of unthrottled undecayed cfs_rqMichal Koutný
Since commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") we add cfs_rqs with no runnable tasks but not fully decayed into the load (leaf) list. We may ignore adding some ancestors and therefore breaking tmp_alone_branch invariant. This broke LTP test cfs_bandwidth01 and it was partially fixed in commit fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after unthrottling"). I noticed the named test still fails even with the fix (but with low probability, 1 in ~1000 executions of the test). The reason is when bailing out of unthrottle_cfs_rq early, we may miss adding ancestors of the unthrottled cfs_rq, thus, not joining tmp_alone_branch properly. Fix this by adding ancestors if we notice the unthrottled cfs_rq was added to the load list. Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Odin Ugedal <odin@uged.al> Link: https://lore.kernel.org/r/20210917153037.11176-1-mkoutny@suse.com
2021-10-01perf/core: fix userpage->time_enabled of inactive eventsSong Liu
Users of rdpmc rely on the mmapped user page to calculate accurate time_enabled. Currently, userpage->time_enabled is only updated when the event is added to the pmu. As a result, inactive event (due to counter multiplexing) does not have accurate userpage->time_enabled. This can be reproduced with something like: /* open 20 task perf_event "cycles", to create multiplexing */ fd = perf_event_open(); /* open task perf_event "cycles" */ userpage = mmap(fd); /* use mmap and rdmpc */ while (true) { time_enabled_mmap = xxx; /* use logic in perf_event_mmap_page */ time_enabled_read = read(fd).time_enabled; if (time_enabled_mmap > time_enabled_read) BUG(); } Fix this by updating userpage for inactive events in merge_sched_in. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-and-tested-by: Lucian Grijincu <lucian@fb.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210929194313.2398474-1-songliubraving@fb.com