summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-04-02bpf: Explicitly memset some bpf info structures declared on the stackGreg Kroah-Hartman
commit 5c6f25887963f15492b604dd25cb149c501bbabf upstream. Trying to initialize a structure with "= {};" will not always clean out all padding locations in a structure. So be explicit and call memset to initialize everything for a number of bpf information structures that are then copied from userspace, sometimes from smaller memory locations than the size of the structure. Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320162258.GA794295@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-02bpf: Explicitly memset the bpf_attr structureGreg Kroah-Hartman
commit 8096f229421f7b22433775e928d506f0342e5907 upstream. For the bpf syscall, we are relying on the compiler to properly zero out the bpf_attr union that we copy userspace data into. Unfortunately that doesn't always work properly, padding and other oddities might not be correctly zeroed, and in some tests odd things have been found when the stack is pre-initialized to other values. Fix this by explicitly memsetting the structure to 0 before using it. Reported-by: Maciej Żenczykowski <maze@google.com> Reported-by: John Stultz <john.stultz@linaro.org> Reported-by: Alexander Potapenko <glider@google.com> Reported-by: Alistair Delva <adelva@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://android-review.googlesource.com/c/kernel/common/+/1235490 Link: https://lore.kernel.org/bpf/20200320094813.GA421650@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-02genirq: Fix reference leaks on irq affinity notifiersEdward Cree
commit df81dfcfd6991d547653d46c051bac195cd182c1 upstream. The handling of notify->work did not properly maintain notify->kref in two cases: 1) where the work was already scheduled, another irq_set_affinity_locked() would get the ref and (no-op-ly) schedule the work. Thus when irq_affinity_notify() ran, it would drop the original ref but not the additional one. 2) when cancelling the (old) work in irq_set_affinity_notifier(), if there was outstanding work a ref had been got for it but was never put. Fix both by checking the return values of the work handling functions (schedule_work() for (1) and cancel_work_sync() for (2)) and put the extra ref if the return value indicates preexisting work. Fixes: cd7eab44e994 ("genirq: Add IRQ affinity notifiers") Fixes: 59c39840f5ab ("genirq: Prevent use-after-free and work list corruption") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ben Hutchings <ben@decadent.org.uk> Link: https://lkml.kernel.org/r/24f5983f-2ab5-e83a-44ee-a45b5f9300f5@solarflare.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-02cgroup1: don't call release_agent when it is ""Tycho Andersen
[ Upstream commit 2e5383d7904e60529136727e49629a82058a5607 ] Older (and maybe current) versions of systemd set release_agent to "" when shutting down, but do not set notify_on_release to 0. Since 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()"), we filter out such calls when the user mode helper path is "". However, when used in conjunction with an actual (i.e. non "") STATIC_USERMODEHELPER, the path is never "", so the real usermode helper will be called with argv[0] == "". Let's avoid this by not invoking the release_agent when it is "". Signed-off-by: Tycho Andersen <tycho@tycho.ws> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-02cgroup-v1: cgroup_pidlist_next should update position indexVasily Averin
[ Upstream commit db8dd9697238be70a6b4f9d0284cd89f59c0e070 ] if seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. # mount | grep cgroup # dd if=/mnt/cgroup.procs bs=1 # normal output ... 1294 1295 1296 1304 1382 584+0 records in 584+0 records out 584 bytes copied dd: /mnt/cgroup.procs: cannot skip to specified offset 83 <<< generates end of last line 1383 <<< ... and whole last line once again 0+1 records in 0+1 records out 8 bytes copied dd: /mnt/cgroup.procs: cannot skip to specified offset 1386 <<< generates last line anyway 0+1 records in 0+1 records out 5 bytes copied https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-02futex: Unbreak futex hashingThomas Gleixner
commit 8d67743653dce5a0e7aa500fcccb237cde7ad88e upstream. The recent futex inode life time fix changed the ordering of the futex key union struct members, but forgot to adjust the hash function accordingly, As a result the hashing omits the leading 64bit and even hashes beyond the futex key causing a bad hash distribution which led to a ~100% performance regression. Hand in the futex key pointer instead of a random struct member and make the size calculation based of the struct offset. Fixes: 8019ad13ef7f ("futex: Fix inode life-time issue") Reported-by: Rong Chen <rong.a.chen@intel.com> Decoded-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Rong Chen <rong.a.chen@intel.com> Link: https://lkml.kernel.org/r/87h7yy90ve.fsf@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-02futex: Fix inode life-time issuePeter Zijlstra
commit 8019ad13ef7f64be44d4f892af9c840179009254 upstream. As reported by Jann, ihold() does not in fact guarantee inode persistence. And instead of making it so, replace the usage of inode pointers with a per boot, machine wide, unique inode identifier. This sequence number is global, but shared (file backed) futexes are rare enough that this should not become a performance issue. Reported-by: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-02x86/mm: split vmalloc_sync_all()Joerg Roedel
commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream. Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the vunmap() code-path. While this change was necessary to maintain correctness on x86-32-pae kernels, it also adds additional cycles for architectures that don't need it. Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported severe performance regressions in micro-benchmarks because it now also calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But the vmalloc_sync_all() implementation on x86-64 is only needed for newly created mappings. To avoid the unnecessary work on x86-64 and to gain the performance back, split up vmalloc_sync_all() into two functions: * vmalloc_sync_mappings(), and * vmalloc_sync_unmappings() Most call-sites to vmalloc_sync_all() only care about new mappings being synchronized. The only exception is the new call-site added in the above mentioned commit. Shile Zhang directed us to a report of an 80% regression in reaim throughput. Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES] Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/ Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-20signal: avoid double atomic counter increments for user accountingLinus Torvalds
[ Upstream commit fda31c50292a5062332fa0343c084bd9f46604d9 ] When queueing a signal, we increment both the users count of pending signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount of the user struct itself (because we keep a reference to the user in the signal structure in order to correctly account for it when freeing). That turns out to be fairly expensive, because both of them are atomic updates, and particularly under extreme signal handling pressure on big machines, you can get a lot of cache contention on the user struct. That can then cause horrid cacheline ping-pong when you do these multiple accesses. So change the reference counting to only pin the user for the _first_ pending signal, and to unpin it when the last pending signal is dequeued. That means that when a user sees a lot of concurrent signal queuing - which is the only situation when this matters - the only atomic access needed is generally the 'sigpending' count update. This was noticed because of a particularly odd timing artifact on a dual-socket 96C/192T Cascade Lake platform: when you get into bad contention, on that machine for some reason seems to be much worse when the contention happens in the upper 32-byte half of the cacheline. As a result, the kernel test robot will-it-scale 'signal1' benchmark had an odd performance regression simply due to random alignment of the 'struct user_struct' (and pointed to a completely unrelated and apparently nonsensical commit for the regression). Avoiding the double increments (and decrements on the dequeueing side, of course) makes for much less contention and hugely improved performance on that will-it-scale microbenchmark. Quoting Feng Tang: "It makes a big difference, that the performance score is tripled! bump from original 17000 to 54000. Also the gap between 5.0-rc6 and 5.0-rc6+Jiri's patch is reduced to around 2%" [ The "2% gap" is the odd cacheline placement difference on that platform: under the extreme contention case, the effect of which half of the cacheline was hot was 5%, so with the reduced contention the odd timing artifact is reduced too ] It does help in the non-contended case too, but is not nearly as noticeable. Reported-and-tested-by: Feng Tang <feng.tang@intel.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Philip Li <philip.li@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-20workqueue: don't use wq_select_unbound_cpu() for bound worksHillf Danton
commit aa202f1f56960c60e7befaa0f49c72b8fa11b0a8 upstream. wq_select_unbound_cpu() is designed for unbound workqueues only, but it's wrongly called when using a bound workqueue too. Fixing this ensures work queued to a bound workqueue with cpu=WORK_CPU_UNBOUND always runs on the local CPU. Before, that would happen only if wq_unbound_cpumask happened to include it (likely almost always the case), or was empty, or we got lucky with forced round-robin placement. So restricting /sys/devices/virtual/workqueue/cpumask to a small subset of a machine's CPUs would cause some bound work items to run unexpectedly there. Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs") Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Hillf Danton <hdanton@sina.com> [dj: massage changelog] Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-20cgroup: Iterate tasks that did not finish do_exit()Michal Koutný
commit 9c974c77246460fa6a92c18554c3311c8c83c160 upstream. PF_EXITING is set earlier than actual removal from css_set when a task is exitting. This can confuse cgroup.procs readers who see no PF_EXITING tasks, however, rmdir is checking against css_set membership so it can transitionally fail with EBUSY. Fix this by listing tasks that weren't unlinked from css_set active lists. It may happen that other users of the task iterator (without CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This is equal to the state before commit c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations") but it may be reviewed later. Reported-by: Suren Baghdasaryan <surenb@google.com> Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-20cgroup: cgroup_procs_next should increase position indexVasily Averin
commit 2d4ecb030dcc90fb725ecbfc82ce5d6c37906e0e upstream. If seq_file .next fuction does not change position index, read after some lseek can generate unexpected output: 1) dd bs=1 skip output of each 2nd elements $ dd if=/sys/fs/cgroup/cgroup.procs bs=8 count=1 2 3 4 5 1+0 records in 1+0 records out 8 bytes copied, 0,000267297 s, 29,9 kB/s [test@localhost ~]$ dd if=/sys/fs/cgroup/cgroup.procs bs=1 count=8 2 4 <<< NB! 3 was skipped 6 <<< ... and 5 too 8 <<< ... and 7 8+0 records in 8+0 records out 8 bytes copied, 5,2123e-05 s, 153 kB/s This happen because __cgroup_procs_start() makes an extra extra cgroup_procs_next() call 2) read after lseek beyond end of file generates whole last line. 3) read after lseek into middle of last line generates expected rest of last line and unexpected whole line once again. Additionally patch removes an extra position index changes in __cgroup_procs_start() Cc: stable@vger.kernel.org https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-20cgroup: memcg: net: do not associate sock with unrelated cgroupShakeel Butt
[ Upstream commit e876ecc67db80dfdb8e237f71e5b43bb88ae549c ] We are testing network memory accounting in our setup and noticed inconsistent network memory usage and often unrelated cgroups network usage correlates with testing workload. On further inspection, it seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in irq context specially for cgroup v1. mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context and kind of assumes that this can only happen from sk_clone_lock() and the source sock object has already associated cgroup. However in cgroup v1, where network memory accounting is opt-in, the source sock can be unassociated with any cgroup and the new cloned sock can get associated with unrelated interrupted cgroup. Cgroup v2 can also suffer if the source sock object was created by process in the root cgroup or if sk_alloc() is called in irq context. The fix is to just do nothing in interrupt. WARNING: Please note that about half of the TCP sockets are allocated from the IRQ context, so, memory used by such sockets will not be accouted by the memcg. The stack trace of mem_cgroup_sk_alloc() from IRQ-context: CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1 Hardware name: ... Call Trace: <IRQ> dump_stack+0x57/0x75 mem_cgroup_sk_alloc+0xe9/0xf0 sk_clone_lock+0x2a7/0x420 inet_csk_clone_lock+0x1b/0x110 tcp_create_openreq_child+0x23/0x3b0 tcp_v6_syn_recv_sock+0x88/0x730 tcp_check_req+0x429/0x560 tcp_v6_rcv+0x72d/0xa40 ip6_protocol_deliver_rcu+0xc9/0x400 ip6_input+0x44/0xd0 ? ip6_protocol_deliver_rcu+0x400/0x400 ip6_rcv_finish+0x71/0x80 ipv6_rcv+0x5b/0xe0 ? ip6_sublist_rcv+0x2e0/0x2e0 process_backlog+0x108/0x1e0 net_rx_action+0x26b/0x460 __do_softirq+0x104/0x2a6 do_softirq_own_stack+0x2a/0x40 </IRQ> do_softirq.part.19+0x40/0x50 __local_bh_enable_ip+0x51/0x60 ip6_finish_output2+0x23d/0x520 ? ip6table_mangle_hook+0x55/0x160 __ip6_finish_output+0xa1/0x100 ip6_finish_output+0x30/0xd0 ip6_output+0x73/0x120 ? __ip6_finish_output+0x100/0x100 ip6_xmit+0x2e3/0x600 ? ipv6_anycast_cleanup+0x50/0x50 ? inet6_csk_route_socket+0x136/0x1e0 ? skb_free_head+0x1e/0x30 inet6_csk_xmit+0x95/0xf0 __tcp_transmit_skb+0x5b4/0xb20 __tcp_send_ack.part.60+0xa3/0x110 tcp_send_ack+0x1d/0x20 tcp_rcv_state_process+0xe64/0xe80 ? tcp_v6_connect+0x5d1/0x5f0 tcp_v6_do_rcv+0x1b1/0x3f0 ? tcp_v6_do_rcv+0x1b1/0x3f0 __release_sock+0x7f/0xd0 release_sock+0x30/0xa0 __inet_stream_connect+0x1c3/0x3b0 ? prepare_to_wait+0xb0/0xb0 inet_stream_connect+0x3b/0x60 __sys_connect+0x101/0x120 ? __sys_getsockopt+0x11b/0x140 __x64_sys_connect+0x1a/0x20 do_syscall_64+0x51/0x200 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The stack trace of mem_cgroup_sk_alloc() from IRQ-context: Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking") Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-11kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logicMasami Hiramatsu
[ Upstream commit e4add247789e4ba5e08ad8256183ce2e211877d4 ] optimize_kprobe() and unoptimize_kprobe() cancels if a given kprobe is on the optimizing_list or unoptimizing_list already. However, since the following commit: f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code") modified the update timing of the KPROBE_FLAG_OPTIMIZED, it doesn't work as expected anymore. The optimized_kprobe could be in the following states: - [optimizing]: Before inserting jump instruction op.kp->flags has KPROBE_FLAG_OPTIMIZED and op->list is not empty. - [optimized]: jump inserted op.kp->flags has KPROBE_FLAG_OPTIMIZED and op->list is empty. - [unoptimizing]: Before removing jump instruction (including unused optprobe) op.kp->flags has KPROBE_FLAG_OPTIMIZED and op->list is not empty. - [unoptimized]: jump removed op.kp->flags doesn't have KPROBE_FLAG_OPTIMIZED and op->list is empty. Current code mis-expects [unoptimizing] state doesn't have KPROBE_FLAG_OPTIMIZED, and that can cause incorrect results. To fix this, introduce optprobe_queued_unopt() to distinguish [optimizing] and [unoptimizing] states and fixes the logic in optimize_kprobe() and unoptimize_kprobe(). [ mingo: Cleaned up the changelog and the code a bit. ] Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Fixes: f66c0447cca1 ("kprobes: Set unoptimized flag after unoptimizing code") Link: https://lkml.kernel.org/r/157840814418.7181.13478003006386303481.stgit@devnote2 Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-11audit: always check the netlink payload length in audit_receive_msg()Paul Moore
[ Upstream commit 756125289285f6e55a03861bf4b6257aa3d19a93 ] This patch ensures that we always check the netlink payload length in audit_receive_msg() before we take any action on the payload itself. Cc: stable@vger.kernel.org Reported-by: syzbot+399c44bf1f43b8747403@syzkaller.appspotmail.com Reported-by: syzbot+e4b12d8d202701f08b6d@syzkaller.appspotmail.com Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-03-11kprobes: Set unoptimized flag after unoptimizing codeMasami Hiramatsu
commit f66c0447cca1281116224d474cdb37d6a18e4b5b upstream. Set the unoptimized flag after confirming the code is completely unoptimized. Without this fix, when a kprobe hits the intermediate modified instruction (the first byte is replaced by an INT3, but later bytes can still be a jump address operand) while unoptimizing, it can return to the middle byte of the modified code, which causes an invalid instruction exception in the kernel. Usually, this is a rare case, but if we put a probe on the function call while text patching, it always causes a kernel panic as below: # echo p text_poke+5 > kprobe_events # echo 1 > events/kprobes/enable # echo 0 > events/kprobes/enable invalid opcode: 0000 [#1] PREEMPT SMP PTI RIP: 0010:text_poke+0x9/0x50 Call Trace: arch_unoptimize_kprobe+0x22/0x28 arch_unoptimize_kprobes+0x39/0x87 kprobe_optimizer+0x6e/0x290 process_one_work+0x2a0/0x610 worker_thread+0x28/0x3d0 ? process_one_work+0x610/0x610 kthread+0x10d/0x130 ? kthread_park+0x80/0x80 ret_from_fork+0x3a/0x50 text_poke() is used for patching the code in optprobes. This can happen even if we blacklist text_poke() and other functions, because there is a small time window during which we show the intermediate code to other CPUs. [ mingo: Edited the changelog. ] Tested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bristot@redhat.com Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing") Link: https://lkml.kernel.org/r/157483422375.25881.13508326028469515760.stgit@devnote2 Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-11tracing: Disable trace_printk() on post poned testsSteven Rostedt (VMware)
commit 78041c0c9e935d9ce4086feeff6c569ed88ddfd4 upstream. The tracing seftests checks various aspects of the tracing infrastructure, and one is filtering. If trace_printk() is active during a self test, it can cause the filtering to fail, which will disable that part of the trace. To keep the selftests from failing because of trace_printk() calls, trace_printk() checks the variable tracing_selftest_running, and if set, it does not write to the tracing buffer. As some tracers were registered earlier in boot, the selftest they triggered would fail because not all the infrastructure was set up for the full selftest. Thus, some of the tests were post poned to when their infrastructure was ready (namely file system code). The postpone code did not set the tracing_seftest_running variable, and could fail if a trace_printk() was added and executed during their run. Cc: stable@vger.kernel.org Fixes: 9afecfbb95198 ("tracing: Postpone tracer start-up tests till the system is more robust") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-11audit: fix error handling in audit_data_to_entry()Paul Moore
commit 2ad3e17ebf94b7b7f3f64c050ff168f9915345eb upstream. Commit 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive") combined a number of separate fields in the audit_field struct into a single union. Generally this worked just fine because they are generally mutually exclusive. Unfortunately in audit_data_to_entry() the overlap can be a problem when a specific error case is triggered that causes the error path code to attempt to cleanup an audit_field struct and the cleanup involves attempting to free a stored LSM string (the lsm_str field). Currently the code always has a non-NULL value in the audit_field.lsm_str field as the top of the for-loop transfers a value into audit_field.val (both .lsm_str and .val are part of the same union); if audit_data_to_entry() fails and the audit_field struct is specified to contain a LSM string, but the audit_field.lsm_str has not yet been properly set, the error handling code will attempt to free the bogus audit_field.lsm_str value that was set with audit_field.val at the top of the for-loop. This patch corrects this by ensuring that the audit_field.val is only set when needed (it is cleared when the audit_field struct is allocated with kcalloc()). It also corrects a few other issues to ensure that in case of error the proper error code is returned. Cc: stable@vger.kernel.org Fixes: 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive") Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28genirq/proc: Reject invalid affinity masks (again)Thomas Gleixner
commit cba6437a1854fde5934098ec3bd0ee83af3129f5 upstream. Qian Cai reported that the WARN_ON() in the x86/msi affinity setting code, which catches cases where the affinity setting is not done on the CPU which is the current target of the interrupt, triggers during CPU hotplug stress testing. It turns out that the warning which was added with the commit addressing the MSI affinity race unearthed yet another long standing bug. If user space writes a bogus affinity mask, i.e. it contains no online CPUs, then it calls irq_select_affinity_usr(). This was introduced for ALPHA in eee45269b0f5 ("[PATCH] Alpha: convert to generic irq framework (generic part)") and subsequently made available for all architectures in 18404756765c ("genirq: Expose default irq affinity mask (take 3)") which introduced the circumvention of the affinity setting restrictions for interrupt which cannot be moved in process context. The whole exercise is bogus in various aspects: 1) If the interrupt is already started up then there is absolutely no point to honour a bogus interrupt affinity setting from user space. The interrupt is already assigned to an online CPU and it does not make any sense to reassign it to some other randomly chosen online CPU. 2) If the interupt is not yet started up then there is no point either. A subsequent startup of the interrupt will invoke irq_setup_affinity() anyway which will chose a valid target CPU. So the only correct solution is to just return -EINVAL in case user space wrote an affinity mask which does not contain any online CPUs, except for ALPHA which has it's own magic sauce for this. Fixes: 18404756765c ("genirq: Expose default irq affinity mask (take 3)") Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Qian Cai <cai@lca.pw> Link: https://lkml.kernel.org/r/878sl8xdbm.fsf@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28trigger_next should increase position indexVasily Averin
[ Upstream commit 6722b23e7a2ace078344064a9735fb73e554e9ef ] if seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. Without patch: # dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist # Available triggers: # traceon traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist 6+1 records in 6+1 records out 206 bytes copied, 0.00027916 s, 738 kB/s Notice the printing of "# Available triggers:..." after the line. With the patch: # dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist 2+1 records in 2+1 records out 88 bytes copied, 0.000526867 s, 167 kB/s It only prints the end of the file, and does not restart. Link: http://lkml.kernel.org/r/3c35ee24-dd3a-8119-9c19-552ed253388a@virtuozzo.com https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28ftrace: fpid_next() should increase position indexVasily Averin
[ Upstream commit e4075e8bdffd93a9b6d6e1d52fabedceeca5a91b ] if seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. Without patch: # dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset id no pid 2+1 records in 2+1 records out 10 bytes copied, 0.000213285 s, 46.9 kB/s Notice the "id" followed by "no pid". With the patch: # dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset id 0+1 records in 0+1 records out 3 bytes copied, 0.000202112 s, 14.8 kB/s Notice that it only prints "id" and not the "no pid" afterward. Link: http://lkml.kernel.org/r/4f87c6ad-f114-30bb-8506-c32274ce2992@virtuozzo.com https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28watchdog/softlockup: Enforce that timestamp is valid on bootThomas Gleixner
[ Upstream commit 11e31f608b499f044f24b20be73f1dcab3e43f8a ] Robert reported that during boot the watchdog timestamp is set to 0 for one second which is the indicator for a watchdog reset. The reason for this is that the timestamp is in seconds and the time is taken from sched clock and divided by ~1e9. sched clock starts at 0 which means that for the first second during boot the watchdog timestamp is 0, i.e. reset. Use ULONG_MAX as the reset indicator value so the watchdog works correctly right from the start. ULONG_MAX would only conflict with a real timestamp if the system reaches an uptime of 136 years on 32bit and almost eternity on 64bit. Reported-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/87o8v3uuzl.fsf@nanos.tec.linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28tracing: Fix very unlikely race of registering two stat tracersSteven Rostedt (VMware)
[ Upstream commit dfb6cd1e654315168e36d947471bd2a0ccd834ae ] Looking through old emails in my INBOX, I came across a patch from Luis Henriques that attempted to fix a race of two stat tracers registering the same stat trace (extremely unlikely, as this is done in the kernel, and probably doesn't even exist). The submitted patch wasn't quite right as it needed to deal with clean up a bit better (if two stat tracers were the same, it would have the same files). But to make the code cleaner, all we needed to do is to keep the all_stat_sessions_mutex held for most of the registering function. Link: http://lkml.kernel.org/r/1410299375-20068-1-git-send-email-luis.henriques@canonical.com Fixes: 002bb86d8d42f ("tracing/ftrace: separate events tracing and stats tracing engine") Reported-by: Luis Henriques <luis.henriques@canonical.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28tracing: Fix tracing_stat return values in error handling pathsLuis Henriques
[ Upstream commit afccc00f75bbbee4e4ae833a96c2d29a7259c693 ] tracing_stat_init() was always returning '0', even on the error paths. It now returns -ENODEV if tracing_init_dentry() fails or -ENOMEM if it fails to created the 'trace_stat' debugfs directory. Link: http://lkml.kernel.org/r/1410299381-20108-1-git-send-email-luis.henriques@canonical.com Fixes: ed6f1c996bfe4 ("tracing: Check return value of tracing_init_dentry()") Signed-off-by: Luis Henriques <luis.henriques@canonical.com> [ Pulled from the archeological digging of my INBOX ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28cpu/hotplug, stop_machine: Fix stop_machine vs hotplug orderPeter Zijlstra
[ Upstream commit 45178ac0cea853fe0e405bf11e101bdebea57b15 ] Paul reported a very sporadic, rcutorture induced, workqueue failure. When the planets align, the workqueue rescuer's self-migrate fails and then triggers a WARN for running a work on the wrong CPU. Tejun then figured that set_cpus_allowed_ptr()'s stop_one_cpu() call could be ignored! When stopper->enabled is false, stop_machine will insta complete the work, without actually doing the work. Worse, it will not WARN about this (we really should fix this). It turns out there is a small window where a freshly online'ed CPU is marked 'online' but doesn't yet have the stopper task running: BP AP bringup_cpu() __cpu_up(cpu, idle) --> start_secondary() ... cpu_startup_entry() bringup_wait_for_ap() wait_for_ap_thread() <-- cpuhp_online_idle() while (1) do_idle() ... available to run kthreads ... stop_machine_unpark() stopper->enable = true; Close this by moving the stop_machine_unpark() into cpuhp_online_idle(), such that the stopper thread is ready before we start the idle loop and schedule. Reported-by: "Paul E. McKenney" <paulmck@kernel.org> Debugged-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28padata: Remove broken queue flushingHerbert Xu
commit 07928d9bfc81640bab36f5190e8725894d93b659 upstream. The function padata_flush_queues is fundamentally broken because it cannot force padata users to complete the request that is underway. IOW padata has to passively wait for the completion of any outstanding work. As it stands flushing is used in two places. Its use in padata_stop is simply unnecessary because nothing depends on the queues to be flushed afterwards. The other use in padata_replace is more substantial as we depend on it to free the old pd structure. This patch instead uses the pd->refcnt to dynamically free the pd structure once all requests are complete. Fixes: 2b73b07ab8a4 ("padata: Flush the padata queues actively") Cc: <stable@vger.kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> [dj: leave "pd->pinst = pinst" assignment in padata_alloc_pd()] Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-14perf/core: Fix mlock accounting in perf_mmap()Song Liu
commit 003461559ef7a9bd0239bae35a22ad8924d6e9ad upstream. Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of a perf ring buffer may lead to an integer underflow in locked memory accounting. This may lead to the undesired behaviors, such as failures in BPF map creation. Address this by adjusting the accounting logic to take into account the possibility that the amount of already locked memory may exceed the current limit. Fixes: c4b75479741c ("perf/core: Make the mlock accounting simple again") Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-14clocksource: Prevent double add_timer_on() for watchdog_timerKonstantin Khlebnikov
commit febac332a819f0e764aa4da62757ba21d18c182b upstream. Kernel crashes inside QEMU/KVM are observed: kernel BUG at kernel/time/timer.c:1154! BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on(). At the same time another cpu got: general protection fault: 0000 [#1] SMP PTI of poinson pointer 0xdead000000000200 in: __hlist_del at include/linux/list.h:681 (inlined by) detach_timer at kernel/time/timer.c:818 (inlined by) expire_timers at kernel/time/timer.c:1355 (inlined by) __run_timers at kernel/time/timer.c:1686 (inlined by) run_timer_softirq at kernel/time/timer.c:1699 Unfortunately kernel logs are badly scrambled, stacktraces are lost. Printing the timer->function before the BUG_ON() pointed to clocksource_watchdog(). The execution of clocksource_watchdog() can race with a sequence of clocksource_stop_watchdog() .. clocksource_start_watchdog(): expire_timers() detach_timer(timer, true); timer->entry.pprev = NULL; raw_spin_unlock_irq(&base->lock); call_timer_fn clocksource_watchdog() clocksource_watchdog_kthread() or clocksource_unbind() spin_lock_irqsave(&watchdog_lock, flags); clocksource_stop_watchdog(); del_timer(&watchdog_timer); watchdog_running = 0; spin_unlock_irqrestore(&watchdog_lock, flags); spin_lock_irqsave(&watchdog_lock, flags); clocksource_start_watchdog(); add_timer_on(&watchdog_timer, ...); watchdog_running = 1; spin_unlock_irqrestore(&watchdog_lock, flags); spin_lock(&watchdog_lock); add_timer_on(&watchdog_timer, ...); BUG_ON(timer_pending(timer) || !timer->function); timer_pending() -> true BUG() I.e. inside clocksource_watchdog() watchdog_timer could be already armed. Check timer_pending() before calling add_timer_on(). This is sufficient as all operations are synchronized by watchdog_lock. Fixes: 75c5158f70c0 ("timekeeping: Update clocksource with stop_machine") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-14ftrace: Protect ftrace_graph_hash with ftrace_syncSteven Rostedt (VMware)
[ Upstream commit 54a16ff6f2e50775145b210bcd94d62c3c2af117 ] As function_graph tracer can run when RCU is not "watching", it can not be protected by synchronize_rcu() it requires running a task on each CPU before it can be freed. Calling schedule_on_each_cpu(ftrace_sync) needs to be used. Link: https://lore.kernel.org/r/20200205131110.GT2935@paulmck-ThinkPad-P72 Cc: stable@vger.kernel.org Fixes: b9b0c831bed26 ("ftrace: Convert graph filter to use hash tables") Reported-by: "Paul E. McKenney" <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14ftrace: Add comment to why rcu_dereference_sched() is open codedSteven Rostedt (VMware)
[ Upstream commit 16052dd5bdfa16dbe18d8c1d4cde2ddab9d23177 ] Because the function graph tracer can execute in sections where RCU is not "watching", the rcu_dereference_sched() for the has needs to be open coded. This is fine because the RCU "flavor" of the ftrace hash is protected by its own RCU handling (it does its own little synchronization on every CPU and does not rely on RCU sched). Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14tracing: Annotate ftrace_graph_notrace_hash pointer with __rcuAmol Grover
[ Upstream commit fd0e6852c407dd9aefc594f54ddcc21d84803d3b ] Fix following instances of sparse error kernel/trace/ftrace.c:5667:29: error: incompatible types in comparison kernel/trace/ftrace.c:5813:21: error: incompatible types in comparison kernel/trace/ftrace.c:5868:36: error: incompatible types in comparison kernel/trace/ftrace.c:5870:25: error: incompatible types in comparison Use rcu_dereference_protected to dereference the newly annotated pointer. Link: http://lkml.kernel.org/r/20200205055701.30195-1-frextrite@gmail.com Signed-off-by: Amol Grover <frextrite@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14tracing: Annotate ftrace_graph_hash pointer with __rcuAmol Grover
[ Upstream commit 24a9729f831462b1d9d61dc85ecc91c59037243f ] Fix following instances of sparse error kernel/trace/ftrace.c:5664:29: error: incompatible types in comparison kernel/trace/ftrace.c:5785:21: error: incompatible types in comparison kernel/trace/ftrace.c:5864:36: error: incompatible types in comparison kernel/trace/ftrace.c:5866:25: error: incompatible types in comparison Use rcu_dereference_protected to access the __rcu annotated pointer. Link: http://lkml.kernel.org/r/20200201072703.17330-1-frextrite@gmail.com Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Amol Grover <frextrite@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-14alarmtimer: Unregister wakeup source when module get failsStephen Boyd
commit 6b6d188aae79a630957aefd88ff5c42af6553ee3 upstream. The alarmtimer_rtc_add_device() function creates a wakeup source and then tries to grab a module reference. If that fails the function returns early with an error code, but fails to remove the wakeup source. Cleanup this exit path so there is no dangling wakeup source, which is named 'alarmtime' left allocated which will conflict with another RTC device that may be registered later. Fixes: 51218298a25e ("alarmtimer: Ensure RTC module is not unloaded") Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200109155910.907-2-swboyd@chromium.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-14irqdomain: Fix a memory leak in irq_domain_push_irq()Kevin Hao
commit 0f394daef89b38d58c91118a2b08b8a1b316703b upstream. Fix a memory leak reported by kmemleak: unreferenced object 0xffff000bc6f50e80 (size 128): comm "kworker/23:2", pid 201, jiffies 4294894947 (age 942.132s) hex dump (first 32 bytes): 00 00 00 00 41 00 00 00 86 c0 03 00 00 00 00 00 ....A........... 00 a0 b2 c6 0b 00 ff ff 40 51 fd 10 00 80 ff ff ........@Q...... backtrace: [<00000000e62d2240>] kmem_cache_alloc_trace+0x1a4/0x320 [<00000000279143c9>] irq_domain_push_irq+0x7c/0x188 [<00000000d9f4c154>] thunderx_gpio_probe+0x3ac/0x438 [<00000000fd09ec22>] pci_device_probe+0xe4/0x198 [<00000000d43eca75>] really_probe+0xdc/0x320 [<00000000d3ebab09>] driver_probe_device+0x5c/0xf0 [<000000005b3ecaa0>] __device_attach_driver+0x88/0xc0 [<000000004e5915f5>] bus_for_each_drv+0x7c/0xc8 [<0000000079d4db41>] __device_attach+0xe4/0x140 [<00000000883bbda9>] device_initial_probe+0x18/0x20 [<000000003be59ef6>] bus_probe_device+0x98/0xa0 [<0000000039b03d3f>] deferred_probe_work_func+0x74/0xa8 [<00000000870934ce>] process_one_work+0x1c8/0x470 [<00000000e3cce570>] worker_thread+0x1f8/0x428 [<000000005d64975e>] kthread+0xfc/0x128 [<00000000f0eaa764>] ret_from_fork+0x10/0x18 Fixes: 495c38d3001f ("irqdomain: Add irq_domain_{push,pop}_irq() functions") Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200120043547.22271-1-haokexin@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-14tracing: Fix sched switch start/stop refcount racy updatesMathieu Desnoyers
commit 64ae572bc7d0060429e40e1c8d803ce5eb31a0d6 upstream. Reading the sched_cmdline_ref and sched_tgid_ref initial state within tracing_start_sched_switch without holding the sched_register_mutex is racy against concurrent updates, which can lead to tracepoint probes being registered more than once (and thus trigger warnings within tracepoint.c). [ May be the fix for this bug ] Link: https://lore.kernel.org/r/000000000000ab6f84056c786b93@google.com Link: http://lkml.kernel.org/r/20190817141208.15226-1-mathieu.desnoyers@efficios.com Cc: stable@vger.kernel.org CC: Steven Rostedt (VMware) <rostedt@goodmis.org> CC: Joel Fernandes (Google) <joel@joelfernandes.org> CC: Peter Zijlstra <peterz@infradead.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Paul E. McKenney <paulmck@linux.ibm.com> Reported-by: syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com Fixes: d914ba37d7145 ("tracing: Add support for recording tgid of tasks") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-14kernel/module: Fix memleak in module_add_modinfo_attrs()YueHaibing
[ Upstream commit f6d061d617124abbd55396a3bc37b9bf7d33233c ] In module_add_modinfo_attrs() if sysfs_create_file() fails on the first iteration of the loop (so i = 0), we forget to free the modinfo_attrs. Fixes: bc6f2a757d52 ("kernel/module: Fix mem leak in module_add_modinfo_attrs") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-05cgroup: Prevent double killing of css when enabling threaded cgroupMichal Koutný
commit 3bc0bb36fa30e95ca829e9cf480e1ef7f7638333 upstream. The test_cgcore_no_internal_process_constraint_on_threads selftest when running with subsystem controlling noise triggers two warnings: > [ 597.443115] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3131 cgroup_apply_control_enable+0xe0/0x3f0 > [ 597.443413] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3177 cgroup_apply_control_disable+0xa6/0x160 Both stem from a call to cgroup_type_write. The first warning was also triggered by syzkaller. When we're switching cgroup to threaded mode shortly after a subsystem was disabled on it, we can see the respective subsystem css dying there. The warning in cgroup_apply_control_enable is harmless in this case since we're not adding new subsys anyway. The warning in cgroup_apply_control_disable indicates an attempt to kill css of recently disabled subsystem repeatedly. The commit prevents these situations by making cgroup_type_write wait for all dying csses to go away before re-applying subtree controls. When at it, the locations of WARN_ON_ONCE calls are moved so that warning is triggered only when we are about to misuse the dying css. Reported-by: syzbot+5493b2a54d31d6aea629@syzkaller.appspotmail.com Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-27irqdomain: Add the missing assignment of domain->fwnode for named fwnodeDexuan Cui
[ Upstream commit 711419e504ebd68c8f03656616829c8ad7829389 ] Recently device pass-through stops working for Linux VM running on Hyper-V. git-bisect shows the regression is caused by the recent commit 467a3bb97432 ("PCI: hv: Allocate a named fwnode ..."), but the root cause is that the commit d59f6617eef0 forgets to set the domain->fwnode for IRQCHIP_FWNODE_NAMED*, and as a result: 1. The domain->fwnode remains to be NULL. 2. irq_find_matching_fwspec() returns NULL since "h->fwnode == fwnode" is false, and pci_set_bus_msi_domain() sets the Hyper-V PCI root bus's msi_domain to NULL. 3. When the device is added onto the root bus, the device's dev->msi_domain is set to NULL in pci_set_msi_domain(). 4. When a device driver tries to enable MSI-X, pci_msi_setup_msi_irqs() calls arch_setup_msi_irqs(), which uses the native MSI chip (i.e. arch/x86/kernel/apic/msi.c: pci_msi_controller) to set up the irqs, but actually pci_msi_setup_msi_irqs() is supposed to call msi_domain_alloc_irqs() with the hbus->irq_domain, which is created in hv_pcie_init_irq_domain() and is associated with the Hyper-V chip hv_msi_irq_chip. Consequently, the irq line is not properly set up, and the device driver can not receive any interrupt. Fixes: d59f6617eef0 ("genirq: Allow fwnode to carry name information only") Fixes: 467a3bb97432 ("PCI: hv: Allocate a named fwnode instead of an address-based one") Reported-by: Lili Deng <v-lide@microsoft.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/PU1P153MB01694D9AF625AC335C600C5FBFBE0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27signal: Allow cifs and drbd to receive their terminating signalsEric W. Biederman
[ Upstream commit 33da8e7c814f77310250bb54a9db36a44c5de784 ] My recent to change to only use force_sig for a synchronous events wound up breaking signal reception cifs and drbd. I had overlooked the fact that by default kthreads start out with all signals set to SIG_IGN. So a change I thought was safe turned out to have made it impossible for those kernel thread to catch their signals. Reverting the work on force_sig is a bad idea because what the code was doing was very much a misuse of force_sig. As the way force_sig ultimately allowed the signal to happen was to change the signal handler to SIG_DFL. Which after the first signal will allow userspace to send signals to these kernel threads. At least for wake_ack_receiver in drbd that does not appear actively wrong. So correct this problem by adding allow_kernel_signal that will allow signals whose siginfo reports they were sent by the kernel through, but will not allow userspace generated signals, and update cifs and drbd to call allow_kernel_signal in an appropriate place so that their thread can receive this signal. Fixing things this way ensures that userspace won't be able to send signals and cause problems, that it is clear which signals the threads are expecting to receive, and it guarantees that nothing else in the system will be affected. This change was partly inspired by similar cifs and drbd patches that added allow_signal. Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com> Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Cc: Steve French <smfrench@gmail.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: David Laight <David.Laight@ACULAB.COM> Fixes: 247bc9470b1e ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes") Fixes: 72abe3bcf091 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig") Fixes: fee109901f39 ("signal/drbd: Use send_sig not force_sig") Fixes: 3cf5d076fb4d ("signal: Remove task parameter from force_sig") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27perf/ioctl: Add check for the sample_period valueRavi Bangoria
[ Upstream commit 913a90bc5a3a06b1f04c337320e9aeee2328dd77 ] perf_event_open() limits the sample_period to 63 bits. See: 0819b2e30ccb ("perf: Limit perf_event_attr::sample_period to 63 bits") Make ioctl() consistent with it. Also on PowerPC, negative sample_period could cause a recursive PMIs leading to a hang (reported when running perf-fuzzer). Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: maddy@linux.vnet.ibm.com Cc: mpe@ellerman.id.au Fixes: 0819b2e30ccb ("perf: Limit perf_event_attr::sample_period to 63 bits") Link: https://lkml.kernel.org/r/20190604042953.914-1-ravi.bangoria@linux.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27kdb: do a sanity check on the cpu in kdb_per_cpu()Dan Carpenter
[ Upstream commit b586627e10f57ee3aa8f0cfab0d6f7dc4ae63760 ] The "whichcpu" comes from argv[3]. The cpu_online() macro looks up the cpu in a bitmap of online cpus, but if the value is too high then it could read beyond the end of the bitmap and possibly Oops. Fixes: 5d5314d6795f ("kdb: core for kgdb back end (1 of 2)") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-23tick/sched: Annotate lockless access to last_jiffies_updateEric Dumazet
commit de95a991bb72e009f47e0c4bbc90fc5f594588d5 upstream. syzbot (KCSAN) reported a data-race in tick_do_update_jiffies64(): BUG: KCSAN: data-race in tick_do_update_jiffies64 / tick_do_update_jiffies64 write to 0xffffffff8603d008 of 8 bytes by interrupt on cpu 1: tick_do_update_jiffies64+0x100/0x250 kernel/time/tick-sched.c:73 tick_sched_do_timer+0xd4/0xe0 kernel/time/tick-sched.c:138 tick_sched_timer+0x43/0xe0 kernel/time/tick-sched.c:1292 __run_hrtimer kernel/time/hrtimer.c:1514 [inline] __hrtimer_run_queues+0x274/0x5f0 kernel/time/hrtimer.c:1576 hrtimer_interrupt+0x22a/0x480 kernel/time/hrtimer.c:1638 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1110 [inline] smp_apic_timer_interrupt+0xdc/0x280 arch/x86/kernel/apic/apic.c:1135 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830 arch_local_irq_restore arch/x86/include/asm/paravirt.h:756 [inline] kcsan_setup_watchpoint+0x1d4/0x460 kernel/kcsan/core.c:436 check_access kernel/kcsan/core.c:466 [inline] __tsan_read1 kernel/kcsan/core.c:593 [inline] __tsan_read1+0xc2/0x100 kernel/kcsan/core.c:593 kallsyms_expand_symbol.constprop.0+0x70/0x160 kernel/kallsyms.c:79 kallsyms_lookup_name+0x7f/0x120 kernel/kallsyms.c:170 insert_report_filterlist kernel/kcsan/debugfs.c:155 [inline] debugfs_write+0x14b/0x2d0 kernel/kcsan/debugfs.c:256 full_proxy_write+0xbd/0x100 fs/debugfs/file.c:225 __vfs_write+0x67/0xc0 fs/read_write.c:494 vfs_write fs/read_write.c:558 [inline] vfs_write+0x18a/0x390 fs/read_write.c:542 ksys_write+0xd5/0x1b0 fs/read_write.c:611 __do_sys_write fs/read_write.c:623 [inline] __se_sys_write fs/read_write.c:620 [inline] __x64_sys_write+0x4c/0x60 fs/read_write.c:620 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 read to 0xffffffff8603d008 of 8 bytes by task 0 on cpu 0: tick_do_update_jiffies64+0x2b/0x250 kernel/time/tick-sched.c:62 tick_nohz_update_jiffies kernel/time/tick-sched.c:505 [inline] tick_nohz_irq_enter kernel/time/tick-sched.c:1257 [inline] tick_irq_enter+0x139/0x1c0 kernel/time/tick-sched.c:1274 irq_enter+0x4f/0x60 kernel/softirq.c:354 entering_irq arch/x86/include/asm/apic.h:517 [inline] entering_ack_irq arch/x86/include/asm/apic.h:523 [inline] smp_apic_timer_interrupt+0x55/0x280 arch/x86/kernel/apic/apic.c:1133 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830 native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:60 arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571 default_idle_call+0x1e/0x40 kernel/sched/idle.c:94 cpuidle_idle_call kernel/sched/idle.c:154 [inline] do_idle+0x1af/0x280 kernel/sched/idle.c:263 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355 rest_init+0xec/0xf6 init/main.c:452 arch_call_rest_init+0x17/0x37 start_kernel+0x838/0x85e init/main.c:786 x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490 x86_64_start_kernel+0x72/0x76 arch/x86/kernel/head64.c:471 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc7+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Use READ_ONCE() and WRITE_ONCE() to annotate this expected race. Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20191205045619.204946-1-edumazet@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()Christian Brauner
commit 6b3ad6649a4c75504edeba242d3fd36b3096a57f upstream. Commit 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat") introduced the ability to opt out of audit messages for accesses to various proc files since they are not violations of policy. While doing so it somehow switched the check from ns_capable() to has_ns_capability{_noaudit}(). That means it switched from checking the subjective credentials of the task to using the objective credentials. This is wrong since. ptrace_has_cap() is currently only used in ptrace_may_access() And is used to check whether the calling task (subject) has the CAP_SYS_PTRACE capability in the provided user namespace to operate on the target task (object). According to the cred.h comments this would mean the subjective credentials of the calling task need to be used. This switches ptrace_has_cap() to use security_capable(). Because we only call ptrace_has_cap() in ptrace_may_access() and in there we already have a stable reference to the calling task's creds under rcu_read_lock() there's no need to go through another series of dereferences and rcu locking done in ns_capable{_noaudit}(). As one example where this might be particularly problematic, Jann pointed out that in combination with the upcoming IORING_OP_OPENAT feature, this bug might allow unprivileged users to bypass the capability checks while asynchronously opening files like /proc/*/mem, because the capability checks for this would be performed against kernel credentials. To illustrate on the former point about this being exploitable: When io_uring creates a new context it records the subjective credentials of the caller. Later on, when it starts to do work it creates a kernel thread and registers a callback. The callback runs with kernel creds for ktask->real_cred and ktask->cred. To prevent this from becoming a full-blown 0-day io_uring will call override_cred() and override ktask->cred with the subjective credentials of the creator of the io_uring instance. With ptrace_has_cap() currently looking at ktask->real_cred this override will be ineffective and the caller will be able to open arbitray proc files as mentioned above. Luckily, this is currently not exploitable but will turn into a 0-day once IORING_OP_OPENAT{2} land in v5.6. Fix it now! Cc: Oleg Nesterov <oleg@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: stable@vger.kernel.org Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Jann Horn <jannh@google.com> Fixes: 69f594a38967 ("ptrace: do not audit capability check when outputing /proc/pid/stat") Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-14tracing: Have stack tracer compile when MCOUNT_INSN_SIZE is not definedSteven Rostedt (VMware)
commit b8299d362d0837ae39e87e9019ebe6b736e0f035 upstream. On some archs with some configurations, MCOUNT_INSN_SIZE is not defined, and this makes the stack tracer fail to compile. Just define it to zero in this case. Link: https://lore.kernel.org/r/202001020219.zvE3vsty%lkp@intel.com Cc: stable@vger.kernel.org Fixes: 4df297129f622 ("tracing: Remove most or all of stack tracer stack size from stack_max_size") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-14kernel/trace: Fix do not unregister tracepoints when register ↵Kaitao Cheng
sched_migrate_task fail commit 50f9ad607ea891a9308e67b81f774c71736d1098 upstream. In the function, if register_trace_sched_migrate_task() returns error, sched_switch/sched_wakeup_new/sched_wakeup won't unregister. That is why fail_deprobe_sched_switch was added. Link: http://lkml.kernel.org/r/20191231133530.2794-1-pilgrimtao@gmail.com Cc: stable@vger.kernel.org Fixes: 478142c39c8c2 ("tracing: do not grab lock in wakeup latency function tracing") Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-12bpf: Fix passing modified ctx to ld/abs/ind instructionDaniel Borkmann
commit 6d4f151acf9a4f6fab09b615f246c717ddedcf0c upstream. Anatoly has been fuzzing with kBdysch harness and reported a KASAN slab oob in one of the outcomes: [...] [ 77.359642] BUG: KASAN: slab-out-of-bounds in bpf_skb_load_helper_8_no_cache+0x71/0x130 [ 77.360463] Read of size 4 at addr ffff8880679bac68 by task bpf/406 [ 77.361119] [ 77.361289] CPU: 2 PID: 406 Comm: bpf Not tainted 5.5.0-rc2-xfstests-00157-g2187f215eba #1 [ 77.362134] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 77.362984] Call Trace: [ 77.363249] dump_stack+0x97/0xe0 [ 77.363603] print_address_description.constprop.0+0x1d/0x220 [ 77.364251] ? bpf_skb_load_helper_8_no_cache+0x71/0x130 [ 77.365030] ? bpf_skb_load_helper_8_no_cache+0x71/0x130 [ 77.365860] __kasan_report.cold+0x37/0x7b [ 77.366365] ? bpf_skb_load_helper_8_no_cache+0x71/0x130 [ 77.366940] kasan_report+0xe/0x20 [ 77.367295] bpf_skb_load_helper_8_no_cache+0x71/0x130 [ 77.367821] ? bpf_skb_load_helper_8+0xf0/0xf0 [ 77.368278] ? mark_lock+0xa3/0x9b0 [ 77.368641] ? kvm_sched_clock_read+0x14/0x30 [ 77.369096] ? sched_clock+0x5/0x10 [ 77.369460] ? sched_clock_cpu+0x18/0x110 [ 77.369876] ? bpf_skb_load_helper_8+0xf0/0xf0 [ 77.370330] ___bpf_prog_run+0x16c0/0x28f0 [ 77.370755] __bpf_prog_run32+0x83/0xc0 [ 77.371153] ? __bpf_prog_run64+0xc0/0xc0 [ 77.371568] ? match_held_lock+0x1b/0x230 [ 77.371984] ? rcu_read_lock_held+0xa1/0xb0 [ 77.372416] ? rcu_is_watching+0x34/0x50 [ 77.372826] sk_filter_trim_cap+0x17c/0x4d0 [ 77.373259] ? sock_kzfree_s+0x40/0x40 [ 77.373648] ? __get_filter+0x150/0x150 [ 77.374059] ? skb_copy_datagram_from_iter+0x80/0x280 [ 77.374581] ? do_raw_spin_unlock+0xa5/0x140 [ 77.375025] unix_dgram_sendmsg+0x33a/0xa70 [ 77.375459] ? do_raw_spin_lock+0x1d0/0x1d0 [ 77.375893] ? unix_peer_get+0xa0/0xa0 [ 77.376287] ? __fget_light+0xa4/0xf0 [ 77.376670] __sys_sendto+0x265/0x280 [ 77.377056] ? __ia32_sys_getpeername+0x50/0x50 [ 77.377523] ? lock_downgrade+0x350/0x350 [ 77.377940] ? __sys_setsockopt+0x2a6/0x2c0 [ 77.378374] ? sock_read_iter+0x240/0x240 [ 77.378789] ? __sys_socketpair+0x22a/0x300 [ 77.379221] ? __ia32_sys_socket+0x50/0x50 [ 77.379649] ? mark_held_locks+0x1d/0x90 [ 77.380059] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 77.380536] __x64_sys_sendto+0x74/0x90 [ 77.380938] do_syscall_64+0x68/0x2a0 [ 77.381324] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 77.381878] RIP: 0033:0x44c070 [...] After further debugging, turns out while in case of other helper functions we disallow passing modified ctx, the special case of ld/abs/ind instruction which has similar semantics (except r6 being the ctx argument) is missing such check. Modified ctx is impossible here as bpf_skb_load_helper_8_no_cache() and others are expecting skb fields in original position, hence, add check_ctx_reg() to reject any modified ctx. Issue was first introduced back in f1174f77b50c ("bpf/verifier: rework value tracking"). Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200106215157.3553-1-daniel@iogearbox.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-12bpf: reject passing modified ctx to helper functionsDaniel Borkmann
commit 58990d1ff3f7896ee341030e9a7c2e4002570683 upstream. As commit 28e33f9d78ee ("bpf: disallow arithmetic operations on context pointer") already describes, f1174f77b50c ("bpf/verifier: rework value tracking") removed the specific white-listed cases we had previously where we would allow for pointer arithmetic in order to further generalize it, and allow e.g. context access via modified registers. While the dereferencing of modified context pointers had been forbidden through 28e33f9d78ee, syzkaller did recently manage to trigger several KASAN splats for slab out of bounds access and use after frees by simply passing a modified context pointer to a helper function which would then do the bad access since verifier allowed it in adjust_ptr_min_max_vals(). Rejecting arithmetic on ctx pointer in adjust_ptr_min_max_vals() generally could break existing programs as there's a valid use case in tracing in combination with passing the ctx to helpers as bpf_probe_read(), where the register then becomes unknown at verification time due to adding a non-constant offset to it. An access sequence may look like the following: offset = args->filename; /* field __data_loc filename */ bpf_probe_read(&dst, len, (char *)args + offset); // args is ctx There are two options: i) we could special case the ctx and as soon as we add a constant or bounded offset to it (hence ctx type wouldn't change) we could turn the ctx into an unknown scalar, or ii) we generalize the sanity test for ctx member access into a small helper and assert it on the ctx register that was passed as a function argument. Fwiw, latter is more obvious and less complex at the same time, and one case that may potentially be legitimate in future for ctx member access at least would be for ctx to carry a const offset. Therefore, fix follows approach from ii) and adds test cases to BPF kselftests. Fixes: f1174f77b50c ("bpf/verifier: rework value tracking") Reported-by: syzbot+3d0b2441dbb71751615e@syzkaller.appspotmail.com Reported-by: syzbot+c8504affd4fdd0c1b626@syzkaller.appspotmail.com Reported-by: syzbot+e5190cb881d8660fb1a3@syzkaller.appspotmail.com Reported-by: syzbot+efae31b384d5badbd620@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-12locking/spinlock/debug: Fix various data racesMarco Elver
[ Upstream commit 1a365e822372ba24c9da0822bc583894f6f3d821 ] This fixes various data races in spinlock_debug. By testing with KCSAN, it is observable that the console gets spammed with data races reports, suggesting these are extremely frequent. Example data race report: read to 0xffff8ab24f403c48 of 4 bytes by task 221 on cpu 2: debug_spin_lock_before kernel/locking/spinlock_debug.c:85 [inline] do_raw_spin_lock+0x9b/0x210 kernel/locking/spinlock_debug.c:112 __raw_spin_lock include/linux/spinlock_api_smp.h:143 [inline] _raw_spin_lock+0x39/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:338 [inline] get_partial_node.isra.0.part.0+0x32/0x2f0 mm/slub.c:1873 get_partial_node mm/slub.c:1870 [inline] <snip> write to 0xffff8ab24f403c48 of 4 bytes by task 167 on cpu 3: debug_spin_unlock kernel/locking/spinlock_debug.c:103 [inline] do_raw_spin_unlock+0xc9/0x1a0 kernel/locking/spinlock_debug.c:138 __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:159 [inline] _raw_spin_unlock_irqrestore+0x2d/0x50 kernel/locking/spinlock.c:191 spin_unlock_irqrestore include/linux/spinlock.h:393 [inline] free_debug_processing+0x1b3/0x210 mm/slub.c:1214 __slab_free+0x292/0x400 mm/slub.c:2864 <snip> As a side-effect, with KCSAN, this eventually locks up the console, most likely due to deadlock, e.g. .. -> printk lock -> spinlock_debug -> KCSAN detects data race -> kcsan_print_report() -> printk lock -> deadlock. This fix will 1) avoid the data races, and 2) allow using lock debugging together with KCSAN. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Marco Elver <elver@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: https://lkml.kernel.org/r/20191120155715.28089-1-elver@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09ftrace: Avoid potential division by zero in function profilerWen Yang
commit e31f7939c1c27faa5d0e3f14519eaf7c89e8a69d upstream. The ftrace_profile->counter is unsigned long and do_div truncates it to 32 bits, which means it can test non-zero and be truncated to zero for division. Fix this issue by using div64_ul() instead. Link: http://lkml.kernel.org/r/20200103030248.14516-1-wenyang@linux.alibaba.com Cc: stable@vger.kernel.org Fixes: e330b3bcd8319 ("tracing: Show sample std dev in function profiling") Fixes: 34886c8bc590f ("tracing: add average time in function to function profiler") Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09exit: panic before exit_mm() on global init exitchenqiwu
commit 43cf75d96409a20ef06b756877a2e72b10a026fc upstream. Currently, when global init and all threads in its thread-group have exited we panic via: do_exit() -> exit_notify() -> forget_original_parent() -> find_child_reaper() This makes it hard to extract a useable coredump for global init from a kernel crashdump because by the time we panic exit_mm() will have already released global init's mm. This patch moves the panic futher up before exit_mm() is called. As was the case previously, we only panic when global init and all its threads in the thread-group have exited. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> [christian.brauner@ubuntu.com: fix typo, rewrite commit message] Link: https://lore.kernel.org/r/1576736993-10121-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>