summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2021-02-23futex: Change locking rulesPeter Zijlstra
Currently futex-pi relies on hb->lock to serialize everything. But hb->lock creates another set of problems, especially priority inversions on RT where hb->lock becomes a rt_mutex itself. The rt_mutex::wait_lock is the most obvious protection for keeping the futex user space value and the kernel internal pi_state in sync. Rework and document the locking so rt_mutex::wait_lock is held accross all operations which modify the user space value and the pi state. This allows to invoke rt_mutex_unlock() (including deboost) without holding hb->lock as a next step. Nothing yet relies on the new locking rules. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [Lee: Back-ported in support of a previous futex back-port attempt] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23futex: Ensure the correct return value from futex_lock_pi()Thomas Gleixner
commit 12bb3f7f1b03d5913b3f9d4236a488aa7774dfe9 upstream In case that futex_lock_pi() was aborted by a signal or a timeout and the task returned without acquiring the rtmutex, but is the designated owner of the futex due to a concurrent futex_unlock_pi() fixup_owner() is invoked to establish consistent state. In that case it invokes fixup_pi_state_owner() which in turn tries to acquire the rtmutex again. If that succeeds then it does not propagate this success to fixup_owner() and futex_lock_pi() returns -EINTR or -ETIMEOUT despite having the futex locked. Return success from fixup_pi_state_owner() in all cases where the current task owns the rtmutex and therefore the futex and propagate it correctly through fixup_owner(). Fixup the other callsite which does not expect a positive return value. Fixes: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [Lee: Back-ported in support of a previous futex attempt] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-23fgraph: Initialize tracing_graph_pause at task creationSteven Rostedt (VMware)
commit 7e0a9220467dbcfdc5bc62825724f3e52e50ab31 upstream. On some archs, the idle task can call into cpu_suspend(). The cpu_suspend() will disable or pause function graph tracing, as there's some paths in bringing down the CPU that can have issues with its return address being modified. The task_struct structure has a "tracing_graph_pause" atomic counter, that when set to something other than zero, the function graph tracer will not modify the return address. The problem is that the tracing_graph_pause counter is initialized when the function graph tracer is enabled. This can corrupt the counter for the idle task if it is suspended in these architectures. CPU 1 CPU 2 ----- ----- do_idle() cpu_suspend() pause_graph_tracing() task_struct->tracing_graph_pause++ (0 -> 1) start_graph_tracing() for_each_online_cpu(cpu) { ftrace_graph_init_idle_task(cpu) task-struct->tracing_graph_pause = 0 (1 -> 0) unpause_graph_tracing() task_struct->tracing_graph_pause-- (0 -> -1) The above should have gone from 1 to zero, and enabled function graph tracing again. But instead, it is set to -1, which keeps it disabled. There's no reason that the field tracing_graph_pause on the task_struct can not be initialized at boot up. Cc: stable@vger.kernel.org Fixes: 380c4b1411ccd ("tracing/function-graph-tracer: append the tracing_graph_flag") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211339 Reported-by: pierre.gondois@arm.com Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10kretprobe: Avoid re-registration of the same kretprobe earlierWang ShaoBo
commit 0188b87899ffc4a1d36a0badbe77d56c92fd91dc upstream. Our system encountered a re-init error when re-registering same kretprobe, where the kretprobe_instance in rp->free_instances is illegally accessed after re-init. Implementation to avoid re-registration has been introduced for kprobe before, but lags for register_kretprobe(). We must check if kprobe has been re-registered before re-initializing kretprobe, otherwise it will destroy the data struct of kretprobe registered, which can lead to memory leak, system crash, also some unexpected behaviors. We use check_kprobe_rereg() to check if kprobe has been re-registered before running register_kretprobe()'s body, for giving a warning message and terminate registration process. Link: https://lkml.kernel.org/r/20210128124427.2031088-1-bobo.shaobowang@huawei.com Cc: stable@vger.kernel.org Fixes: 1f0ab40976460 ("kprobes: Prevent re-registration of the same kprobe") [ The above commit should have been done for kretprobes too ] Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@linux.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10elfcore: fix building with clangArnd Bergmann
commit 6e7b64b9dd6d96537d816ea07ec26b7dedd397b9 upstream. kernel/elfcore.c only contains weak symbols, which triggers a bug with clang in combination with recordmcount: Cannot find symbol for section 2: .text. kernel/elfcore.o: failed Move the empty stubs into linux/elfcore.h as inline functions. As only two architectures use these, just use the architecture specific Kconfig symbols to key off the declaration. Link: https://lkml.kernel.org/r/20201204165742.3815221-2-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Barret Rhoden <brho@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10futex: Handle faults correctly for PI futexesThomas Gleixner
fixup_pi_state_owner() tries to ensure that the state of the rtmutex, pi_state and the user space value related to the PI futex are consistent before returning to user space. In case that the user space value update faults and the fault cannot be resolved by faulting the page in via fault_in_user_writeable() the function returns with -EFAULT and leaves the rtmutex and pi_state owner state inconsistent. A subsequent futex_unlock_pi() operates on the inconsistent pi_state and releases the rtmutex despite not owning it which can corrupt the RB tree of the rtmutex and cause a subsequent kernel stack use after free. It was suggested to loop forever in fixup_pi_state_owner() if the fault cannot be resolved, but that results in runaway tasks which is especially undesired when the problem happens due to a programming error and not due to malice. As the user space value cannot be fixed up, the proper solution is to make the rtmutex and the pi_state consistent so both have the same owner. This leaves the user space value out of sync. Any subsequent operation on the futex will fail because the 10th rule of PI futexes (pi_state owner and user space value are consistent) has been violated. As a consequence this removes the inept attempts of 'fixing' the situation in case that the current task owns the rtmutex when returning with an unresolvable fault by unlocking the rtmutex which left pi_state::owner and rtmutex::owner out of sync in a different and only slightly less dangerous way. Fixes: 1b7558e457ed ("futexes: fix fault handling in futex_lock_pi") Reported-by: gzobqq@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10futex: Simplify fixup_pi_state_owner()Thomas Gleixner
[ Upstream commit f2dac39d93987f7de1e20b3988c8685523247ae2 ] Too many gotos already and an upcoming fix would make it even more unreadable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10futex: Use pi_state_update_owner() in put_pi_state()Thomas Gleixner
[ Upstream commit 6ccc84f917d33312eb2846bd7b567639f585ad6d ] No point in open coding it. This way it gains the extra sanity checks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10rtmutex: Remove unused argument from rt_mutex_proxy_unlock()Thomas Gleixner
[ Upstream commit 2156ac1934166d6deb6cd0f6ffc4c1076ec63697 ] Nothing uses the argument. Remove it as preparation to use pi_state_update_owner(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10futex: Provide and use pi_state_update_owner()Thomas Gleixner
[ Upstream commit c5cade200ab9a2a3be9e7f32a752c8d86b502ec7 ] Updating pi_state::owner is done at several places with the same code. Provide a function for it and use that at the obvious places. This is also a preparation for a bug fix to avoid yet another copy of the same code or alternatively introducing a completely unpenetratable mess of gotos. Originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10futex: Replace pointless printk in fixup_owner()Thomas Gleixner
[ Upstream commit 04b79c55201f02ffd675e1231d731365e335c307 ] If that unexpected case of inconsistent arguments ever happens then the futex state is left completely inconsistent and the printk is not really helpful. Replace it with a warning and make the state consistent. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10futex: Avoid violating the 10th rule of futexPeter Zijlstra
commit c1e2f0eaf015fb7076d51a339011f2383e6dd389 upstream. Julia reported futex state corruption in the following scenario: waiter waker stealer (prio > waiter) futex(WAIT_REQUEUE_PI, uaddr, uaddr2, timeout=[N ms]) futex_wait_requeue_pi() futex_wait_queue_me() freezable_schedule() <scheduled out> futex(LOCK_PI, uaddr2) futex(CMP_REQUEUE_PI, uaddr, uaddr2, 1, 0) /* requeues waiter to uaddr2 */ futex(UNLOCK_PI, uaddr2) wake_futex_pi() cmp_futex_value_locked(uaddr2, waiter) wake_up_q() <woken by waker> <hrtimer_wakeup() fires, clears sleeper->task> futex(LOCK_PI, uaddr2) __rt_mutex_start_proxy_lock() try_to_take_rt_mutex() /* steals lock */ rt_mutex_set_owner(lock, stealer) <preempted> <scheduled in> rt_mutex_wait_proxy_lock() __rt_mutex_slowlock() try_to_take_rt_mutex() /* fails, lock held by stealer */ if (timeout && !timeout->task) return -ETIMEDOUT; fixup_owner() /* lock wasn't acquired, so, fixup_pi_state_owner skipped */ return -ETIMEDOUT; /* At this point, we've returned -ETIMEDOUT to userspace, but the * futex word shows waiter to be the owner, and the pi_mutex has * stealer as the owner */ futex_lock(LOCK_PI, uaddr2) -> bails with EDEADLK, futex word says we're owner. And suggested that what commit: 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state") removes from fixup_owner() looks to be just what is needed. And indeed it is -- I completely missed that requeue_pi could also result in this case. So we need to restore that, except that subsequent patches, like commit: 16ffa12d7425 ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock") changed all the locking rules. Even without that, the sequence: - if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) { - locked = 1; - goto out; - } - raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock); - owner = rt_mutex_owner(&q->pi_state->pi_mutex); - if (!owner) - owner = rt_mutex_next_owner(&q->pi_state->pi_mutex); - raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock); - ret = fixup_pi_state_owner(uaddr, q, owner); already suggests there were races; otherwise we'd never have to look at next_owner. So instead of doing 3 consecutive wait_lock sections with who knows what races, we do it all in a single section. Additionally, the usage of pi_state->owner in fixup_owner() was only safe because only the rt_mutex owner would modify it, which this additional case wrecks. Luckily the values can only change away and not to the value we're testing, this means we can do a speculative test and double check once we have the wait_lock. Fixes: 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state") Reported-by: Julia Cartwright <julia@ni.com> Reported-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Julia Cartwright <julia@ni.com> Tested-by: Gratian Crisan <gratian.crisan@ni.com> Cc: Darren Hart <dvhart@infradead.org> Link: https://lkml.kernel.org/r/20171208124939.7livp7no2ov65rrc@hirez.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [Lee: Back-ported to solve a dependency] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10futex: Rework inconsistent rt_mutex/futex_q statePeter Zijlstra
[Upstream commit 73d786bd043ebc855f349c81ea805f6b11cbf2aa ] There is a weird state in the futex_unlock_pi() path when it interleaves with a concurrent futex_lock_pi() at the point where it drops hb->lock. In this case, it can happen that the rt_mutex wait_list and the futex_q disagree on pending waiters, in particular rt_mutex will find no pending waiters where futex_q thinks there are. In this case the rt_mutex unlock code cannot assign an owner. The futex side fixup code has to cleanup the inconsistencies with quite a bunch of interesting corner cases. Simplify all this by changing wake_futex_pi() to return -EAGAIN when this situation occurs. This then gives the futex_lock_pi() code the opportunity to continue and the retried futex_unlock_pi() will now observe a coherent state. The only problem is that this breaks RT timeliness guarantees. That is, consider the following scenario: T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1) CPU0 T1 lock_pi() queue_me() <- Waiter is visible preemption T2 unlock_pi() loops with -EAGAIN forever Which is undesirable for PI primitives. Future patches will rectify this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.850383690@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [Lee: Back-ported to solve a dependency] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10futex: Remove rt_mutex_deadlock_account_*()Peter Zijlstra
These are unused and clutter up the code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.652692478@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [Lee: Back-ported to solve a dependency] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10futex,rt_mutex: Provide futex specific rt_mutex APIPeter Zijlstra
[ Upstream commit 5293c2efda37775346885c7e924d4ef7018ea60b ] Part of what makes futex_unlock_pi() intricate is that rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop rt_mutex::wait_lock. This means it cannot rely on the atomicy of wait_lock, which would be preferred in order to not rely on hb->lock so much. The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can race with the rt_mutex fastpath, however futexes have their own fast path. Since futexes already have a bunch of separate rt_mutex accessors, complete that set and implement a rt_mutex variant without fastpath for them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [Lee: Back-ported to solve a dependency] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03futex: Prevent exit livelockThomas Gleixner
commit 3ef240eaff36b8119ac9e2ea17cbf41179c930ba upstream. Oleg provided the following test case: int main(void) { struct sched_param sp = {}; sp.sched_priority = 2; assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0); int lock = vfork(); if (!lock) { sp.sched_priority = 1; assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0); _exit(0); } syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0); return 0; } This creates an unkillable RT process spinning in futex_lock_pi() on a UP machine or if the process is affine to a single CPU. The reason is: parent child set FIFO prio 2 vfork() -> set FIFO prio 1 implies wait_for_child() sched_setscheduler(...) exit() do_exit() .... mm_release() tsk->futex_state = FUTEX_STATE_EXITING; exit_futex(); (NOOP in this case) complete() --> wakes parent sys_futex() loop infinite because tsk->futex_state == FUTEX_STATE_EXITING The same problem can happen just by regular preemption as well: task holds futex ... do_exit() tsk->futex_state = FUTEX_STATE_EXITING; --> preemption (unrelated wakeup of some other higher prio task, e.g. timer) switch_to(other_task) return to user sys_futex() loop infinite as above Just for the fun of it the futex exit cleanup could trigger the wakeup itself before the task sets its futex state to DEAD. To cure this, the handling of the exiting owner is changed so: - A refcount is held on the task - The task pointer is stored in a caller visible location - The caller drops all locks (hash bucket, mmap_sem) and blocks on task::futex_exit_mutex. When the mutex is acquired then the exiting task has completed the cleanup and the state is consistent and can be reevaluated. This is not a pretty solution, but there is no choice other than returning an error code to user space, which would break the state consistency guarantee and open another can of problems including regressions. For stable backports the preparatory commits ac31c7ff8624 .. ba31c1a48538 are required as well, but for anything older than 5.3.y the backports are going to be provided when this hits mainline as the other dependencies for those kernels are definitely not stable material. Fixes: 778e9a9c3e71 ("pi-futex: fix exit races and locking problems") Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Stable Team <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03futex: Provide distinct return value when owner is exitingThomas Gleixner
commit ac31c7ff8624409ba3c4901df9237a616c187a5d upstream. attach_to_pi_owner() returns -EAGAIN for various cases: - Owner task is exiting - Futex value has changed The caller drops the held locks (hash bucket, mmap_sem) and retries the operation. In case of the owner task exiting this can result in a live lock. As a preparatory step for seperating those cases, provide a distinct return value (EBUSY) for the owner exiting case. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.935606117@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03futex: Add mutex around futex exitThomas Gleixner
commit 3f186d974826847a07bc7964d79ec4eded475ad9 upstream. The mutex will be used in subsequent changes to replace the busy looping of a waiter when the futex owner is currently executing the exit cleanup to prevent a potential live lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.845798895@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03futex: Provide state handling for exec() as wellThomas Gleixner
commit af8cbda2cfcaa5515d61ec500498d46e9a8247e2 upstream. exec() attempts to handle potentially held futexes gracefully by running the futex exit handling code like exit() does. The current implementation has no protection against concurrent incoming waiters. The reason is that the futex state cannot be set to FUTEX_STATE_DEAD after the cleanup because the task struct is still active and just about to execute the new binary. While its arguably buggy when a task holds a futex over exec(), for consistency sake the state handling can at least cover the actual futex exit cleanup section. This provides state consistency protection accross the cleanup. As the futex state of the task becomes FUTEX_STATE_OK after the cleanup has been finished, this cannot prevent subsequent attempts to attach to the task in case that the cleanup was not successfull in mopping up all leftovers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.753355618@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03futex: Sanitize exit state handlingThomas Gleixner
commit 4a8e991b91aca9e20705d434677ac013974e0e30 upstream. Instead of having a smp_mb() and an empty lock/unlock of task::pi_lock move the state setting into to the lock section. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.645603214@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03futex: Mark the begin of futex exit explicitlyThomas Gleixner
commit 18f694385c4fd77a09851fd301236746ca83f3cb upstream. Instead of relying on PF_EXITING use an explicit state for the futex exit and set it in the futex exit function. This moves the smp barrier and the lock/unlock serialization into the futex code. As with the DEAD state this is restricted to the exit path as exec continues to use the same task struct. This allows to simplify that logic in a next step. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03futex: Set task::futex_state to DEAD right after handling futex exitThomas Gleixner
commit f24f22435dcc11389acc87e5586239c1819d217c upstream. Setting task::futex_state in do_exit() is rather arbitrarily placed for no reason. Move it into the futex code. Note, this is only done for the exit cleanup as the exec cleanup cannot set the state to FUTEX_STATE_DEAD because the task struct is still in active use. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03futex: Split futex_mm_release() for exit/execThomas Gleixner
commit 150d71584b12809144b8145b817e83b81158ae5f upstream. To allow separate handling of the futex exit state in the futex exit code for exit and exec, split futex_mm_release() into two functions and invoke them from the corresponding exit/exec_mm_release() callsites. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03exit/exec: Seperate mm_release()Thomas Gleixner
commit 4610ba7ad877fafc0a25a30c6c82015304120426 upstream. mm_release() contains the futex exit handling. mm_release() is called from do_exit()->exit_mm() and from exec()->exec_mm(). In the exit_mm() case PF_EXITING and the futex state is updated. In the exec_mm() case these states are not touched. As the futex exit code needs further protections against exit races, this needs to be split into two functions. Preparatory only, no functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03futex: Replace PF_EXITPIDONE with a stateThomas Gleixner
commit 3d4775df0a89240f671861c6ab6e8d59af8e9e41 upstream. The futex exit handling relies on PF_ flags. That's suboptimal as it requires a smp_mb() and an ugly lock/unlock of the exiting tasks pi_lock in the middle of do_exit() to enforce the observability of PF_EXITING in the futex code. Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic over to the new state. The PF_EXITING dependency will be cleaned up in a later step. This prepares for handling various futex exit issues later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03futex: Move futex exit handling into futex codeThomas Gleixner
commit ba31c1a48538992316cc71ce94fa9cd3e7b427c0 upstream. The futex exit handling is #ifdeffed into mm_release() which is not pretty to begin with. But upcoming changes to address futex exit races need to add more functionality to this exit code. Split it out into a function, move it into futex code and make the various futex exit functions static. Preparatory only and no functional change. Folded build fix from Borislav. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03y2038: futex: Move compat implementation into futex.cArnd Bergmann
commit 04e7712f4460585e5eed5b853fd8b82a9943958f upstream. We are going to share the compat_sys_futex() handler between 64-bit architectures and 32-bit architectures that need to deal with both 32-bit and 64-bit time_t, and this is easier if both entry points are in the same file. In fact, most other system call handlers do the same thing these days, so let's follow the trend here and merge all of futex_compat.c into futex.c. In the process, a few minor changes have to be done to make sure everything still makes sense: handle_futex_death() and futex_cmpxchg_enabled() become local symbol, and the compat version of the fetch_robust_entry() function gets renamed to compat_fetch_robust_entry() to avoid a symbol clash. This is intended as a purely cosmetic patch, no behavior should change. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [Lee: Back-ported to satisfy a build dependency] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30tracing: Fix race in trace_open and buffer resize callGaurav Kohli
commit bbeb97464eefc65f506084fd9f18f21653e01137 upstream. Below race can come, if trace_open and resize of cpu buffer is running parallely on different cpus CPUX CPUY ring_buffer_resize atomic_read(&buffer->resize_disabled) tracing_open tracing_reset_online_cpus ring_buffer_reset_cpu rb_reset_cpu rb_update_pages remove/insert pages resetting pointer This race can cause data abort or some times infinte loop in rb_remove_pages and rb_insert_pages while checking pages for sanity. Take buffer lock to fix this. Link: https://lkml.kernel.org/r/1601976833-24377-1-git-send-email-gkohli@codeaurora.org Cc: stable@vger.kernel.org Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic") Reported-by: Denis Efremov <efremov@linux.com> Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30bpf: Fix buggy rsh min/max bounds trackingDaniel Borkmann
[ no upstream commit ] Fix incorrect bounds tracking for RSH opcode. Commit f23cc643f9ba ("bpf: fix range arithmetic for bpf map access") had a wrong assumption about min/max bounds. The new dst_reg->min_value needs to be derived by right shifting the max_val bounds, not min_val, and likewise new dst_reg->max_value needs to be derived by right shifting the min_val bounds, not max_val. Later stable kernels than 4.9 are not affected since bounds tracking was overall reworked and they already track this similarly as in the fix. Fixes: f23cc643f9ba ("bpf: fix range arithmetic for bpf map access") Reported-by: Ryota Shiga (Flatt Security) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12workqueue: Kick a worker based on the actual activation of delayed worksYunfeng Ye
[ Upstream commit 01341fbd0d8d4e717fc1231cdffe00343088ce0b ] In realtime scenario, We do not want to have interference on the isolated cpu cores. but when invoking alloc_workqueue() for percpu wq on the housekeeping cpu, it kick a kworker on the isolated cpu. alloc_workqueue pwq_adjust_max_active wake_up_worker The comment in pwq_adjust_max_active() said: "Need to kick a worker after thawed or an unbound wq's max_active is bumped" So it is unnecessary to kick a kworker for percpu's wq when invoking alloc_workqueue(). this patch only kick a worker based on the actual activation of delayed works. Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-09module: delay kobject uevent until after module init callJessica Yu
[ Upstream commit 38dc717e97153e46375ee21797aa54777e5498f3 ] Apparently there has been a longstanding race between udev/systemd and the module loader. Currently, the module loader sends a uevent right after sysfs initialization, but before the module calls its init function. However, some udev rules expect that the module has initialized already upon receiving the uevent. This race has been triggered recently (see link in references) in some systemd mount unit files. For instance, the configfs module creates the /sys/kernel/config mount point in its init function, however the module loader issues the uevent before this happens. sys-kernel-config.mount expects to be able to mount /sys/kernel/config upon receipt of the module loading uevent, but if the configfs module has not called its init function yet, then this directory will not exist and the mount unit fails. A similar situation exists for sys-fs-fuse-connections.mount, as the fuse sysfs mount point is created during the fuse module's init function. If udev is faster than module initialization then the mount unit would fail in a similar fashion. To fix this race, delay the module KOBJ_ADD uevent until after the module has finished calling its init routine. References: https://github.com/systemd/systemd/issues/17586 Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-By: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-09module: set MODULE_STATE_GOING state when a module fails to loadMiroslav Benes
[ Upstream commit 5e8ed280dab9eeabc1ba0b2db5dbe9fe6debb6b5 ] If a module fails to load due to an error in prepare_coming_module(), the following error handling in load_module() runs with MODULE_STATE_COMING in module's state. Fix it by correctly setting MODULE_STATE_GOING under "bug_cleanup" label. Signed-off-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-29kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handlingNicholas Piggin
[ Upstream commit 8ff00399b153440c1c83e20c43020385b416415b ] powerpc/64s keeps a counter in the mm which counts bits set in mm_cpumask as well as other things. This means it can't use generic code to clear bits out of the mask and doesn't adjust the arch specific counter. Add an arch override that allows powerpc/64s to use clear_tasks_mm_cpumask(). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201126102530.691335-4-npiggin@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-11tracing: Fix userstacktrace option for instancesSteven Rostedt (VMware)
commit bcee5278958802b40ee8b26679155a6d9231783e upstream. When the instances were able to use their own options, the userstacktrace option was left hardcoded for the top level. This made the instance userstacktrace option bascially into a nop, and will confuse users that set it, but nothing happens (I was confused when it happened to me!) Cc: stable@vger.kernel.org Fixes: 16270145ce6b ("tracing: Add trace options for core options to instances") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-11ftrace: Fix updating FTRACE_FL_TRAMPNaveen N. Rao
commit 4c75b0ff4e4bf7a45b5aef9639799719c28d0073 upstream. On powerpc, kprobe-direct.tc triggered FTRACE_WARN_ON() in ftrace_get_addr_new() followed by the below message: Bad trampoline accounting at: 000000004222522f (wake_up_process+0xc/0x20) (f0000001) The set of steps leading to this involved: - modprobe ftrace-direct-too - enable_probe - modprobe ftrace-direct - rmmod ftrace-direct <-- trigger The problem turned out to be that we were not updating flags in the ftrace record properly. From the above message about the trampoline accounting being bad, it can be seen that the ftrace record still has FTRACE_FL_TRAMP set though ftrace-direct module is going away. This happens because we are checking if any ftrace_ops has the FTRACE_FL_TRAMP flag set _before_ updating the filter hash. The fix for this is to look for any _other_ ftrace_ops that also needs FTRACE_FL_TRAMP. Link: https://lkml.kernel.org/r/56c113aa9c3e10c19144a36d9684c7882bf09af5.1606412433.git.naveen.n.rao@linux.vnet.ibm.com Cc: stable@vger.kernel.org Fixes: a124692b698b0 ("ftrace: Enable trampoline when rec count returns back to one") Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18reboot: fix overflow parsing reboot cpu numberMatteo Croce
commit df5b0ab3e08a156701b537809914b339b0daa526 upstream. Limit the CPU number to num_possible_cpus(), because setting it to a value lower than INT_MAX but higher than NR_CPUS produces the following error on reboot and shutdown: BUG: unable to handle page fault for address: ffffffff90ab1bb0 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1c09067 P4D 1c09067 PUD 1c0a063 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 1 Comm: systemd-shutdow Not tainted 5.9.0-rc8-kvm #110 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:migrate_to_reboot_cpu+0xe/0x60 Code: ea ea 00 48 89 fa 48 c7 c7 30 57 f1 81 e9 fa ef ff ff 66 2e 0f 1f 84 00 00 00 00 00 53 8b 1d d5 ea ea 00 e8 14 33 fe ff 89 da <48> 0f a3 15 ea fc bd 00 48 89 d0 73 29 89 c2 c1 e8 06 65 48 8b 3c RSP: 0018:ffffc90000013e08 EFLAGS: 00010246 RAX: ffff88801f0a0000 RBX: 0000000077359400 RCX: 0000000000000000 RDX: 0000000077359400 RSI: 0000000000000002 RDI: ffffffff81c199e0 RBP: ffffffff81c1e3c0 R08: ffff88801f41f000 R09: ffffffff81c1e348 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 00007f32bedf8830 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f32bedf8980(0000) GS:ffff88801f480000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff90ab1bb0 CR3: 000000001d057000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __do_sys_reboot.cold+0x34/0x5b do_syscall_64+0x2d/0x40 Fixes: 1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic kernel") Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Fabian Frederick <fabf@skynet.be> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Kees Cook <keescook@chromium.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201103214025.116799-3-mcroce@linux.microsoft.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [sudip: use reboot_mode instead of mode] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"Matteo Croce
commit 8b92c4ff4423aa9900cf838d3294fcade4dbda35 upstream. Patch series "fix parsing of reboot= cmdline", v3. The parsing of the reboot= cmdline has two major errors: - a missing bound check can crash the system on reboot - parsing of the cpu number only works if specified last Fix both. This patch (of 2): This reverts commit 616feab753972b97. kstrtoint() and simple_strtoul() have a subtle difference which makes them non interchangeable: if a non digit character is found amid the parsing, the former will return an error, while the latter will just stop parsing, e.g. simple_strtoul("123xyx") = 123. The kernel cmdline reboot= argument allows to specify the CPU used for rebooting, with the syntax `s####` among the other flags, e.g. "reboot=warm,s31,force", so if this flag is not the last given, it's silently ignored as well as the subsequent ones. Fixes: 616feab75397 ("kernel/reboot.c: convert simple_strtoul to kstrtoint") Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Petr Mladek <pmladek@suse.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Mike Rapoport <rppt@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201103214025.116799-2-mcroce@linux.microsoft.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [sudip: use reboot_mode instead of mode] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18perf/core: Fix race in the perf_mmap_close() functionJiri Olsa
commit f91072ed1b7283b13ca57fcfbece5a3b92726143 upstream. There's a possible race in perf_mmap_close() when checking ring buffer's mmap_count refcount value. The problem is that the mmap_count check is not atomic because we call atomic_dec() and atomic_read() separately. perf_mmap_close: ... atomic_dec(&rb->mmap_count); ... if (atomic_read(&rb->mmap_count)) goto out_put; <ring buffer detach> free_uid out_put: ring_buffer_put(rb); /* could be last */ The race can happen when we have two (or more) events sharing same ring buffer and they go through atomic_dec() and then they both see 0 as refcount value later in atomic_read(). Then both will go on and execute code which is meant to be run just once. The code that detaches ring buffer is probably fine to be executed more than once, but the problem is in calling free_uid(), which will later on demonstrate in related crashes and refcount warnings, like: refcount_t: addition on 0; use-after-free. ... RIP: 0010:refcount_warn_saturate+0x6d/0xf ... Call Trace: prepare_creds+0x190/0x1e0 copy_creds+0x35/0x172 copy_process+0x471/0x1a80 _do_fork+0x83/0x3a0 __do_sys_wait4+0x83/0x90 __do_sys_clone+0x85/0xa0 do_syscall_64+0x5b/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Using atomic decrease and check instead of separated calls. Tested-by: Michael Petlan <mpetlan@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Wade Mealing <wmealing@redhat.com> Fixes: 9bb5d40cd93c ("perf: Fix mmap() accounting hole"); Link: https://lore.kernel.org/r/20200916115311.GE2301783@krava [sudip: backport to v4.9.y by using ring_buffer] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18perf/core: Fix a memory leak in perf_event_parse_addr_filter()kiyin(尹亮)
commit 7bdb157cdebbf95a1cd94ed2e01b338714075d00 upstream As shown through runtime testing, the "filename" allocation is not always freed in perf_event_parse_addr_filter(). There are three possible ways that this could happen: - It could be allocated twice on subsequent iterations through the loop, - or leaked on the success path, - or on the failure path. Clean up the code flow to make it obvious that 'filename' is always freed in the reallocation path and in the two return paths as well. We rely on the fact that kfree(NULL) is NOP and filename is initialized with NULL. This fixes the leak. No other side effects expected. [ Dan Carpenter: cleaned up the code flow & added a changelog. ] [ Ingo Molnar: updated the changelog some more. ] Fixes: 375637bc5249 ("perf/core: Introduce address range filtering") Signed-off-by: "kiyin(尹亮)" <kiyin@tencent.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: "Srivatsa S. Bhat" <srivatsa@csail.mit.edu> Cc: Anthony Liguori <aliguori@amazon.com> [sudip: Backported to 4.9: adjust context] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18perf/core: Fix crash when using HW tracing kernel filtersMathieu Poirier
commit 7f635ff187ab6be0b350b3ec06791e376af238ab upstream In function perf_event_parse_addr_filter(), the path::dentry of each struct perf_addr_filter is left unassigned (as it should be) when the pattern being parsed is related to kernel space. But in function perf_addr_filter_match() the same dentries are given to d_inode() where the value is not expected to be NULL, resulting in the following splat: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058 pc : perf_event_mmap+0x2fc/0x5a0 lr : perf_event_mmap+0x2c8/0x5a0 Process uname (pid: 2860, stack limit = 0x000000001cbcca37) Call trace: perf_event_mmap+0x2fc/0x5a0 mmap_region+0x124/0x570 do_mmap+0x344/0x4f8 vm_mmap_pgoff+0xe4/0x110 vm_mmap+0x2c/0x40 elf_map+0x60/0x108 load_elf_binary+0x450/0x12c4 search_binary_handler+0x90/0x290 __do_execve_file.isra.13+0x6e4/0x858 sys_execve+0x3c/0x50 el0_svc_naked+0x30/0x34 This patch is fixing the problem by introducing a new check in function perf_addr_filter_match() to see if the filter's dentry is NULL. Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Cc: miklos@szeredi.hu Cc: namhyung@kernel.org Cc: songliubraving@fb.com Fixes: 9511bce9fe8e ("perf/core: Fix bad use of igrab()") Link: http://lkml.kernel.org/r/1531782831-1186-1-git-send-email-mathieu.poirier@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18perf/core: Fix bad use of igrab()Song Liu
commit 9511bce9fe8e5e6c0f923c09243a713eba560141 upstream As Miklos reported and suggested: "This pattern repeats two times in trace_uprobe.c and in kernel/events/core.c as well: ret = kern_path(filename, LOOKUP_FOLLOW, &path); if (ret) goto fail_address_parse; inode = igrab(d_inode(path.dentry)); path_put(&path); And it's wrong. You can only hold a reference to the inode if you have an active ref to the superblock as well (which is normally through path.mnt) or holding s_umount. This way unmounting the containing filesystem while the tracepoint is active will give you the "VFS: Busy inodes after unmount..." message and a crash when the inode is finally put. Solution: store path instead of inode." This patch fixes the issue in kernel/event/core.c. Reviewed-and-tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Reported-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <kernel-team@fb.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 375637bc5249 ("perf/core: Introduce address range filtering") Link: http://lkml.kernel.org/r/20180418062907.3210386-2-songliubraving@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [sudip: Backported to 4.9: use file_inode()] Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18random32: make prandom_u32() output unpredictableGeorge Spelvin
commit c51f8f88d705e06bd696d7510aff22b33eb8e638 upstream. Non-cryptographic PRNGs may have great statistical properties, but are usually trivially predictable to someone who knows the algorithm, given a small sample of their output. An LFSR like prandom_u32() is particularly simple, even if the sample is widely scattered bits. It turns out the network stack uses prandom_u32() for some things like random port numbers which it would prefer are *not* trivially predictable. Predictability led to a practical DNS spoofing attack. Oops. This patch replaces the LFSR with a homebrew cryptographic PRNG based on the SipHash round function, which is in turn seeded with 128 bits of strong random key. (The authors of SipHash have *not* been consulted about this abuse of their algorithm.) Speed is prioritized over security; attacks are rare, while performance is always wanted. Replacing all callers of prandom_u32() is the quick fix. Whether to reinstate a weaker PRNG for uses which can tolerate it is an open question. Commit f227e3ec3b5c ("random32: update the net random state on interrupt and activity") was an earlier attempt at a solution. This patch replaces it. Reported-by: Amit Klein <aksecurity@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Eric Dumazet <edumazet@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tytso@mit.edu Cc: Florian Westphal <fw@strlen.de> Cc: Marc Plumb <lkml.mplumb@gmail.com> Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity") Signed-off-by: George Spelvin <lkml@sdf.org> Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/ [ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions to prandom.h for later use; merged George's prandom_seed() proposal; inlined siprand_u32(); replaced the net_rand_state[] array with 4 members to fix a build issue; cosmetic cleanups to make checkpatch happy; fixed RANDOM32_SELFTEST build ] [wt: backported to 4.9 -- various context adjustments; timer API change] Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18don't dump the threads that had been already exiting when zapped.Al Viro
commit 77f6ab8b7768cf5e6bdd0e72499270a0671506ee upstream. Coredump logics needs to report not only the registers of the dumping thread, but (since 2.5.43) those of other threads getting killed. Doing that might require extra state saved on the stack in asm glue at kernel entry; signal delivery logics does that (we need to be able to save sigcontext there, at the very least) and so does seccomp. That covers all callers of do_coredump(). Secondary threads get hit with SIGKILL and caught as soon as they reach exit_mm(), which normally happens in signal delivery, so those are also fine most of the time. Unfortunately, it is possible to end up with secondary zapped when it has already entered exit(2) (or, worse yet, is oopsing). In those cases we reach exit_mm() when mm->core_state is already set, but the stack contents is not what we would have in signal delivery. At least on two architectures (alpha and m68k) it leads to infoleaks - we end up with a chunk of kernel stack written into coredump, with the contents consisting of normal C stack frames of the call chain leading to exit_mm() instead of the expected copy of userland registers. In case of alpha we leak 312 bytes of stack. Other architectures (including the regset-using ones) might have similar problems - the normal user of regsets is ptrace and the state of tracee at the time of such calls is special in the same way signal delivery is. Note that had the zapper gotten to the exiting thread slightly later, it wouldn't have been included into coredump anyway - we skip the threads that have already cleared their ->mm. So let's pretend that zapper always loses the race. IOW, have exit_mm() only insert into the dumper list if we'd gotten there from handling a fatal signal[*] As the result, the callers of do_exit() that have *not* gone through get_signal() are not seen by coredump logics as secondary threads. Which excludes voluntary exit()/oopsen/traps/etc. The dumper thread itself is unaffected by that, so seccomp is fine. [*] originally I intended to add a new flag in tsk->flags, but ebiederman pointed out that PF_SIGNALED is already doing just what we need. Cc: stable@vger.kernel.org Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3") History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-18perf: Fix get_recursion_context()Peter Zijlstra
[ Upstream commit ce0f17fc93f63ee91428af10b7b2ddef38cd19e5 ] One should use in_serving_softirq() to detect SoftIRQ context. Fixes: 96f6d4444302 ("perf_counter: avoid recursion") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201030151955.120572175@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18genirq: Let GENERIC_IRQ_IPI select IRQ_DOMAIN_HIERARCHYMarc Zyngier
[ Upstream commit 151a535171be6ff824a0a3875553ea38570f4c05 ] kernel/irq/ipi.c otherwise fails to compile if nothing else selects it. Fixes: 379b656446a3 ("genirq: Add GENERIC_IRQ_IPI Kconfig symbol") Reported-by: Pavel Machek <pavel@ucw.cz> Tested-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20201015101222.GA32747@amd Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-18ring-buffer: Fix recursion protection transitions between interrupt contextSteven Rostedt (VMware)
[ Upstream commit b02414c8f045ab3b9afc816c3735bc98c5c3d262 ] The recursion protection of the ring buffer depends on preempt_count() to be correct. But it is possible that the ring buffer gets called after an interrupt comes in but before it updates the preempt_count(). This will trigger a false positive in the recursion code. Use the same trick from the ftrace function callback recursion code which uses a "transition" bit that gets set, to allow for a single recursion for to handle transitions between contexts. Cc: stable@vger.kernel.org Fixes: 567cd4da54ff4 ("ring-buffer: User context bit recursion checking") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-10fork: fix copy_process(CLONE_PARENT) race with the exiting ->real_parentEddy Wu
commit b4e00444cab4c3f3fec876dc0cccc8cbb0d1a948 upstream. current->group_leader->exit_signal may change during copy_process() if current->real_parent exits. Move the assignment inside tasklist_lock to avoid the race. Signed-off-by: Eddy Wu <eddy_wu@trendmicro.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10tracing: Fix out of bounds write in get_trace_bufQiujun Huang
commit c1acb4ac1a892cf08d27efcb964ad281728b0545 upstream. The nesting count of trace_printk allows for 4 levels of nesting. The nesting counter starts at zero and is incremented before being used to retrieve the current context's buffer. But the index to the buffer uses the nesting counter after it was incremented, and not its original number, which in needs to do. Link: https://lkml.kernel.org/r/20201029161905.4269-1-hqjagain@gmail.com Cc: stable@vger.kernel.org Fixes: 3d9622c12c887 ("tracing: Add barrier to trace_printk() buffer nesting modification") Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10ftrace: Handle tracing when switching between contextSteven Rostedt (VMware)
commit 726b3d3f141fba6f841d715fc4d8a4a84f02c02a upstream. When an interrupt or NMI comes in and switches the context, there's a delay from when the preempt_count() shows the update. As the preempt_count() is used to detect recursion having each context have its own bit get set when tracing starts, and if that bit is already set, it is considered a recursion and the function exits. But if this happens in that section where context has changed but preempt_count() has not been updated, this will be incorrectly flagged as a recursion. To handle this case, create another bit call TRANSITION and test it if the current context bit is already set. Flag the call as a recursion if the TRANSITION bit is already set, and if not, set it and continue. The TRANSITION bit will be cleared normally on the return of the function that set it, or if the current context bit is clear, set it and clear the TRANSITION bit to allow for another transition between the current context and an even higher one. Cc: stable@vger.kernel.org Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10ftrace: Fix recursion check for NMI testSteven Rostedt (VMware)
commit ee11b93f95eabdf8198edd4668bf9102e7248270 upstream. The code that checks recursion will work to only do the recursion check once if there's nested checks. The top one will do the check, the other nested checks will see recursion was already checked and return zero for its "bit". On the return side, nothing will be done if the "bit" is zero. The problem is that zero is returned for the "good" bit when in NMI context. This will set the bit for NMIs making it look like *all* NMI tracing is recursing, and prevent tracing of anything in NMI context! The simple fix is to return "bit + 1" and subtract that bit on the end to get the real bit. Cc: stable@vger.kernel.org Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>