summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2018-11-12rcu: Adjust the comment of function rcu_is_watchingZhouyi Zhou
Because RCU avoids interrupting idle CPUs, rcu_is_watching() is used to test whether or not it is currently legal to run RCU read-side critical sections on this CPU. However, the first sentence and last sentences of current comment for rcu_is_watching have opposite meaning of what is expected. This commit therefore fixes this header comment. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Add jiffies-since-GP-activity to show_rcu_gp_kthreads()Paul E. McKenney
This commit adds a printout of the number of jiffies since the last time that the RCU grace-period kthread did any processing. This can be useful when tracking down forward-progress issues. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Add state name to show_rcu_gp_kthreads() outputPaul E. McKenney
This commit adds the name of the RCU grace-period state to the show_rcu_gp_kthreads() output in order to ease debugging. This commit also moves gp_state_getname() up in the code so that show_rcu_gp_kthreads() can use it. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Parameterize rcu_check_gp_start_stall()Paul E. McKenney
In order to debug forward-progress stalls, it is necessary to check for excessively delayed grace-period starts. This is currently done for RCU CPU stall warnings by rcu_check_gp_start_stall(), which checks to see if the start of a requested grace period has been delayed by an RCU CPU stall warning period. Because rcutorture will need to check for the time consumed by an RCU forward-progress delay, this commit promotes gpssdelay from a local variable to a formal parameter. It is not necessary to export rcu_check_gp_start_stall() because rcutorture will access it via a wrapper function. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Avoid double multiply by HZPaul E. McKenney
The rcu_check_gp_start_stall() function multiplies the return value from rcu_jiffies_till_stall_check() by HZ, but the units are already in jiffies. This commit therefore avoids the need for introduction of a jiffies-squared unit by removing the extraneous multiplication. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Eliminate BUG_ON() for kernel/rcu/update.cPaul E. McKenney
The update.c file has a number of calls to BUG_ON(), which panics the kernel, which is not a good strategy for devices (like embedded) that don't have a way to capture console output. This commit therefore converts these BUG_ON() calls to WARN_ON_ONCE() and WARN_ONCE(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-12rcu: Eliminate BUG_ON() for kernel/rcu/tree_plugin.hPaul E. McKenney
The tree_plugin.h file has a number of calls to BUG_ON(), which panics the kernel, which is not a good strategy for devices (like embedded) that don't have a way to capture console output. This commit therefore converts these BUG_ON() calls to WARN_ON_ONCE() and WARN_ONCE(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> [ paulmck: Fix typo: s/rcuo/rcub/. ]
2018-11-12audit: Use 'mark' name for fsnotify_mark variablesJan Kara
Variables pointing to fsnotify_mark are sometimes called 'entry' and sometimes 'mark'. Use 'mark' in all places. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> [PM: minor merge fuzz due to updated patches previously in the series] Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Replace chunk attached to mark instead of replacing markJan Kara
Audit tree code currently associates new fsnotify mark with each new chunk. As chunk attached to an inode is replaced when new tag is added / removed, we also need to remove old fsnotify mark and add a new one on such occasion. This is cumbersome and makes locking rules somewhat difficult to follow. Fix these problems by allocating fsnotify mark independently of chunk and keeping it all the time while there is some chunk attached to an inode. Also add documentation about the locking rules so that things are easier to follow. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> [PM: minor merge fuzz due to updated patches previously in the series] Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Simplify locking around untag_chunk()Jan Kara
untag_chunk() has to be called with hash_lock, it drops it and reacquires it when returning. The unlocking of hash_lock is thus hidden from the callers of untag_chunk() with is rather error prone. Reorganize the code so that untag_chunk() is called without hash_lock, only with mark reference preventing the chunk from going away. Since this requires some more code in the caller of untag_chunk() to assure forward progress, factor out loop pruning tree from all chunks into a common helper function. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Drop all unused chunk nodes during deletionJan Kara
When deleting chunk from a tree, drop all unused nodes in a chunk instead of just the one used by the tree. This gets rid of possibly lingering unused nodes (created due to fallback path in untag_chunk()) and also removes some special cases and will allow us to simplify locking in untag_chunk(). Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Guarantee forward progress of chunk untaggingJan Kara
When removing chunk from a tree, we do shrink the chunk. This can fail for various reasons (due to races, ENOMEM, etc.) and in some cases we just bail from untag_chunk() relying on someone else to cleanup. Although this currently works, later we will need to add new failure situation which would break. Also this simplifies the code and will allow us to make locking around untag_chunk() less awkward. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Allocate fsnotify mark independently of chunkJan Kara
Allocate fsnotify mark independently instead of embedding it inside chunk. This will allow us to just replace chunk attached to mark when growing / shrinking chunk instead of replacing mark attached to inode which is a more complex operation. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Provide helper for dropping mark's chunk referenceJan Kara
Provide a helper function audit_mark_put_chunk() for dropping mark's reference (which has to happen only after RCU grace period expires). Currently that happens only from a single place but in later patches we introduce more callers. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Remove pointless check in insert_hash()Jan Kara
The audit_tree_group->mark_mutex is held all the time while we create the fsnotify mark, add it to the inode, and insert chunk into the hash. Hence mark cannot get detached during this time and so the check whether the mark is attached in insert_hash() is pointless. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Factor out chunk replacement codeJan Kara
Chunk replacement code is very similar for the cases where we grow or shrink chunk. Factor the code out into a common helper function. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Make hash table insertion safe against concurrent lookupsJan Kara
Currently, the audit tree code does not make sure that when a chunk is inserted into the hash table, it is fully initialized. So in theory a user of RCU lookup could see uninitialized structure in the hash table and crash. Add appropriate barriers between initialization of the structure and its insertion into hash table. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Embed key into chunkJan Kara
Currently chunk hash key (which is in fact pointer to the inode) is derived as chunk->mark.conn->obj. It is tricky to make this dereference reliable for hash table lookups only under RCU as mark can get detached from the connector and connector gets freed independently of the running lookup. Thus there is a possible use after free / NULL ptr dereference issue: CPU1 CPU2 untag_chunk() ... audit_tree_lookup() list_for_each_entry_rcu(p, list, hash) { list_del_rcu(&chunk->hash); fsnotify_destroy_mark(entry); fsnotify_put_mark(entry) chunk_to_key(p) if (!chunk->mark.connector) ... hlist_del_init_rcu(&mark->obj_list); if (hlist_empty(&conn->list)) { inode = fsnotify_detach_connector_from_object(conn); mark->connector = NULL; ... frees connector from workqueue chunk->mark.connector->obj This race is probably impossible to hit in practice as the race window on CPU1 is very narrow and CPU2 has a lot of code to execute. Still it's better to have this fixed. Since the inode the chunk is attached to is constant during chunk's lifetime it is easy to cache the key in the chunk itself and thus avoid these issues. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Fix possible tagging failuresJan Kara
Audit tree code is replacing marks attached to inodes in non-atomic way. Thus fsnotify_find_mark() in tag_chunk() may find a mark that belongs to a chunk that is no longer valid one and will soon be destroyed. Tags added to such chunk will be simply lost. Fix the problem by making sure old mark is marked as going away (through fsnotify_detach_mark()) before dropping mark_mutex and thus in an atomic way wrt tag_chunk(). Note that this does not fix the problem completely as if tag_chunk() finds a mark that is going away, it fails with -ENOENT. But at least the failure is not silent and currently there's no way to search for another fsnotify mark attached to the inode. We'll fix this problem in later patch. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit: Fix possible spurious -ENOSPC errorJan Kara
When an inode is tagged with a tree, tag_chunk() checks whether there is audit_tree_group mark attached to the inode and adds one if not. However nothing protects another tag_chunk() to add the mark between we've checked and try to add the fsnotify mark thus resulting in an error from fsnotify_add_mark() and consequently an ENOSPC error from tag_chunk(). Fix the problem by holding mark_mutex over the whole check-insert code sequence. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12audit_tree: Remove mark->lock lockingJan Kara
Currently, audit_tree code uses mark->lock to protect against detaching of mark from an inode. In most places it however also uses mark->group->mark_mutex (as we need to atomically replace attached marks) and this provides protection against mark detaching as well. So just remove protection with mark->lock from audit tree code and replace it with mark->group->mark_mutex protection in all the places. It simplifies the code and gets rid of some ugly catches like calling fsnotify_add_mark_locked() with mark->lock held (which cannot sleep only because we hold a reference to another mark attached to the same inode). Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-11-12sched/core: Clean up the #ifdef block in add_nr_running()Viresh Kumar
There is no point in keeping the conditional statement of the #if block outside of the #ifdef block, while all of its body is contained within the #ifdef block. Move the conditional statement under the #ifdef block as well. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/78cbd78a615d6f9fdcd3327f1ead68470f92593e.1541482935.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12sched/fair: Make some variables staticMuchun Song
The variables are local to the source and do not need to be in global scope, so make them static. Signed-off-by: Muchun Song <smuchun@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181110075202.61172-1-smuchun@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12sched/core: Create task_has_idle_policy() helperViresh Kumar
We already have task_has_rt_policy() and task_has_dl_policy() helpers, create task_has_idle_policy() as well and update sched core to start using it. While at it, use task_has_dl_policy() at one more place. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/ce3915d5b490fc81af926a3b6bfb775e7188e005.1541416894.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12sched/fair: Add lsub_positive() and use it consistentlyPatrick Bellasi
The following pattern: var -= min_t(typeof(var), var, val); is used multiple times in fair.c. The existing sub_positive() already captures that pattern, but it also adds an explicit load-store to properly support lockless observations. In other cases the pattern above is used to update local, and/or not concurrently accessed, variables. Let's add a simpler version of sub_positive(), targeted at local variables updates, which gives the same readability benefits at calling sites, without enforcing {READ,WRITE}_ONCE() barriers. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/lkml/20181031184527.GA3178@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12sched/fair: Mask UTIL_AVG_UNCHANGED usagesPatrick Bellasi
The _task_util_est() is mainly used to add/remove the task contribution to/from the rq's estimated utilization at task enqueue/dequeue time. In both cases we ensure the UTIL_AVG_UNCHANGED flag is set to keep consistency between enqueue and dequeue time while still being transparent to update_load_avg calls which will eventually reset the flag. Let's move the flag forcing within _task_util_est() itself so that we can simplify calling code by hiding that estimated utilization implementation detail into one of its internal functions. This will affect also the "public" API task_util_est() but we know that the flag will (eventually) impact just on the LSB of the estimated utilization, thus it's certainly acceptable. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20181105145400.935-3-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12Merge branch 'sched/urgent' into sched/core, to pick up dependent fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-12sched/fair: Fix cpu_util_wake() for 'execl' type workloadsPatrick Bellasi
A ~10% regression has been reported for UnixBench's execl throughput test by Aaron Lu and Ye Xiaolong: https://lkml.org/lkml/2018/10/30/765 That test is pretty simple, it does a "recursive" execve() syscall on the same binary. Starting from the syscall, this sequence is possible: do_execve() do_execveat_common() __do_execve_file() sched_exec() select_task_rq_fair() <==| Task already enqueued find_idlest_cpu() find_idlest_group() capacity_spare_wake() <==| Functions not called from cpu_util_wake() | the wakeup path which means we can end up calling cpu_util_wake() not only from the "wakeup path", as its name would suggest. Indeed, the task doing an execve() syscall is already enqueued on the CPU we want to get the cpu_util_wake() for. The estimated utilization for a CPU computed in cpu_util_wake() was written under the assumption that function can be called only from the wakeup path. If instead the task is already enqueued, we end up with a utilization which does not remove the current task's contribution from the estimated utilization of the CPU. This will wrongly assume a reduced spare capacity on the current CPU and increase the chances to migrate the task on execve. The regression is tracked down to: commit d519329f72a6 ("sched/fair: Update util_est only on util_avg updates") because in that patch we turn on by default the UTIL_EST sched feature. However, the real issue is introduced by: commit f9be3e5961c5 ("sched/fair: Use util_est in LB and WU paths") Let's fix this by ensuring to always discount the task estimated utilization from the CPU's estimated utilization when the task is also the current one. The same benchmark of the bug report, executed on a dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine, reports these "Execl Throughput" figures (higher the better): mainline : 48136.5 lps mainline+fix : 55376.5 lps which correspond to a 15% speedup. Moreover, since {cpu_util,capacity_spare}_wake() are not really only used from the wakeup path, let's remove this ambiguity by using a better matching name: {cpu_util,capacity_spare}_without(). Since we are at that, let's also improve the existing documentation. Reported-by: Aaron Lu <aaron.lu@intel.com> Reported-by: Ye Xiaolong <xiaolong.ye@intel.com> Tested-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Fixes: f9be3e5961c5 (sched/fair: Use util_est in LB and WU paths) Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/ Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-11Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "Just the removal of a redundant call into the sched deadline overrun check" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Remove useless call to check_dl_overrun()
2018-11-11Merge branch 'sched/urgent' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "Two small scheduler fixes: - Take hotplug lock in sched_init_smp(). Technically not really required, but lockdep will complain other. - Trivial comment fix in sched/fair" * 'sched/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix a comment in task_numa_fault() sched/core: Take the hotplug lock in sched_init_smp()
2018-11-11Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core fixes from Thomas Gleixner: "A couple of fixlets for the core: - Kernel doc function documentation fixes - Missing prototypes for weak watchdog functions" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: resource/docs: Complete kernel-doc style function documentation watchdog/core: Add missing prototypes for weak functions resource/docs: Fix new kernel-doc warnings
2018-11-11rcu: Stop expedited grace periods from relying on stop-machinePaul E. McKenney
The CPU-selection code in sync_rcu_exp_select_cpus() disables preemption to prevent the cpu_online_mask from changing. However, this relies on the stop-machine mechanism in the CPU-hotplug offline code, which is not desirable (it would be good to someday remove the stop-machine mechanism). This commit therefore instead uses the relevant leaf rcu_node structure's ->ffmask, which has a bit set for all CPUs that are fully functional. A given CPU's bit is cleared very early during offline processing by rcutree_offline_cpu() and set very late during online processing by rcutree_online_cpu(). Therefore, if a CPU's bit is set in this mask, and preemption is disabled, we have to be before the synchronize_sched() in the CPU-hotplug offline code, which means that the CPU is guaranteed to be workqueue-ready throughout the duration of the enclosing preempt_disable() region of code. This also has the side-effect of using WORK_CPU_UNBOUND if all the CPUs for this leaf rcu_node structure are offline, which is an acceptable difference in behavior. Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-11-10bpf: Allow narrow loads with offset > 0Andrey Ignatov
Currently BPF verifier allows narrow loads for a context field only with offset zero. E.g. if there is a __u32 field then only the following loads are permitted: * off=0, size=1 (narrow); * off=0, size=2 (narrow); * off=0, size=4 (full). On the other hand LLVM can generate a load with offset different than zero that make sense from program logic point of view, but verifier doesn't accept it. E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code: #define DST_IP4 0xC0A801FEU /* 192.168.1.254 */ ... if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) && where ctx is struct bpf_sock_addr. Some versions of LLVM can produce the following byte code for it: 8: 71 12 07 00 00 00 00 00 r2 = *(u8 *)(r1 + 7) 9: 67 02 00 00 18 00 00 00 r2 <<= 24 10: 18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00 r3 = 4261412864 ll 12: 5d 32 07 00 00 00 00 00 if r2 != r3 goto +7 <LBB0_6> where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1 and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently rejected by verifier. Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok() what means any is_valid_access implementation, that uses the function, works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or sock_addr_is_valid_access() for bpf_sock_addr. The patch makes such loads supported. Offset can be in [0; size_default) but has to be multiple of load size. E.g. for __u32 field the following loads are supported now: * off=0, size=1 (narrow); * off=1, size=1 (narrow); * off=2, size=1 (narrow); * off=3, size=1 (narrow); * off=0, size=2 (narrow); * off=2, size=2 (narrow); * off=0, size=4 (full). Reported-by: Yonghong Song <yhs@fb.com> Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10bpf: do not pass netdev to translate() and prepare() offload callbacksQuentin Monnet
The kernel functions to prepare verifier and translate for offloaded program retrieve "offload" from "prog", and "netdev" from "offload". Then both "prog" and "netdev" are passed to the callbacks. Simplify this by letting the drivers retrieve the net device themselves from the offload object attached to prog - if they need it at all. There is currently no need to pass the netdev as an argument to those functions. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10bpf: pass prog instead of env to bpf_prog_offload_verifier_prep()Quentin Monnet
Function bpf_prog_offload_verifier_prep(), called from the kernel BPF verifier to run a driver-specific callback for preparing for the verification step for offloaded programs, takes a pointer to a struct bpf_verifier_env object. However, no driver callback needs the whole structure at this time: the two drivers supporting this, nfp and netdevsim, only need a pointer to the struct bpf_prog instance held by env. Update the callback accordingly, on kernel side and in these two drivers. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10bpf: pass destroy() as a callback and remove its ndo_bpf subcommandQuentin Monnet
As part of the transition from ndo_bpf() to callbacks attached to struct bpf_offload_dev for some of the eBPF offload operations, move the functions related to program destruction to the struct and remove the subcommand that was used to call them through the NDO. Remove function __bpf_offload_ndo(), which is no longer used. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10bpf: pass translate() as a callback and remove its ndo_bpf subcommandQuentin Monnet
As part of the transition from ndo_bpf() to callbacks attached to struct bpf_offload_dev for some of the eBPF offload operations, move the functions related to code translation to the struct and remove the subcommand that was used to call them through the NDO. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10bpf: call verifier_prep from its callback in struct bpf_offload_devQuentin Monnet
In a way similar to the change previously brought to the verify_insn hook and to the finalize callback, switch to the newly added ops in struct bpf_prog_offload for calling the functions used to prepare driver verifiers. Since the dev_ops pointer in struct bpf_prog_offload is no longer used by any callback, we can now remove it from struct bpf_prog_offload. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10bpf: call finalize() from its callback in struct bpf_offload_devQuentin Monnet
In a way similar to the change previously brought to the verify_insn hook, switch to the newly added ops in struct bpf_prog_offload for calling the functions used to perform final verification steps for offloaded programs. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10bpf: call verify_insn from its callback in struct bpf_offload_devQuentin Monnet
We intend to remove the dev_ops in struct bpf_prog_offload, and to only keep the ops in struct bpf_offload_dev instead, which is accessible from more locations for passing function pointers. But dev_ops is used for calling the verify_insn hook. Switch to the newly added ops in struct bpf_prog_offload instead. To avoid table lookups for each eBPF instruction to verify, we remember the offdev attached to a netdev and modify bpf_offload_find_netdev() to avoid performing more than once a lookup for a given offload object. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10bpf: pass a struct with offload callbacks to bpf_offload_dev_create()Quentin Monnet
For passing device functions for offloaded eBPF programs, there used to be no place where to store the pointer without making the non-offloaded programs pay a memory price. As a consequence, three functions were called with ndo_bpf() through specific commands. Now that we have struct bpf_offload_dev, and since none of those operations rely on RTNL, we can turn these three commands into hooks inside the struct bpf_prog_offload_ops, and pass them as part of bpf_offload_dev_create(). This commit effectively passes a pointer to the struct to bpf_offload_dev_create(). We temporarily have two struct bpf_prog_offload_ops instances, one under offdev->ops and one under offload->dev_ops. The next patches will make the transition towards the former, so that offload->dev_ops can be removed, and callbacks relying on ndo_bpf() added to offdev->ops as well. While at it, rename "nfp_bpf_analyzer_ops" as "nfp_bpf_dev_ops" (and similarly for netdevsim). Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-11-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace fixes from Eric Biederman: "I believe all of these are simple obviously correct bug fixes. These fall into two groups: - Fixing the implementation of MNT_LOCKED which prevents lesser privileged users from seeing unders mounts created by more privileged users. - Fixing the extended uid and group mapping in user namespaces. As well as ensuring the code looks correct I have spot tested these changes as well and in my testing the fixes are working" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: mount: Prevent MNT_DETACH from disconnecting locked mounts mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts mount: Retest MNT_LOCKED in do_umount userns: also map extents in the reverse map to kernel IDs
2018-11-09bpf: let verifier to calculate and record max_pkt_offsetJiong Wang
In check_packet_access, update max_pkt_offset after the offset has passed __check_packet_access. It should be safe to use u32 for max_pkt_offset as explained in code comment. Also, when there is tail call, the max_pkt_offset of the called program is unknown, so conservatively set max_pkt_offset to MAX_PACKET_OFF for such case. Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-11-08srcu: Prevent __call_srcu() counter wrap with read-side critical sectionPaul E. McKenney
Ever since cdf7abc4610a ("srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context"), it has been permissible to use SRCU read-side critical sections in interrupt context. This allows __call_srcu() to use SRCU read-side critical sections to prevent a new SRCU grace period from ending before the call to either srcu_funnel_gp_start() or srcu_funnel_exp_start completes, thus preventing SRCU grace-period counter overflow during that time. Note that this does not permit removal of the counter-wrap checks in srcu_gp_end(). These check are necessary to handle the case where a given CPU does not interact at all with SRCU for an extended time period. This commit therefore adds an SRCU read-side critical section to __call_srcu() in order to prevent grace period counter wrap during the funnel-locking process. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-11-08sched/membarrier: Replace synchronize_sched() with synchronize_rcu()Paul E. McKenney
Now that synchronize_rcu() waits for preempt-disable regions of code as well as RCU read-side critical sections, the synchronize_sched() in sys_membarrier() can be replaced by synchronize_rcu(). This commit therefore makes this change. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
2018-11-08rcu: Consolidate the RCU update functions invoked by sync.cPaul E. McKenney
This commit retains all the various gp_ops[] entries, but makes their update functions all be synchronize_rcu(), call_rcu() and rcu_barrier(). The read-side checks remain consistent with the various RCU flavors, which still exist on the read side. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
2018-11-08rcu: Eliminate synchronize_rcu_mult()Paul E. McKenney
Now that synchronize_rcu() waits for both RCU read-side critical sections and preempt-disabled regions of code, the sole caller of synchronize_rcu_mult() can be replaced by synchronize_rcu(). This patch makes this change and removes synchronize_rcu_mult(). Note that _wait_rcu_gp() still supports synchronize_rcu_mult(), and thus might be simplified in the future to take only take a single call_rcu() function rather than the current list of them. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2018-11-08rcu: Fix rcu_{node,data} comments about gp_seq_neededJoel Fernandes (Google)
Recent changes have removed the old ->gp_seq_needed field from the rcu_state structure, which in turn obsoleted a couple of comments in the rcu_node and rcu_data structures. This commit therefore updates these comments accordingly. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: <kernel-team@android.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-08rcu: Remove unused rcu_state externsJoel Fernandes (Google)
The rcu_bh_state and rcu_sched_state variables were removed during the RCU flavor consolidations, but external declarations remain in tree.h. This commit therefore removes these obsolete declarations. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: <kernel-team@android.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2018-11-08rcu: Eliminate BUG_ON() for kernel/rcu/tree.cPaul E. McKenney
The tree.c file has a number of calls to BUG_ON(), which panics the kernel, which is not a good strategy for devices (like embedded) that don't have a way to capture console output. This commit therefore converts these BUG_ON() calls to WARN_ON_ONCE() and WARN_ONCE(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>