path: root/kernel/bpf
2024-11-12bpf: Support private stack for struct_ops progsYonghong Song
For struct_ops progs, whether a particular prog uses a private stack depends on the prog->aux->priv_stack_requested setting before actual insn-level verification of that prog. One particular implementation is to piggyback on struct_ops->check_member(); the next patch has an example of this. The struct_ops->check_member() callback sets prog->aux->priv_stack_requested to true, which enables private stack usage. The struct_ops prog then follows the same rules as kprobe/tracing progs after bpf_enable_priv_stack(). For example, even if a struct_ops prog requests a private stack, it can still use the normal kernel stack if the stack size is small (< 64 bytes). Similar to tracing progs, a nested run of the same prog on the same cpu will be skipped. A callback field (recursion_detected()) is added to the bpf_prog_aux structure. If bpf_prog->aux->recursion_detected is implemented by the struct_ops subsystem and a nested same-cpu/same-prog run happens, the callback will be triggered to report an error, collect related info, etc. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163933.2224962-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
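A minimal sketch of the check_member() piggyback described above, assuming a hypothetical struct_ops subsystem (the callback signature follows struct bpf_struct_ops; everything named my_ops is illustrative):

    static int my_ops_check_member(const struct btf_type *t,
                                   const struct btf_member *member,
                                   const struct bpf_prog *prog)
    {
            /* Opt this prog into the private stack; the JIT may still
             * fall back to the normal kernel stack when the prog's
             * stack usage is below the 64-byte threshold.
             */
            prog->aux->priv_stack_requested = true;
            return 0;
    }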
2024-11-12bpf: Enable private stack for eligible subprogsYonghong Song
If the private stack is used by any subprog, set that subprog's prog->aux->jits_use_priv_stack to true so the jit can later allocate a private stack for that subprog properly. Also set env->prog->aux->jits_use_priv_stack to true if any subprog uses the private stack. This covers the case of a single main prog (no subprogs) using the private stack, and also the case of later struct_ops progs, where env->prog->aux->jits_use_priv_stack will enable the recursion check if any subprog uses the private stack. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163912.2224007-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12bpf: Find eligible subprogs for private stack supportYonghong Song
The private stack will be allocated with the percpu allocator at jit time. To avoid complexity at runtime, only one copy of the private stack is available per cpu per prog, so a runtime recursion check is necessary to avoid stack corruption. The current private stack only supports kprobe/perf_event/tp/raw_tp, which have recursion checks in the kernel, and prog types that use the bpf trampoline recursion check. For trampoline-related prog types, currently only tracing progs have recursion checking. To avoid complexity, all async_cb subprogs use the normal kernel stack, including those subprogs used by both the main prog subtree and the async_cb subtree. Any prog having a tail call also uses the kernel stack. To avoid a jit penalty with private stack support, a subprog stack size threshold is set such that the private stack is used only if the stack size is no less than the threshold. The current threshold is 64 bytes. This avoids a jit penalty if the stack usage is small. A useless 'continue' is also removed from a loop in check_max_stack_depth(). Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163907.2223839-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-11bpf: Drop special callback reference handlingKumar Kartikeya Dwivedi
Logic to prevent callbacks from acquiring new references for the program (i.e. leaving acquired references), and releasing caller references (i.e. those acquired in parent frames) was introduced in commit 9d9d00ac29d0 ("bpf: Fix reference state management for synchronous callbacks"). This was necessary because back then, the verifier simulated each callback once (that could potentially be executed N times, where N can be zero). This meant that callbacks that left lingering resources or cleared caller resources could do it more than once, operating on undefined state or leaking memory. With the fixes to callback verification in commit ab5cfac139ab ("bpf: verify callbacks as if they are called unknown number of times"), all of this extra logic is no longer necessary. Hence, drop it as part of this commit. Cc: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241109231430.2475236-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11bpf: Refactor active lock managementKumar Kartikeya Dwivedi
When bpf_spin_lock was introduced originally, there was deliberation on whether to use an array of lock IDs, but since bpf_spin_lock is limited to holding a single lock at any given time, we've been using a single ID to identify the held lock. In preparation for introducing spin locks that can be taken multiple times, introduce support for acquiring multiple lock IDs. For this purpose, reuse the acquired_refs array and store both lock and pointer references. We tag the entry with REF_TYPE_PTR or REF_TYPE_LOCK to disambiguate and find the relevant entry. The ptr field is used to track the map_ptr or btf (for bpf_obj_new allocations) to ensure locks can be matched with protected fields within the same "allocation", i.e. bpf_obj_new object or map value. The struct active_lock is changed to an int as the state is part of the acquired_refs array, and we only need active_lock as a cheap way of detecting lock presence. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241109231430.2475236-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-11bpf: Call free_htab_elem() after htab_unlock_bucket()Hou Tao
For htab of maps, when the map is removed from the htab, it may hold the last reference of the map. bpf_map_fd_put_ptr() will invoke bpf_map_free_id() to free the id of the removed map element. However, bpf_map_fd_put_ptr() is invoked while holding a bucket lock (raw_spinlock_t), and bpf_map_free_id() attempts to acquire map_idr_lock (spinlock_t), triggering the following lockdep warning:

  =============================
  [ BUG: Invalid wait context ]
  6.11.0-rc4+ #49 Not tainted
  -----------------------------
  test_maps/4881 is trying to lock:
  ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70
  other info that might help us debug this:
  context-{5:5}
  2 locks held by test_maps/4881:
  #0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270
  #1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80
  stack backtrace:
  CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x6e/0xb0
  dump_stack+0x10/0x20
  __lock_acquire+0x73e/0x36c0
  lock_acquire+0x182/0x450
  _raw_spin_lock_irqsave+0x43/0x70
  bpf_map_free_id.part.0+0x21/0x70
  bpf_map_put+0xcf/0x110
  bpf_map_fd_put_ptr+0x9a/0xb0
  free_htab_elem+0x69/0xe0
  htab_map_update_elem+0x50f/0xa80
  bpf_fd_htab_map_update_elem+0x131/0x270
  htab_map_update_elem+0x50f/0xa80
  bpf_fd_htab_map_update_elem+0x131/0x270
  bpf_map_update_value+0x266/0x380
  __sys_bpf+0x21bb/0x36b0
  __x64_sys_bpf+0x45/0x60
  x64_sys_call+0x1b2a/0x20d0
  do_syscall_64+0x5d/0x100
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

One way to fix the lockdep warning is to use raw_spinlock_t for map_idr_lock as well. However, bpf_map_alloc_id() invokes idr_alloc_cyclic() after acquiring map_idr_lock, so it would trigger a similar lockdep warning because the slab's lock (s->cpu_slab->lock) is still a spinlock. Instead of changing map_idr_lock's type, fix the issue by invoking htab_put_fd_value() after htab_unlock_bucket(). However, only deferring the invocation of htab_put_fd_value() is not enough, because the old map pointers in htab of maps can not be saved during batched deletion. Therefore, also defer the invocation of free_htab_elem(), so these to-be-freed elements can be linked together, similar to the lru map. There are four callers of ->map_fd_put_ptr:

(1) alloc_htab_elem() (through htab_put_fd_value()) It invokes ->map_fd_put_ptr() under a raw_spinlock_t. The invocation of htab_put_fd_value() can not simply be moved after htab_unlock_bucket(), because the old element has already been stashed in htab->extra_elems. It may be reused immediately after htab_unlock_bucket(), and invoking htab_put_fd_value() after htab_unlock_bucket() may release the newly-added element incorrectly. Therefore, save the map pointer of the old element for htab of maps before unlocking the bucket, and release the map_ptr after unlock. Besides the map pointer in the old element, the same needs to be done for the special fields in the old element as well.

(2) free_htab_elem() (through htab_put_fd_value()) Its callers include __htab_map_lookup_and_delete_elem(), htab_map_delete_elem() and __htab_map_lookup_and_delete_batch(). For htab_map_delete_elem(), simply invoke free_htab_elem() after htab_unlock_bucket(). For __htab_map_lookup_and_delete_batch(), just like the lru map, link the to-be-freed elements into a node_to_free list and invoke free_htab_elem() for these elements after unlock.
It is safe to reuse batch_flink as the link for node_to_free, because these elements have been removed from the hash llist. Because htab of maps doesn't support the lookup_and_delete operation, __htab_map_lookup_and_delete_elem() doesn't have the problem, so it is kept as is.

(3) fd_htab_map_free() It invokes ->map_fd_put_ptr() without holding a raw_spinlock_t.

(4) bpf_fd_htab_map_update_elem() It invokes ->map_fd_put_ptr() without holding a raw_spinlock_t.

After moving free_htab_elem() outside the htab bucket lock scope, use pcpu_freelist_push() instead of __pcpu_freelist_push() to disable irqs before freeing elements, and protect the invocations of bpf_mem_cache_free() with a migrate_{disable|enable} pair. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241106063542.357743-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
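The resulting shape of the deletion path, as a hedged sketch (helper names mirror kernel/bpf/hashtab.c, details simplified):

    ret = htab_lock_bucket(htab, b, hash, &flags);
    if (ret)
            return ret;
    l = lookup_elem_raw(head, hash, key, key_size);
    if (l)
            hlist_nulls_del_rcu(&l->hash_node);  /* unlink under the bucket lock */
    else
            ret = -ENOENT;
    htab_unlock_bucket(htab, b, hash, flags);
    if (l)
            free_htab_elem(htab, l);             /* free only after the raw lock is dropped */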
2024-11-11bpf: Add support for uprobe multi session attachJiri Olsa
Add support to attach a BPF program to both the entry and return probe of the same function. This is a common use case which at the moment requires creating two uprobe multi links. Add a new BPF_TRACE_UPROBE_SESSION attach type that instructs the kernel to attach a single link program to both the entry and exit probe. It's possible to control execution of the BPF program on the return probe simply by the return value of the entry BPF program: returning zero runs the BPF program on the return probe, returning non-zero skips it. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-4-jolsa@kernel.org
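A hedged sketch of a session program (the SEC() name and the bpf_session_is_return() kfunc follow the accompanying libbpf/kprobe-session work; the binary path and probed function are illustrative):

    extern bool bpf_session_is_return(void) __ksym;

    SEC("uprobe.session//proc/self/exe:probed_func")
    int handle_session(struct pt_regs *ctx)
    {
            if (bpf_session_is_return()) {
                    /* return probe: only runs if the entry run returned 0 */
                    return 0;
            }
            /* entry probe: return 0 to also run on return, non-zero to skip it */
            return 0;
    }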
2024-11-11bpf: Allow return values 0 and 1 for kprobe sessionJiri Olsa
The kprobe session program can return only 0 or 1, so instruct the verifier to check for that. Fixes: 535a3692ba72 ("bpf: Add support for kprobe session attach") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241108134544.480660-2-jolsa@kernel.org
2024-11-04bpf: Mark raw_tp arguments with PTR_MAYBE_NULLKumar Kartikeya Dwivedi
Arguments to a raw tracepoint are tagged as trusted, which carries the semantics that the pointer will be non-NULL. However, in certain cases, a raw tracepoint argument may end up being NULL. More context about this issue is available in [0]. Thus, there is a discrepancy between the reality, that raw_tp arguments can actually be NULL, and the verifier's knowledge, that they are never NULL, causing explicit NULL checks to be deleted, and accesses to such pointers potentially crashing the kernel. To fix this, mark raw_tp arguments as PTR_MAYBE_NULL, and then special case the dereference and pointer arithmetic to permit it, and allow passing them into helpers/kfuncs; these exceptions are made for raw_tp programs only. Ensure that we don't do this when ref_obj_id > 0, as in that case this is an acquired object and doesn't need such adjustment. The reason we do the mask_raw_tp_trusted_reg logic is that other places will recheck whether the register is a trusted_reg, and would then consider our register as untrusted when detecting the presence of the PTR_MAYBE_NULL flag. To allow safe dereference, we enable PROBE_MEM marking when we see loads into trusted pointers with PTR_MAYBE_NULL. While trusted raw_tp arguments can also be passed into helpers or kfuncs where such a broken assumption may cause issues, a future patch set will tackle their case separately, as PTR_TO_BTF_ID (without PTR_TRUSTED) can already be passed into helpers and causes similar problems. Thus, they are left alone for now. It is possible that these checks also permit passing non-raw_tp args that are trusted PTR_TO_BTF_ID with null marking. In such a case, allowing dereference when the pointer is NULL expands the allowed behavior, so it won't regress existing programs, and the case of passing these into helpers is the same as above and will be dealt with later. Also update the failure case in the tp_btf_nullable selftest to capture the new behavior, as the verifier will no longer cause an error when directly dereferencing a raw tracepoint argument marked as __nullable. [0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb Reviewed-by: Jiri Olsa <jolsa@kernel.org> Reported-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241104171959.2938862-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
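Loosely based on the tp_btf_nullable selftest referenced above (the tracepoint and context type come from the selftest's bpf_testmod and are shown only as an illustration):

    SEC("tp_btf/bpf_testmod_test_nullable_bare")
    int BPF_PROG(handle_tp_btf_nullable_bare,
                 struct bpf_testmod_test_read_ctx *nullable_ctx)
    {
            /* The arg is now PTR_MAYBE_NULL and the load is emitted as
             * PROBE_MEM, so a NULL dereference faults safely instead of
             * crashing; an explicit NULL check is still good practice.
             */
            return nullable_ctx ? nullable_ctx->len : 0;
    }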
2024-11-03bpf: Unify resource leak checksKumar Kartikeya Dwivedi
There are similar checks for covering locks, references, RCU read sections and preempt_disable sections in 3 places in the verifier, i.e. for tail calls, bpf_ld_[abs, ind], and the exit path (for BPF_EXIT and bpf_throw). Unify all of these into a common check_resource_leak function to avoid code duplication. Also update the error strings in selftests to the new ones in the same change to ensure clean bisection. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241103225940.1408302-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-03bpf: Tighten tail call checks for lingering locks, RCU, preempt_disableKumar Kartikeya Dwivedi
There are three situations when a program logically exits and transfers control to the kernel or another program: bpf_throw, BPF_EXIT, and tail calls. The former two check for any lingering locks and references, but tail calls currently do not. Expand the checks to cover spin locks, RCU read sections and preempt-disabled sections. Spin locks indirectly prevent tail calls, as function calls are disallowed while a lock is held, but the checks for preemption and RCU are more relaxed, hence ensure tail calls are prevented in their presence. Fixes: 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()") Fixes: fc7566ad0a82 ("bpf: Introduce bpf_preempt_[disable,enable] kfuncs") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241103225940.1408302-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-01bpf: Add open coded version of kmem_cache iteratorNamhyung Kim
Add a new open coded iterator for kmem_cache which can be called from a BPF program like below. It doesn't take any argument and traverses all kmem_cache entries. struct kmem_cache *pos; bpf_for_each(kmem_cache, pos) { ... } As it needs to grab slab_mutex, it should be called from sleepable BPF programs only. Also update the existing iterator code to use the open coded version internally as suggested by Andrii. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241030222819.1800667-1-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
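A slightly fuller usage sketch (SEC("syscall") progs are loaded as sleepable by libbpf, satisfying the slab_mutex requirement; the printed kmem_cache fields are assumptions):

    SEC("syscall")
    int dump_kmem_caches(void *ctx)
    {
            struct kmem_cache *pos;

            bpf_for_each(kmem_cache, pos) {
                    /* pos is valid (and non-NULL) inside the loop body */
                    bpf_printk("%s: object_size=%u", pos->name, pos->object_size);
            }
            return 0;
    }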
2024-10-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov
Cross-merge bpf fixes after downstream PR. No conflicts. Adjacent changes in: include/linux/bpf.h include/uapi/linux/bpf.h kernel/bpf/btf.c kernel/bpf/helpers.c kernel/bpf/syscall.c kernel/bpf/verifier.c kernel/trace/bpf_trace.c mm/slab_common.c tools/include/uapi/linux/bpf.h tools/testing/selftests/bpf/Makefile Link: https://lore.kernel.org/all/20241024215724.60017-1-daniel@iogearbox.net/ Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Add uptr support in the map_value of the task local storage.Martin KaFai Lau
This patch adds uptr support in the map_value of the task local storage:

    struct map_value {
            struct user_data __uptr *uptr;
    };

    struct {
            __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __type(key, int);
            __type(value, struct map_value);
    } datamap SEC(".maps");

A new bpf_obj_pin_uptrs() is added to pin the user page and also store the kernel address back into the uptr for the bpf prog to use later. It currently does not support a uptr pointing to a user struct across two pages. It also excludes PageHighMem support to keep it simple. As of now, the 32bit bpf jit is missing other more crucial bpf features. For example, many important bpf features depend on bpf kfunc now, but so far only one arch (x86-32) supports it, which was added by me as an example when kfunc was first introduced to bpf. The uptr can only be stored to the task local storage by the syscall update_elem. Meaning, the uptr will not be considered if it is provided by the bpf prog through bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE). This is enforced by only calling bpf_local_storage_update(swap_uptrs==true) in bpf_pid_task_storage_update_elem; everywhere else will have swap_uptrs==false. This propagates down to bpf_selem_alloc(swap_uptrs==true), which is the only case where bpf_selem_alloc() will take the uptr value when updating the newly allocated selem. bpf_obj_swap_uptrs() is added to swap the uptr between SDATA(selem)->data and the user provided map_value in "void *value". bpf_obj_swap_uptrs() makes SDATA(selem)->data take ownership of the uptr, and the user space provided map_value will have NULL in the uptr. bpf_obj_unpin_uptrs() is called after map->ops->map_update_elem() returns an error. If map->ops->map_update_elem() has reached a state where the local storage has taken the uptr ownership, bpf_obj_unpin_uptrs() will be a no-op because the uptr is NULL. A "__"bpf_obj_unpin_uptrs is added to make this error path unpin easier, such that it does not have to check whether map->record is NULL or not. BPF_F_LOCK is not supported when the map_value has a uptr. This can be revisited later if there is a use case; a similar swap_uptrs idea can be considered. The final bit is to do unpin_user_page() in bpf_obj_free_fields(). The earlier patch has ensured that bpf_obj_free_fields() has gone through the rcu gp when needed. Cc: linux-mm@kvack.org Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://lore.kernel.org/r/20241023234759.860539-7-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
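A hedged userspace sketch of the only path that pins uptr pages, the update_elem syscall (map_fd setup elided; task storage is keyed by a pidfd from userspace):

    struct user_data udata = {};
    struct map_value value = { .uptr = &udata };
    int pidfd = syscall(SYS_pidfd_open, getpid(), 0);

    /* The kernel pins the page backing udata and rewrites the stored
     * uptr to a kernel address for the bpf prog side.
     */
    bpf_map_update_elem(map_fd, &pidfd, &value, 0);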
2024-10-24bpf: Postpone bpf_obj_free_fields to the rcu callbackMartin KaFai Lau
A later patch will enable the uptr usage in the task_local_storage map. This will require the unpin_user_page() to be done after the rcu task trace gp for the cases that the uptr may still be used by a bpf prog. The bpf_obj_free_fields() will be the one doing unpin_user_page(), so this patch is to postpone calling bpf_obj_free_fields() to the rcu callback. The bpf_obj_free_fields() is only required to be done in the rcu callback when bpf->bpf_ma==true and reuse_now==false. bpf->bpf_ma==true case is because uptr will only be enabled in task storage which has already been moved to bpf_mem_alloc. The bpf->bpf_ma==false case can be supported in the future also if there is a need. reuse_now==false when the selem (aka storage) is deleted by bpf prog (bpf_task_storage_delete) or by syscall delete_elem(). In both cases, bpf_obj_free_fields() needs to wait for rcu gp. A few words on reuse_now==true. reuse_now==true when the storage's owner (i.e. the task_struct) is destructing or the map itself is doing map_free(). In both cases, no bpf prog should have a hold on the selem and its uptrs, so there is no need to postpone bpf_obj_free_fields(). reuse_now==true should be the common case for local storage usage where the storage exists throughout the lifetime of its owner (task_struct). The bpf_obj_free_fields() needs to use the map->record. Doing bpf_obj_free_fields() in a rcu callback will require the bpf_local_storage_map_free() to wait for rcu_barrier. An optimization could be only waiting for rcu_barrier when the map has uptr in its map_value. This will require either yet another rcu callback function or adding a bool in the selem to flag if the SDATA(selem)->smap is still valid. This patch chooses to keep it simple and wait for rcu_barrier for maps that use bpf_mem_alloc. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-6-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Postpone bpf_selem_free() in bpf_selem_unlink_storage_nolock()Martin KaFai Lau
In a later patch, bpf_selem_free() will call unpin_user_page() through bpf_obj_free_fields(). unpin_user_page() may take a spin_lock. However, some bpf_selem_free() call paths have held a raw_spin_lock, like this:

    raw_spin_lock_irqsave()
      bpf_selem_unlink_storage_nolock()
        bpf_selem_free()
          unpin_user_page()
            spin_lock()

To avoid a spinlock nested in a raw_spinlock, bpf_selem_free() should be done after releasing the raw_spinlock. The "bool reuse_now" arg is replaced with "struct hlist_head *free_selem_list" in bpf_selem_unlink_storage_nolock(). bpf_selem_unlink_storage_nolock() will append the to-be-freed selem to the free_selem_list. The caller of bpf_selem_unlink_storage_nolock() will need to call the new bpf_selem_free_list(free_selem_list, reuse_now) to free the selem after releasing the raw_spinlock. Note that the selem->snode cannot be reused for linking to the free_selem_list, because the selem->snode is protected by the raw_spinlock that we want to avoid holding. A new "struct hlist_node free_node;" is union-ized with the rcu_head. Only the first caller to successfully hlist_del_init_rcu(&selem->snode) will be able to use the free_node. After hlist_del_init_rcu(&selem->snode) succeeds, the free_node and rcu_head usage is serialized such that they can share the 16 bytes in a union. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-5-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Add "bool swap_uptrs" arg to bpf_local_storage_update() and bpf_selem_alloc()Martin KaFai Lau
In a later patch, the task local storage will only accept uptr from the syscall update_elem and will not accept uptr from the bpf prog. The reason is that the bpf prog does not have a way to provide a valid user space address. bpf_local_storage_update() and bpf_selem_alloc() are used by both the bpf prog's bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE) and the bpf syscall's update_elem. A "bool swap_uptrs" arg is added to bpf_local_storage_update() and bpf_selem_alloc() to tell whether it is called by the bpf prog or by the bpf syscall. When swap_uptrs==true, it is called by the syscall. The arg is named (swap_)uptrs because the later patch will swap the uptrs between the newly allocated selem and the user space provided map_value. This makes error handling easier in case map->ops->map_update_elem() fails: the caller can decide whether it needs to unpin the uptr in the user space provided map_value, or whether bpf_local_storage_update() has already taken the uptr ownership and will take care of unpinning it. Only swap_uptrs==false is passed now. The logic to handle the true case will be added in a later patch. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-4-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24bpf: Handle BPF_UPTR in verifierKui-Feng Lee
This patch adds BPF_UPTR support to the verifier. Note that only the map_value will support the "__uptr" type tag. This patch enforces that only BPF_LDX is allowed on the value of an uptr. After BPF_LDX, it will mark the dst_reg as PTR_TO_MEM | PTR_MAYBE_NULL, with the size deduced from the field.kptr.btf_id. This makes the memory pointed to by dst_reg readable and writable as scalar memory. There is a redundant "val_reg = reg_state(env, value_regno);" statement in check_map_kptr_access(); this patch takes the chance to remove it as well. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-3-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
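On the BPF prog side this yields the following access pattern (hedged sketch reusing the datamap example from the uptr patch above; the result field is hypothetical):

    struct map_value *v;
    struct user_data *ud;

    v = bpf_task_storage_get(&datamap, task, 0, 0);
    if (!v)
            return 0;
    ud = v->uptr;         /* BPF_LDX: dst_reg is PTR_TO_MEM | PTR_MAYBE_NULL */
    if (!ud)
            return 0;
    ud->result = 42;      /* plain scalar reads/writes are then allowed */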
2024-10-24bpf: Support __uptr type tag in BTFKui-Feng Lee
This patch introduces the "__uptr" type tag to BTF. It defines a pointer pointing to user space memory. This patch adds the BTF logic to parse the "__uptr" type tag. btf_find_kptr() is reused for the "__uptr" tag. The "__uptr" will only be supported in the map_value of the task storage map. However, btf_parse_struct_meta() also uses btf_find_kptr() but is not interested in "__uptr". This patch adds a "field_mask" argument to btf_find_kptr() which will return BTF_FIELD_IGNORE if the caller is not interested in a "__uptr" field. btf_parse_kptr() is also reused to parse the uptr. The btf_check_and_fixup_fields() is changed to do extra checks on the uptr to ensure that its struct size is not larger than PAGE_SIZE. It is not clear how a uptr pointing to a CO-RE supported kernel struct would be used, so it is also not allowed for now. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241023234759.860539-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
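From the BPF C side the tag looks like this (a sketch mirroring the existing __kptr convention; the struct contents are illustrative):

    #define __uptr __attribute__((btf_type_tag("uptr")))

    struct user_data {
            int a;
    };

    struct map_value {
            /* must fit within one page; CO-RE kernel structs not allowed */
            struct user_data __uptr *udata;
    };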
2024-10-24bpf: Check validity of link->type in bpf_link_show_fdinfo()Hou Tao
If a newly-added link type doesn't invoke BPF_LINK_TYPE(), accessing bpf_link_type_strs[link->type] may result in an out-of-bounds access. To spot such missed invocations early in the future, check the validity of link->type in bpf_link_show_fdinfo() and emit a warning when an invocation is missing. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241024013558.1135167-3-houtao@huaweicloud.com
2024-10-23bpf: fix do_misc_fixups() for bpf_get_branch_snapshot()Andrii Nakryiko
We need `goto next_insn;` at the end of the patching instead of `continue;`. It currently works by accident by making the verifier re-process patched instructions. Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Fixes: 314a53623cd4 ("bpf: inline bpf_get_branch_snapshot() helper") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20241023161916.2896274-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-22bpf: Fix overloading of MEM_UNINIT's meaningDaniel Borkmann
Lonial reported an issue in the BPF verifier where check_mem_size_reg() has the following code:

    if (!tnum_is_const(reg->var_off))
            /* For unprivileged variable accesses, disable raw
             * mode so that the program is required to
             * initialize all the memory that the helper could
             * just partially fill up.
             */
            meta = NULL;

This means that writes are not checked when the register containing the size of the passed buffer does not have a fixed value. Through this bug, a BPF program can write to a map which is marked as read-only, for example, .rodata global maps. The problem is that MEM_UNINIT's initial meaning, that "the passed buffer to the BPF helper does not need to be initialized", which was added back in commit 435faee1aae9 ("bpf, verifier: add ARG_PTR_TO_RAW_STACK type"), got overloaded over time with "the passed buffer is being written to". The problem however is that checks such as the above, which were added later via 06c1c049721a ("bpf: allow helpers access to variable memory"), set meta to NULL in order to force the user to always initialize the passed buffer to the helper. Due to the current double meaning of MEM_UNINIT, this bypasses verifier write checks to the memory (not boundary checks though) and only assumes the latter memory is read instead. Fix this by reverting MEM_UNINIT back to its original meaning, and having MEM_WRITE as an annotation to BPF helpers in order to then trigger the BPF verifier checks for writing to memory. Some notes: check_arg_pair_ok() ensures that for ARG_CONST_SIZE{,_OR_ZERO} we can access fn->arg_type[arg - 1] since it must contain a preceding ARG_PTR_TO_MEM. For check_mem_reg() the meta argument can be removed altogether since we do check both BPF_READ and BPF_WRITE. Same for the equivalent check_kfunc_mem_size_reg(). Fixes: 7b3552d3f9f6 ("bpf: Reject writes for PTR_TO_MAP_KEY in check_helper_mem_access") Fixes: 97e6d7dab1ca ("bpf: Check PTR_TO_MEM | MEM_RDONLY in check_helper_mem_access") Fixes: 15baa55ff5b0 ("bpf/verifier: allow all functions to read user provided context") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241021152809.33343-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-22bpf: Add MEM_WRITE attributeDaniel Borkmann
Add a MEM_WRITE attribute for BPF helper functions which can be used in bpf_func_proto to annotate an argument type in order to let the verifier know that the helper writes into the memory passed as an argument. In the past MEM_UNINIT has been (ab)used for this purpose, but it merely tells the verifier that the passed memory can be uninitialized. There have been bugs with overloading the latter, but aside from that there are also cases where the passed memory is read + written, which currently cannot be expressed; see also 4b3786a6c539 ("bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241021152809.33343-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
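A sketch of how the two flags compose after this change (argument-type definitions along the lines of include/linux/bpf.h):

    /* Pointer to memory does not need to be initialized, and the helper
     * writes into it: all bytes are filled, or cleared in the error case.
     */
    ARG_PTR_TO_UNINIT_MEM = MEM_UNINIT | MEM_WRITE | ARG_PTR_TO_MEM,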
2024-10-22bpf: Preserve param->string when parsing mount optionsHou Tao
In bpf_parse_param(), keep the value of param->string intact so it can be freed later. Otherwise, the kmalloc area pointed to by param->string will be leaked as shown below: unreferenced object 0xffff888118c46d20 (size 8): comm "new_name", pid 12109, jiffies 4295580214 hex dump (first 8 bytes): 61 6e 79 00 38 c9 5c 7e any.8.\~ backtrace (crc e1b7f876): [<00000000c6848ac7>] kmemleak_alloc+0x4b/0x80 [<00000000de9f7d00>] __kmalloc_node_track_caller_noprof+0x36e/0x4a0 [<000000003e29b886>] memdup_user+0x32/0xa0 [<0000000007248326>] strndup_user+0x46/0x60 [<0000000035b3dd29>] __x64_sys_fsconfig+0x368/0x3d0 [<0000000018657927>] x64_sys_call+0xff/0x9f0 [<00000000c0cabc95>] do_syscall_64+0x3b/0xc0 [<000000002f331597>] entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: 6c1752e0b6ca ("bpf: Support symbolic BPF FS delegation mount options") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20241022130133.3798232-1-houtao@huaweicloud.com
2024-10-21bpf: Implement bpf_send_signal_task() kfuncPuranjay Mohan
Implement bpf_send_signal_task kfunc that is similar to bpf_send_signal_thread and bpf_send_signal helpers but can be used to send signals to other threads and processes. It also supports sending a cookie with the signal similar to sigqueue(). If the receiving process establishes a handler for the signal using the SA_SIGINFO flag to sigaction(), then it can obtain this cookie via the si_value field of the siginfo_t structure passed as the second argument to the handler. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241016084136.10305-2-puranjay@kernel.org
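A hedged usage sketch of the new kfunc (the signature follows this series; the tracepoint, signal number and cookie are illustrative):

    extern int bpf_send_signal_task(struct task_struct *task, int sig,
                                    enum pid_type type, u64 value) __ksym;

    SEC("tp_btf/task_newtask")
    int BPF_PROG(notify, struct task_struct *task, u64 clone_flags)
    {
            /* Deliver SIGUSR1 (10) to the whole thread group; a handler
             * installed with SA_SIGINFO can read the cookie from
             * siginfo's si_value.
             */
            return bpf_send_signal_task(task, 10 /* SIGUSR1 */, PIDTYPE_TGID, 0xcafe);
    }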
2024-10-18Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Daniel Borkmann: - Fix BPF verifier to not affect subreg_def marks in its range propagation (Eduard Zingerman) - Fix a truncation bug in the BPF verifier's handling of coerce_reg_to_size_sx (Dimitar Kanaliev) - Fix the BPF verifier's delta propagation between linked registers under 32-bit addition (Daniel Borkmann) - Fix a NULL pointer dereference in BPF devmap due to missing rxq information (Florian Kauer) - Fix a memory leak in bpf_core_apply (Jiri Olsa) - Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for arrays of nested structs (Hou Tao) - Fix build ID fetching where memory areas backing the file were created with memfd_secret (Andrii Nakryiko) - Fix BPF task iterator tid filtering which was incorrectly using pid instead of tid (Jordan Rome) - Several fixes for BPF sockmap and BPF sockhash redirection in combination with vsocks (Michal Luczaj) - Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered (Andrea Parri) - Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility of an infinite BPF tailcall (Pu Lehui) - Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot be resolved (Thomas Weißschuh) - Fix a bug in kfunc BTF caching for modules where the wrong BTF object was returned (Toke Høiland-Jørgensen) - Fix a BPF selftest compilation error in cgroup-related tests with musl libc (Tony Ambardar) - Several fixes to BPF link info dumps to fill missing fields (Tyrone Wu) - Add BPF selftests for kfuncs from multiple modules, checking that the correct kfuncs are called (Simon Sundberg) - Ensure that internal and user-facing bpf_redirect flags don't overlap (Toke Høiland-Jørgensen) - Switch to use kvzmalloc to allocate BPF verifier environment (Rik van Riel) - Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat under RT (Wander Lairson Costa) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (38 commits) lib/buildid: Handle memfd_secret() files in build_id_parse() selftests/bpf: Add test case for delta propagation bpf: Fix print_reg_state's constant scalar dump bpf: Fix incorrect delta propagation between linked registers bpf: Properly test iter/task tid filtering bpf: Fix iter/task tid filtering riscv, bpf: Make BPF_CMPXCHG fully ordered bpf, vsock: Drop static vsock_bpf_prot initialization vsock: Update msg_count on read_skb() vsock: Update rx_bytes on read_skb() bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock selftests/bpf: Add asserts for netfilter link info bpf: Fix link info netfilter flags to populate defrag flag selftests/bpf: Add test for sign extension in coerce_subreg_to_size_sx() selftests/bpf: Add test for truncation after sign extension in coerce_reg_to_size_sx() bpf: Fix truncation bug in coerce_reg_to_size_sx() selftests/bpf: Assert link info uprobe_multi count & path_size if unset bpf: Fix unpopulated path_size when uprobe_multi fields unset selftests/bpf: Fix cross-compiling urandom_read selftests/bpf: Add test for kfunc module order ...
2024-10-17bpf: Fix print_reg_state's constant scalar dumpDaniel Borkmann
print_reg_state() should not consider adding reg->off to reg->var_off.value when dumping scalars. Scalars can be produced with reg->off != 0 through BPF_ADD_CONST, and thus as-is this can skew the register log dump. Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: Nathaniel Theis <nathaniel.theis@nccgroup.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241016134913.32249-2-daniel@iogearbox.net
2024-10-17bpf: Fix incorrect delta propagation between linked registersDaniel Borkmann
Nathaniel reported a bug in the linked scalar delta tracking, which can lead to accepting a program with OOB access. The specific code is related to the sync_linked_regs() function and the BPF_ADD_CONST flag, which signifies a constant offset between two scalar registers tracked by the same register id. The verifier attempts to track "similar" scalars in order to propagate bounds information learned about one scalar to others. For instance, if r1 and r2 are known to contain the same value, then upon encountering 'if (r1 != 0x1234) goto xyz', not only does it know that r1 is equal to 0x1234 on the path where that conditional jump is not taken, it also knows that r2 is. Additionally, with env->bpf_capable set, the verifier will track scalars which should be a constant delta apart (if r1 is known to be one greater than r2, then if r1 is known to be equal to 0x1234, r2 must be equal to 0x1233.) The code path for the latter in adjust_reg_min_max_vals() is reached when processing both 32 and 64-bit addition operations. While adjust_reg_min_max_vals() knows whether dst_reg was produced by a 32 or a 64-bit addition (based on the alu32 bool), the only information saved in dst_reg is the id of the source register (reg->id, or'ed by BPF_ADD_CONST) and the value of the constant offset (reg->off).

Later, the function sync_linked_regs() will attempt to use this information to propagate bounds information from one register (known_reg) to others, meaning, for all R in linked_regs, it copies the known_reg range (possibly adjusting the delta) into R for the case of R->id == known_reg->id. For the delta adjustment, meaning, matching reg->id with BPF_ADD_CONST, the verifier adjusts the register as reg = known_reg; reg += delta, where delta is computed as (s32)reg->off - (s32)known_reg->off and placed as a scalar into a fake_reg to then simulate the addition of reg += fake_reg. This is only correct, however, if the value in reg was created by a 64-bit addition. When reg contains the result of a 32-bit addition operation, its upper 32 bits will always be zero. sync_linked_regs(), on the other hand, may cause the verifier to believe that the addition between fake_reg and reg overflows into those upper bits. For example, if reg was generated by adding the constant 1 to known_reg using a 32-bit alu operation, then reg->off is 1 and known_reg->off is 0. If known_reg is known to be the constant 0xFFFFFFFF, sync_linked_regs() will tell the verifier that reg is equal to the constant 0x100000000. This is incorrect as the actual value of reg will be 0, as the 32-bit addition will wrap around. Example:

    0: (b7) r0 = 0;            R0_w=0
    1: (18) r1 = 0x80000001;   R1_w=0x80000001
    3: (37) r1 /= 1;           R1_w=scalar()
    4: (bf) r2 = r1;           R1_w=scalar(id=1) R2_w=scalar(id=1)
    5: (bf) r4 = r1;           R1_w=scalar(id=1) R4_w=scalar(id=1)
    6: (04) w2 += 2147483647;  R2_w=scalar(id=1+2147483647,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
    7: (04) w4 += 0;           R4_w=scalar(id=1+0,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
    8: (15) if r2 == 0x0 goto pc+1
   10: R0=0 R1=0xffffffff80000001 R2=0x7fffffff R4=0xffffffff80000001 R10=fp0

What can be seen here is that r1 is copied to r2 and r4, such that {r1,r2,r4}.id are all the same, which later allows sync_linked_regs() to be invoked. Then, in the next step, constants are added with alu32 to r2 and r4, setting their ->off, as well as id |= BPF_ADD_CONST. Next, the conditional will bind r2 and propagate ranges to its linked registers.
The verifier now believes the upper 32 bits of r4 are r4=0xffffffff80000001, while actually r4=r1=0x80000001. One approach for a simple fix suitable also for stable is to limit the constant delta tracking to only 64-bit alu addition. If necessary at some later point, BPF_ADD_CONST could be split into BPF_ADD_CONST64 and BPF_ADD_CONST32 to avoid mixing the two under the tradeoff to further complicate sync_linked_regs(). However, none of the added tests from dedf56d775c0 ("selftests/bpf: Add tests for add_const") make this necessary at this point, meaning, BPF CI also passes with just limiting tracking to 64-bit alu addition. Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: Nathaniel Theis <nathaniel.theis@nccgroup.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20241016134913.32249-1-daniel@iogearbox.net
2024-10-17bpf: Fix iter/task tid filteringJordan Rome
In userspace, you can add a tid filter by setting the "task.tid" field for "bpf_iter_link_info". However, `get_pid_task` when called for the `BPF_TASK_ITER_TID` type should have been using `PIDTYPE_PID` (tid) instead of `PIDTYPE_TGID` (pid). Fixes: f0d74c4da1f0 ("bpf: Parameterize task iterators.") Signed-off-by: Jordan Rome <linux@jordanrome.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241016210048.1213935-1-linux@jordanrome.com
2024-10-16bpf: Prevent tailcall infinite loop caused by freplaceLeon Hwang
There is a potential infinite loop issue that can occur when using a combination of tail calls and freplace. In an upcoming selftest, the attach target for entry_freplace of tailcall_freplace.c is subprog_tc of tc_bpf2bpf.c, while the tail call in entry_freplace leads to entry_tc. This results in an infinite loop: entry_tc -> subprog_tc -> entry_freplace --tailcall-> entry_tc. The problem arises because the tail_call_cnt in entry_freplace resets to zero each time entry_freplace is executed, causing the tail call mechanism to never terminate, eventually leading to a kernel panic. To fix this issue, the solution is twofold: 1. Prevent updating a program extended by an freplace program to a prog_array map. 2. Prevent extending a program that is already part of a prog_array map with an freplace program. This ensures that: * If a program or its subprogram has been extended by an freplace program, it can no longer be updated to a prog_array map. * If a program has been added to a prog_array map, neither it nor its subprograms can be extended by an freplace program. Moreover, an extension program should not be tailcalled. As such, return -EINVAL if the program has a type of BPF_PROG_TYPE_EXT when adding it to a prog_array map. Additionally, fix a minor code style issue by replacing eight spaces with a tab for proper formatting. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20241015150207.70264-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-16bpf: Add bpf_task_from_vpid() kfuncJuntong Deng
The existing bpf_task_from_pid() looks up the struct task_struct corresponding to the pid in the root pid namespace (init_pid_ns). This patch adds bpf_task_from_vpid(), which looks up the struct task_struct corresponding to a vpid in the pid namespace of the current process. This is useful for getting information about other processes in the same pid namespace. Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Link: https://lore.kernel.org/r/AM6PR03MB5848E50DA58F79CDE65433C399442@AM6PR03MB5848.eurprd03.prod.outlook.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
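A hedged usage sketch (acquire/release semantics mirror bpf_task_from_pid(); prog type and vpid are illustrative):

    extern struct task_struct *bpf_task_from_vpid(s32 vpid) __ksym;
    extern void bpf_task_release(struct task_struct *p) __ksym;

    SEC("syscall")
    int lookup_vpid(void *ctx)
    {
            /* vpid 1 is the init of the caller's pid namespace */
            struct task_struct *task = bpf_task_from_vpid(1);

            if (!task)
                    return 0;
            bpf_printk("comm=%s", task->comm);
            bpf_task_release(task);   /* acquired ref must be released */
            return 0;
    }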
2024-10-16mm/bpf: Add bpf_get_kmem_cache() kfuncNamhyung Kim
bpf_get_kmem_cache() gets slab cache information from a virtual address, like virt_to_cache(). If the address is a pointer to a slab object, it returns a valid kmem_cache pointer, otherwise NULL is returned. It doesn't grab a reference count of the kmem_cache, so the caller is responsible for managing access to it. The returned pointer is marked as PTR_UNTRUSTED. The intended use case for now is to symbolize locks in slab objects from the lock contention tracepoints. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*) Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241010232505.1339892-3-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
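A hedged sketch matching the stated use case, symbolizing a lock address from the contention tracepoint (kfunc signature per this series):

    extern struct kmem_cache *bpf_get_kmem_cache(u64 addr) __ksym;

    SEC("tp_btf/contention_begin")
    int BPF_PROG(on_contention, void *lock, unsigned int flags)
    {
            struct kmem_cache *s = bpf_get_kmem_cache((u64)(long)lock);

            /* No reference is taken; s is PTR_UNTRUSTED and may be NULL. */
            if (s)
                    bpf_printk("lock sits in slab cache %s", s->name);
            return 0;
    }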
2024-10-15bpf: Fix truncation bug in coerce_reg_to_size_sx()Dimitar Kanaliev
coerce_reg_to_size_sx() updates the register state after a sign-extension operation. However, there's a bug in the assignment order of the unsigned min/max values, leading to incorrect truncation:

    0: (85) call bpf_get_prandom_u32#7  ; R0_w=scalar()
    1: (57) r0 &= 1                     ; R0_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1))
    2: (07) r0 += 254                   ; R0_w=scalar(smin=umin=smin32=umin32=254,smax=umax=smax32=umax32=255,var_off=(0xfe; 0x1))
    3: (bf) r0 = (s8)r0                 ; R0_w=scalar(smin=smin32=-2,smax=smax32=-1,umin=umin32=0xfffffffe,umax=0xffffffff,var_off=(0xfffffffffffffffe; 0x1))

In the current implementation, the unsigned 32-bit min/max values (u32_min_value and u32_max_value) are assigned directly from the 64-bit signed min/max values (s64_min and s64_max):

    reg->umin_value = reg->u32_min_value = s64_min;
    reg->umax_value = reg->u32_max_value = s64_max;

Due to the chain assignment, this is equivalent to:

    reg->u32_min_value = s64_min;          // Unintended truncation
    reg->umin_value = reg->u32_min_value;
    reg->u32_max_value = s64_max;          // Unintended truncation
    reg->umax_value = reg->u32_max_value;

Fixes: 1f9a1ea821ff ("bpf: Support new sign-extension load insns") Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Reported-by: Zac Ecob <zacecob@protonmail.com> Signed-off-by: Dimitar Kanaliev <dimitar.kanaliev@siteground.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20241014121155.92887-2-dimitar.kanaliev@siteground.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
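The fix, as a sketch, simply splits the chains so the 64-bit fields receive the untruncated values (truncation is then intentional and confined to the 32-bit bounds):

    reg->umin_value = s64_min;
    reg->umax_value = s64_max;
    reg->u32_min_value = (u32)s64_min;
    reg->u32_max_value = (u32)s64_max;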
2024-10-14bpf: Add kmem_cache iteratorNamhyung Kim
The new "kmem_cache" iterator will traverse the list of slab caches and call attached BPF programs for each entry. Attached programs should check whether the argument (ctx->s) is NULL before using it. The iteration grabs the slab_mutex only while it traverses the list and releases the mutex while it runs the BPF program. The kmem_cache entry is protected by a refcount during the execution. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> #slab Link: https://lore.kernel.org/r/20241010232505.1339892-2-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
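A hedged sketch of an attached iterator program (the bpf_iter__kmem_cache context layout follows this patch; the printed fields are assumptions about struct kmem_cache):

    SEC("iter/kmem_cache")
    int dump_caches(struct bpf_iter__kmem_cache *ctx)
    {
            struct seq_file *seq = ctx->meta->seq;
            struct kmem_cache *s = ctx->s;

            if (!s)  /* NULL on the final call of the iteration */
                    return 0;
            BPF_SEQ_PRINTF(seq, "%s %u\n", s->name, s->object_size);
            return 0;
    }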
2024-10-10bpf: fix kfunc btf caching for modulesToke Høiland-Jørgensen
The verifier contains a cache for looking up module BTF objects when calling kfuncs defined in modules. This cache uses a 'struct bpf_kfunc_btf_tab', which contains a sorted list of BTF objects that were already seen in the current verifier run, and the BTF objects are looked up by the offset stored in the relocated call instruction using bsearch(). The first time a given offset is seen, the module BTF is loaded from the file descriptor passed in by libbpf, and stored into the cache. However, there's a bug in the code storing the new entry: it stores a pointer to the new cache entry, then calls sort() to keep the cache sorted for the next lookup using bsearch(), and then returns the entry that was just stored through the stored pointer. However, because sort() modifies the list of entries in place *by value*, the stored pointer may no longer point to the right entry, in which case the wrong BTF object will be returned. The end result of this is an intermittent bug where, if a BPF program calls two functions with the same signature in two different modules, the function from the wrong module may sometimes end up being called. Whether this happens depends on the order of the calls in the BPF program (as that affects whether sort() reorders the array of BTF objects), making it especially hard to track down. Simon, credited as reporter below, spent significant effort analysing and creating a reproducer for this issue. The reproducer is added as a selftest in a subsequent patch. The fix is straightforward: simply don't use the stored pointer after calling sort(). Since we already have an on-stack pointer to the BTF object itself at the point where the function returns, just use that, and populate it from the cache entry in the branch where the lookup succeeds. Fixes: 2357672c54c3 ("bpf: Introduce BPF support for kernel module function calls") Reported-by: Simon Sundberg <simon.sundberg@kau.se> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20241010-fix-kfunc-btf-caching-for-modules-v2-1-745af6c1af98@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
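The bug pattern, reduced to a sketch (names approximate kernel/bpf/verifier.c):

    b = &tab->descs[tab->nr_descs++];
    b->btf = btf;
    b->module = btf_mod;
    b->offset = offset;
    /* sort() moves entries by value, so 'b' may now point at a
     * different entry than the one just filled in.
     */
    sort(tab->descs, tab->nr_descs, sizeof(tab->descs[0]),
         kfunc_btf_cmp_by_off, NULL);
    return b->btf;   /* potentially the wrong BTF object */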
2024-10-10bpf: fix argument type in bpf_loop documentationMatteo Croce
The `index` argument to bpf_loop() is treated as a u64. This led to a subtle verifier rejection where clang cloned the argument into another register [1]. [1] https://github.com/systemd/systemd/pull/34650#issuecomment-2401092895 Signed-off-by: Matteo Croce <teknoraver@meta.com> Link: https://lore.kernel.org/r/20241010035652.17830-1-technoboy85@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-09bpf: fix unpopulated name_len field in perf_event link infoTyrone Wu
Previously when retrieving `bpf_link_info.perf_event` for kprobe/uprobe/tracepoint, the `name_len` field was not populated by the kernel, leaving it to reflect the value initially set by the user. This behavior was inconsistent with how other input/output string buffer fields function (e.g. `raw_tracepoint.tp_name_len`). This patch fills `name_len` with the actual size of the string name. Fixes: 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event") Signed-off-by: Tyrone Wu <wudevelops@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20241008164312.46269-1-wudevelops@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-09bpf: use kvzmalloc to allocate BPF verifier environmentRik van Riel
The kzalloc call in bpf_check() can fail when memory is very fragmented, which in turn can lead to an OOM kill. Use kvzmalloc to fall back to vmalloc when memory is too fragmented to allocate an order-3 sized bpf verifier environment. Admittedly this is not a very common case, and only happens on systems where memory has already been squeezed close to the limit, but this does not seem like much of a hot path, and it's a simple enough fix. Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://lore.kernel.org/r/20241008170735.16766766@imladris.surriel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-09bpf: Check the remaining info_cnt before repeating btf fieldsHou Tao
When trying to repeat the btf fields for an array of nested structs, btf_repeat_fields() doesn't check the remaining info_cnt. The following splat will be reported when the value of ret * nelems is greater than BTF_FIELDS_MAX:

  ------------[ cut here ]------------
  UBSAN: array-index-out-of-bounds in ../kernel/bpf/btf.c:3951:49
  index 11 is out of range for type 'btf_field_info [11]'
  CPU: 6 UID: 0 PID: 411 Comm: test_progs ...... 6.11.0-rc4+ #1 Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ...
  Call Trace:
  <TASK>
  dump_stack_lvl+0x57/0x70
  dump_stack+0x10/0x20
  ubsan_epilogue+0x9/0x40
  __ubsan_handle_out_of_bounds+0x6f/0x80
  ? kallsyms_lookup_name+0x48/0xb0
  btf_parse_fields+0x992/0xce0
  map_create+0x591/0x770
  __sys_bpf+0x229/0x2410
  __x64_sys_bpf+0x1f/0x30
  x64_sys_call+0x199/0x9f0
  do_syscall_64+0x3b/0xc0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7fea56f2cc5d
  ......
  </TASK>
  ---[ end trace ]---

Fix it by checking the remaining info_cnt in btf_repeat_fields() before repeating the btf fields. Fixes: 64e8ee814819 ("bpf: look into the types of the fields of a struct type recursively.") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241008071114.3718177-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-08bpf, lsm: Remove bpf_lsm_key_free hookThomas Weißschuh
The key_free LSM hook has been removed. Remove the corresponding BPF hook. Avoid warnings during the build: BTFIDS vmlinux WARN: resolve_btfids: unresolved symbol bpf_lsm_key_free Fixes: 5f8d28f6d7d5 ("lsm: infrastructure management of the key security blob") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20241005-lsm-key_free-v1-1-42ea801dbd63@weissschuh.net
2024-10-07bpf: Fix memory leak in bpf_core_applyJiri Olsa
We need to free specs properly. Fixes: 3d2786d65aaa ("bpf: correctly handle malformed BPF_CORE_TYPE_ID_LOCAL relos") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20241007160958.607434-1-jolsa@kernel.org
2024-10-03bpf: Use KF_FASTCALL to mark kfuncs supporting fastcall contractEduard Zingerman
In order to allow pahole to add btf_decl_tag("bpf_fastcall") for kfuncs supporting bpf_fastcall, mark such functions with KF_FASTCALL in id_set8 objects. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240916091712.2929279-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
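In the id_set8 objects this is a one-flag change, e.g. (a sketch of the flag usage pattern):

    BTF_ID_FLAGS(func, bpf_cast_to_kern_ctx, KF_FASTCALL)
    BTF_ID_FLAGS(func, bpf_rdonly_cast, KF_FASTCALL)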
2024-10-03bpf: Call kfree(obj) only once in free_one()Markus Elfring
A kfree() call is always used at the end of this function implementation. Thus specify such a function call only once instead of duplicating it in a previous if branch. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/08987123-668c-40f3-a8ee-c3038d94f069@web.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-03bpf: Constify struct btf_kind_operationsChristophe JAILLET
struct btf_kind_operations instances are not modified in BTF. Constifying these structures moves some data to a read-only section, increasing overall security, especially when the structure holds function pointers. On x86_64, with allmodconfig:

    Before:
    ======
       text    data     bss     dec     hex  filename
     184320    7091     548  191959   2edd7  kernel/bpf/btf.o

    After:
    =====
       text    data     bss     dec     hex  filename
     184896    6515     548  191959   2edd7  kernel/bpf/btf.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/9192ab72b2e9c66aefd6520f359a20297186327f.1726417289.git.christophe.jaillet@wanadoo.fr Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-02move asm/unaligned.h to linux/unaligned.hAl Viro
asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02bpf: devmap: provide rxq after redirectFlorian Kauer
rxq contains a pointer to the device from where the redirect happened.
Currently, the BPF program that was executed after a redirect via
BPF_MAP_TYPE_DEVMAP* does not have it set.

This is particularly bad since accessing ingress_ifindex, e.g.

	SEC("xdp")
	int prog(struct xdp_md *pkt)
	{
		return bpf_redirect_map(&dev_redirect_map, 0, 0);
	}

	SEC("xdp/devmap")
	int prog_after_redirect(struct xdp_md *pkt)
	{
		bpf_printk("ifindex %i", pkt->ingress_ifindex);
		return XDP_PASS;
	}

depends on access to rxq, so a NULL pointer gets dereferenced:

<1>[ 574.475170] BUG: kernel NULL pointer dereference, address: 0000000000000000
<1>[ 574.475188] #PF: supervisor read access in kernel mode
<1>[ 574.475194] #PF: error_code(0x0000) - not-present page
<6>[ 574.475199] PGD 0 P4D 0
<4>[ 574.475207] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>[ 574.475217] CPU: 4 UID: 0 PID: 217 Comm: kworker/4:1 Not tainted 6.11.0-rc5-reduced-00859-g780801200300 #23
<4>[ 574.475226] Hardware name: Intel(R) Client Systems NUC13ANHi7/NUC13ANBi7, BIOS ANRPL357.0026.2023.0314.1458 03/14/2023
<4>[ 574.475231] Workqueue: mld mld_ifc_work
<4>[ 574.475247] RIP: 0010:bpf_prog_5e13354d9cf5018a_prog_after_redirect+0x17/0x3c
<4>[ 574.475257] Code: cc cc cc cc cc cc cc 80 00 00 00 cc cc cc cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 66 90 55 48 89 e5 f3 0f 1e fa 48 8b 57 20 <48> 8b 52 00 8b 92 e0 00 00 00 48 bf f8 a6 d5 c4 5d a0 ff ff be 0b
<4>[ 574.475263] RSP: 0018:ffffa62440280c98 EFLAGS: 00010206
<4>[ 574.475269] RAX: ffffa62440280cd8 RBX: 0000000000000001 RCX: 0000000000000000
<4>[ 574.475274] RDX: 0000000000000000 RSI: ffffa62440549048 RDI: ffffa62440280ce0
<4>[ 574.475278] RBP: ffffa62440280c98 R08: 0000000000000002 R09: 0000000000000001
<4>[ 574.475281] R10: ffffa05dc8b98000 R11: ffffa05f577fca40 R12: ffffa05dcab24000
<4>[ 574.475285] R13: ffffa62440280ce0 R14: ffffa62440549048 R15: ffffa62440549000
<4>[ 574.475289] FS:  0000000000000000(0000) GS:ffffa05f4f700000(0000) knlGS:0000000000000000
<4>[ 574.475294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 574.475298] CR2: 0000000000000000 CR3: 000000025522e000 CR4: 0000000000f50ef0
<4>[ 574.475303] PKRU: 55555554
<4>[ 574.475306] Call Trace:
<4>[ 574.475313]  <IRQ>
<4>[ 574.475318]  ? __die+0x23/0x70
<4>[ 574.475329]  ? page_fault_oops+0x180/0x4c0
<4>[ 574.475339]  ? skb_pp_cow_data+0x34c/0x490
<4>[ 574.475346]  ? kmem_cache_free+0x257/0x280
<4>[ 574.475357]  ? exc_page_fault+0x67/0x150
<4>[ 574.475368]  ? asm_exc_page_fault+0x26/0x30
<4>[ 574.475381]  ? bpf_prog_5e13354d9cf5018a_prog_after_redirect+0x17/0x3c
<4>[ 574.475386]  bq_xmit_all+0x158/0x420
<4>[ 574.475397]  __dev_flush+0x30/0x90
<4>[ 574.475407]  veth_poll+0x216/0x250 [veth]
<4>[ 574.475421]  __napi_poll+0x28/0x1c0
<4>[ 574.475430]  net_rx_action+0x32d/0x3a0
<4>[ 574.475441]  handle_softirqs+0xcb/0x2c0
<4>[ 574.475451]  do_softirq+0x40/0x60
<4>[ 574.475458]  </IRQ>
<4>[ 574.475461]  <TASK>
<4>[ 574.475464]  __local_bh_enable_ip+0x66/0x70
<4>[ 574.475471]  __dev_queue_xmit+0x268/0xe40
<4>[ 574.475480]  ? selinux_ip_postroute+0x213/0x420
<4>[ 574.475491]  ? alloc_skb_with_frags+0x4a/0x1d0
<4>[ 574.475502]  ip6_finish_output2+0x2be/0x640
<4>[ 574.475512]  ? nf_hook_slow+0x42/0xf0
<4>[ 574.475521]  ip6_finish_output+0x194/0x300
<4>[ 574.475529]  ? __pfx_ip6_finish_output+0x10/0x10
<4>[ 574.475538]  mld_sendpack+0x17c/0x240
<4>[ 574.475548]  mld_ifc_work+0x192/0x410
<4>[ 574.475557]  process_one_work+0x15d/0x380
<4>[ 574.475566]  worker_thread+0x29d/0x3a0
<4>[ 574.475573]  ? __pfx_worker_thread+0x10/0x10
<4>[ 574.475580]  ? __pfx_worker_thread+0x10/0x10
<4>[ 574.475587]  kthread+0xcd/0x100
<4>[ 574.475597]  ? __pfx_kthread+0x10/0x10
<4>[ 574.475606]  ret_from_fork+0x31/0x50
<4>[ 574.475615]  ? __pfx_kthread+0x10/0x10
<4>[ 574.475623]  ret_from_fork_asm+0x1a/0x30
<4>[ 574.475635]  </TASK>
<4>[ 574.475637] Modules linked in: veth br_netfilter bridge stp llc iwlmvm x86_pkg_temp_thermal iwlwifi efivarfs nvme nvme_core
<4>[ 574.475662] CR2: 0000000000000000
<4>[ 574.475668] ---[ end trace 0000000000000000 ]---

Therefore, provide it to the program by setting rxq properly.

Fixes: cb261b594b41 ("bpf: Run devmap xdp_prog on flush instead of bulk enqueue")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Florian Kauer <florian.kauer@linutronix.de>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20240911-devel-koalo-fix-ingress-ifindex-v4-1-5c643ae10258@linutronix.de
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
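To make the shape of such a fix concrete, a minimal sketch in C of how the
devmap flush path can populate rxq before running the program. This is an
illustration, not the exact patch; run_devmap_prog and its parameters are
approximations of the corresponding logic in kernel/bpf/devmap.c.

	/* Illustrative sketch only -- names approximate the devmap flush
	 * path. The idea: rebuild an xdp_rxq_info from the device the
	 * frame originally arrived on (saved at redirect time), so that
	 * pkt->ingress_ifindex resolves instead of dereferencing NULL.
	 */
	static u32 run_devmap_prog(struct bpf_prog *xdp_prog,
				   struct xdp_frame *xdpf,
				   struct net_device *tx_dev,
				   struct net_device *rx_dev)
	{
		struct xdp_txq_info txq = { .dev = tx_dev };
		struct xdp_rxq_info rxq = {
			.dev = rx_dev,	/* device the redirect came from */
			.mem = xdpf->mem,
		};
		struct xdp_buff xdp;

		xdp_convert_frame_to_buff(xdpf, &xdp);
		xdp.txq = &txq;
		xdp.rxq = &rxq;	/* previously left NULL -> oops above */

		return bpf_prog_run_xdp(xdp_prog, &xdp);
	}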
2024-10-01bpf: sync_linked_regs() must preserve subreg_defEduard Zingerman
Range propagation must not affect subreg_def marks, otherwise the
following example is rewritten by verifier incorrectly when
BPF_F_TEST_RND_HI32 flag is set:

  0: call bpf_ktime_get_ns                    call bpf_ktime_get_ns
  1: r0 &= 0x7fffffff        after verifier   r0 &= 0x7fffffff
  2: w1 = w0                 rewrites         w1 = w0
  3: if w0 < 10 goto +0      -------------->  r11 = 0x2f5674a6      (r)
  4: r1 >>= 32                                r11 <<= 32            (r)
  5: r0 = r1                                  r1 |= r11             (r)
  6: exit;                                    if w0 < 0xa goto pc+0
                                              r1 >>= 32
                                              r0 = r1
                                              exit

(or zero extension of w1 at (2) is missing for architectures that
require zero extension for upper register half).

The following happens w/o this patch:
- r0 is marked as not a subreg at (0);
- w1 is marked as subreg at (2);
- w1 subreg_def is overridden at (3) by copy_register_state();
- w1 is read at (5) but mark_insn_zext() does not mark (2)
  for zero extension, because w1 subreg_def is not set;
- because of BPF_F_TEST_RND_HI32 flag verifier inserts random
  value for hi32 bits of (2) (marked (r));
- this random value is read at (5).

Fixes: 75748837b7e5 ("bpf: Propagate scalar ranges through register assignments.")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://lore.kernel.org/bpf/7e2aa30a62d740db182c170fdd8f81c596df280d.camel@gmail.com
Link: https://lore.kernel.org/bpf/20240924210844.1758441-1-eddyz87@gmail.com
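Conceptually, the repair is to preserve the destination register's
subreg_def across the range copy inside sync_linked_regs(). A minimal
sketch of that idea (illustrative helper, not the literal diff):

	/* copy_register_state() clones the whole verifier register
	 * state, including subreg_def; range propagation must restore
	 * the destination's own subreg_def afterwards so that
	 * mark_insn_zext() still sees the w1 = w0 subreg write.
	 */
	static void sync_linked_reg(struct bpf_reg_state *reg,
				    const struct bpf_reg_state *known_reg)
	{
		u32 saved_subreg_def = reg->subreg_def;

		copy_register_state(reg, known_reg); /* clobbers subreg_def */
		reg->subreg_def = saved_subreg_def;  /* restore the mark */
	}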
2024-09-27[tree-wide] finally take no_llseek outAl Viro
no_llseek had been defined to NULL two years ago, in commit 868941b14441
("fs: remove no_llseek")

To quote that commit,

  At -rc1 we'll need do a mechanical removal of no_llseek -

  git grep -l -w no_llseek | grep -v porting.rst | while read i; do
	sed -i '/\<no_llseek\>/d' $i
  done

  would do it.

Unfortunately, that hadn't been done. Linus, could you do that now,
so that we could finally put that thing to rest? All instances are
of the form

	.llseek = no_llseek,

so it's obviously safe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
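To illustrate what that sed deletes, a hypothetical fops initializer
(foo_fops and foo_read are made-up names for this example):

	static ssize_t foo_read(struct file *file, char __user *buf,
				size_t count, loff_t *ppos);

	/* The tree-wide sed drops the .llseek line below; since
	 * no_llseek had been defined to NULL, the initializer is
	 * identical with or without it.
	 */
	static const struct file_operations foo_fops = {
		.owner	= THIS_MODULE,
		.read	= foo_read,
		.llseek	= no_llseek,	/* <- line removed by the sed */
	};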
2024-09-25bpf: Use raw_spinlock_t in ringbufWander Lairson Costa
The function __bpf_ringbuf_reserve is invoked from a tracepoint, which
disables preemption. Using spinlock_t in this context can lead to a
"sleep in atomic" warning in the RT variant. This issue is illustrated
in the example below:

BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 556208, name: test_progs
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 1
INFO: lockdep is turned off.
Preemption disabled at:
[<ffffd33a5c88ea44>] migrate_enable+0xc0/0x39c
CPU: 7 PID: 556208 Comm: test_progs Tainted: G
Hardware name: Qualcomm SA8775P Ride (DT)
Call trace:
 dump_backtrace+0xac/0x130
 show_stack+0x1c/0x30
 dump_stack_lvl+0xac/0xe8
 dump_stack+0x18/0x30
 __might_resched+0x3bc/0x4fc
 rt_spin_lock+0x8c/0x1a4
 __bpf_ringbuf_reserve+0xc4/0x254
 bpf_ringbuf_reserve_dynptr+0x5c/0xdc
 bpf_prog_ac3d15160d62622a_test_read_write+0x104/0x238
 trace_call_bpf+0x238/0x774
 perf_call_bpf_enter.isra.0+0x104/0x194
 perf_syscall_enter+0x2f8/0x510
 trace_sys_enter+0x39c/0x564
 syscall_trace_enter+0x220/0x3c0
 do_el0_svc+0x138/0x1dc
 el0_svc+0x54/0x130
 el0t_64_sync_handler+0x134/0x150
 el0t_64_sync+0x17c/0x180

Switch the spinlock to raw_spinlock_t to avoid this error.

Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Brian Grech <bgrech@redhat.com>
Signed-off-by: Wander Lairson Costa <wander.lairson@gmail.com>
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240920190700.617253-1-wander@redhat.com
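A minimal sketch of what such a conversion looks like, on a simplified
demo structure rather than the real kernel/bpf/ringbuf.c layout:

	#include <linux/spinlock.h>

	/* Illustrative only: on PREEMPT_RT a spinlock_t becomes a
	 * sleeping rt_mutex-based lock, while raw_spinlock_t stays a
	 * true spinning lock and remains safe in preemption-disabled
	 * tracepoint context.
	 */
	struct rb_demo {
		raw_spinlock_t spinlock;	/* was: spinlock_t spinlock; */
		unsigned long producer_pos;
	};

	static unsigned long rb_demo_reserve(struct rb_demo *rb,
					     unsigned long size)
	{
		unsigned long flags, pos;

		/* raw_spin_lock_irqsave() never sleeps, even on RT */
		raw_spin_lock_irqsave(&rb->spinlock, flags);
		pos = rb->producer_pos;
		rb->producer_pos += size;
		raw_spin_unlock_irqrestore(&rb->spinlock, flags);
		return pos;
	}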
2024-09-24Merge tag 'bpf-next-6.12-struct-fd' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf 'struct fd' updates from Alexei Starovoitov:
 "This includes struct_fd BPF changes from Al and Andrii"

* tag 'bpf-next-6.12-struct-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  bpf: convert bpf_token_create() to CLASS(fd, ...)
  security,bpf: constify struct path in bpf_token_create() LSM hook
  bpf: more trivial fdget() conversions
  bpf: trivial conversions for fdget()
  bpf: switch maps to CLASS(fd, ...)
  bpf: factor out fetching bpf_map from FD and adding it to used_maps list
  bpf: switch fdget_raw() uses to CLASS(fd_raw, ...)
  bpf: convert __bpf_prog_get() to CLASS(fd, ...)
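For context, the fdget() to CLASS(fd, ...) pattern these conversions
follow, shown on a hypothetical helper (demo_op and do_something are
made-up names; the fd accessors come from include/linux/file.h):

	/* Before: manual fdget()/fdput() pairing on every exit path. */
	int demo_op_legacy(int ufd)
	{
		struct fd f = fdget(ufd);
		int err;

		if (!fd_file(f))
			return -EBADF;
		err = do_something(fd_file(f));
		fdput(f);
		return err;
	}

	/* After: CLASS(fd, ...) releases the fd automatically when f
	 * goes out of scope, so early returns need no cleanup. */
	int demo_op(int ufd)
	{
		CLASS(fd, f)(ufd);

		if (fd_empty(f))
			return -EBADF;
		return do_something(fd_file(f));
	}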