summaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)Author
2025-03-18bpf: Reject attaching fexit/fmod_ret to __noreturn functionsYafang Shao
If we attach fexit/fmod_ret to __noreturn functions, it will cause an issue that the bpf trampoline image will be left over even if the bpf link has been destroyed. Take attaching do_exit() with fexit for example. The fexit works as follows, bpf_trampoline + __bpf_tramp_enter + percpu_ref_get(&tr->pcref); + call do_exit() + __bpf_tramp_exit + percpu_ref_put(&tr->pcref); Since do_exit() never returns, the refcnt of the trampoline image is never decremented, preventing it from being freed. That can be verified with as follows, $ bpftool link show <<<< nothing output $ grep "bpf_trampoline_[0-9]" /proc/kallsyms ffffffffc04cb000 t bpf_trampoline_6442526459 [bpf] <<<< leftover In this patch, all functions annotated with __noreturn are rejected, except for the following cases: - Functions that result in a system reboot, such as panic, machine_real_restart and rust_begin_unwind - Functions that are never executed by tasks, such as rest_init and cpu_startup_entry - Functions implemented in assembly, such as rewind_stack_and_make_dead and xen_cpu_bringup_again, lack an associated BTF ID. With this change, attaching fexit probes to functions like do_exit() will be rejected. $ ./fexit libbpf: prog 'fexit': BPF program load failed: -EINVAL libbpf: prog 'fexit': -- BEGIN PROG LOAD LOG -- Attaching fexit/fmod_ret to __noreturn functions is rejected. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20250318114447.75484-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-18bpf: Only fails the busy counter check in bpf_cgrp_storage_get if it creates ↵Martin KaFai Lau
storage The current cgrp storage has a percpu counter, bpf_cgrp_storage_busy, to detect potential deadlock at a spin_lock that the local storage acquires during new storage creation. There are false positives. It turns out to be too noisy in production. For example, a bpf prog may be doing a bpf_cgrp_storage_get on map_a. An IRQ comes in and triggers another bpf_cgrp_storage_get on a different map_b. It will then trigger the false positive deadlock check in the percpu counter. On top of that, both are doing lookup only and no need to create new storage, so practically it does not need to acquire the spin_lock. The bpf_task_storage_get already has a strategy to minimize this false positive by only failing if the bpf_task_storage_get needs to create a new storage and the percpu counter is busy. Creating a new storage is the only time it must acquire the spin_lock. This patch borrows the same idea. Unlike task storage that has a separate variant for tracing (_recur) and non-tracing, this patch stays with one bpf_cgrp_storage_get helper to keep it simple for now in light of the upcoming res_spin_lock. The variable could potentially use a better name noTbusy instead of nobusy. This patch follows the same naming in bpf_task_storage_get for now. I have tested it by temporarily adding noinline to the cgroup_storage_lookup(), traced it by fentry, and the fentry program succeeded in calling bpf_cgrp_storage_get(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250318182759.3676094-1-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-18bpf: Make perf_event_read_output accessible in all program types.Emil Tsalapatis
The perf_event_read_event_output helper is currently only available to tracing protrams, but is useful for other BPF programs like sched_ext schedulers. When the helper is available, provide its bpf_func_proto directly from the bpf base_proto. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20250318030753.10949-1-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-17bpftool: Using the right format specifiersJiayuan Chen
Fixed some formatting specifiers errors, such as using %d for int and %u for unsigned int, as well as other byte-length types. Perform type cast using the type derived from the data type itself, for example, if it's originally an int, it will be cast to unsigned int if forced to unsigned. Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250311112809.81901-3-jiayuan.chen@linux.dev
2025-03-17bpf: Return prog btf_id without capable checkMykyta Yatsenko
Return prog's btf_id from bpf_prog_get_info_by_fd regardless of capable check. This patch enables scenario, when freplace program, running from user namespace, requires to query target prog's btf. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/20250317174039.161275-3-mykyta.yatsenko5@gmail.com
2025-03-17bpf: BPF token support for BPF_BTF_GET_FD_BY_IDMykyta Yatsenko
Currently BPF_BTF_GET_FD_BY_ID requires CAP_SYS_ADMIN, which does not allow running it from user namespace. This creates a problem when freplace program running from user namespace needs to query target program BTF. This patch relaxes capable check from CAP_SYS_ADMIN to CAP_BPF and adds support for BPF token that can be passed in attributes to syscall. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250317174039.161275-2-mykyta.yatsenko5@gmail.com
2025-03-15bpf: Check map->record at the beginning of check_and_free_fields()Hou Tao
When there are no special fields in the map value, there is no need to invoke bpf_obj_free_fields(). Therefore, checking the validity of map->record in advance. After the change, the benchmark result of the per-cpu update case in map_perf_test increased by 40% under a 16-CPU VM. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250315150930.1511727-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: preload: Add MODULE_DESCRIPTIONArnd Bergmann
Modpost complains when extra warnings are enabled: WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/bpf/preload/bpf_preload.o Add a description from the Kconfig help text. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250310134920.4123633-1-arnd@kernel.org ---- Not sure if that description actually fits what the module does. If not, please add a different description instead. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15security: Propagate caller information in bpf hooksBlaise Boscaccy
Certain bpf syscall subcommands are available for usage from both userspace and the kernel. LSM modules or eBPF gatekeeper programs may need to take a different course of action depending on whether or not a BPF syscall originated from the kernel or userspace. Additionally, some of the bpf_attr struct fields contain pointers to arbitrary memory. Currently the functionality to determine whether or not a pointer refers to kernel memory or userspace memory is exposed to the bpf verifier, but that information is missing from various LSM hooks. Here we augment the LSM hooks to provide this data, by simply passing a boolean flag indicating whether or not the call originated in the kernel, in any hook that contains a bpf_attr struct that corresponds to a subcommand that may be called from the kernel. Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20250310221737.821889-2-bboscaccy@linux.microsoft.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: fix missing kdoc string fields in cpumask.cEmil Tsalapatis
Some bpf_cpumask-related kfuncs have kdoc strings that are missing return values. Add a the missing descriptions for the return values. Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250309230427.26603-4-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: add kfunc for populating cpumask bitsEmil Tsalapatis
Add a helper kfunc that sets the bitmap of a bpf_cpumask from BPF memory. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Acked-by: Hou Tao <houtao1@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250309230427.26603-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: correct use/def for may_goto instructionEduard Zingerman
may_goto instruction does not use any registers, but in compute_insn_live_regs() it was treated as a regular conditional jump of kind BPF_K with r0 as source register. Thus unnecessarily marking r0 as used. Fixes: 14c8552db644 ("bpf: simple DFA-based live registers analysis") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250305085436.2731464-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: use register liveness information for func_states_equalEduard Zingerman
Liveness analysis DFA computes a set of registers live before each instruction. Leverage this information to skip comparison of dead registers in func_states_equal(). This helps with convergance of iterator processing loops, as bpf_reg_state->live marks can't be used when loops are processed. This has certain performance impact for selftests, here is a veristat listing using `-f "insns_pct>5" -f "!insns<200"` selftests: File Program States (A) States (B) States (DIFF) -------------------- ----------------------------- ---------- ---------- -------------- arena_htab.bpf.o arena_htab_llvm 37 35 -2 (-5.41%) arena_htab_asm.bpf.o arena_htab_asm 37 33 -4 (-10.81%) arena_list.bpf.o arena_list_add 37 22 -15 (-40.54%) dynptr_success.bpf.o test_dynptr_copy 22 16 -6 (-27.27%) dynptr_success.bpf.o test_dynptr_copy_xdp 68 58 -10 (-14.71%) iters.bpf.o checkpoint_states_deletion 918 40 -878 (-95.64%) iters.bpf.o clean_live_states 136 66 -70 (-51.47%) iters.bpf.o iter_nested_deeply_iters 43 37 -6 (-13.95%) iters.bpf.o iter_nested_iters 72 62 -10 (-13.89%) iters.bpf.o iter_pass_iter_ptr_to_subprog 30 26 -4 (-13.33%) iters.bpf.o iter_subprog_iters 68 59 -9 (-13.24%) iters.bpf.o loop_state_deps2 35 32 -3 (-8.57%) iters_css.bpf.o iter_css_for_each 32 29 -3 (-9.38%) pyperf600_iter.bpf.o on_event 286 192 -94 (-32.87%) Total progs: 3578 Old success: 2061 New success: 2061 States diff min: -95.64% States diff max: 0.00% -100 .. -90 %: 1 -55 .. -45 %: 3 -45 .. -35 %: 2 -35 .. -25 %: 5 -20 .. -10 %: 12 -10 .. 0 %: 6 sched_ext: File Program States (A) States (B) States (DIFF) ----------------- ---------------------- ---------- ---------- --------------- bpf.bpf.o lavd_dispatch 8950 7065 -1885 (-21.06%) bpf.bpf.o lavd_init 516 480 -36 (-6.98%) bpf.bpf.o layered_dispatch 662 501 -161 (-24.32%) bpf.bpf.o layered_dump 298 237 -61 (-20.47%) bpf.bpf.o layered_init 523 423 -100 (-19.12%) bpf.bpf.o layered_init_task 24 22 -2 (-8.33%) bpf.bpf.o layered_runnable 151 125 -26 (-17.22%) bpf.bpf.o p2dq_dispatch 66 53 -13 (-19.70%) bpf.bpf.o p2dq_init 170 142 -28 (-16.47%) bpf.bpf.o refresh_layer_cpumasks 120 78 -42 (-35.00%) bpf.bpf.o rustland_init 37 34 -3 (-8.11%) bpf.bpf.o rustland_init 37 34 -3 (-8.11%) bpf.bpf.o rusty_select_cpu 125 108 -17 (-13.60%) scx_central.bpf.o central_dispatch 59 43 -16 (-27.12%) scx_central.bpf.o central_init 39 28 -11 (-28.21%) scx_nest.bpf.o nest_init 58 51 -7 (-12.07%) scx_pair.bpf.o pair_dispatch 142 111 -31 (-21.83%) scx_qmap.bpf.o qmap_dispatch 174 141 -33 (-18.97%) scx_qmap.bpf.o qmap_init 768 654 -114 (-14.84%) Total progs: 216 Old success: 186 New success: 186 States diff min: -35.00% States diff max: 0.00% -35 .. -25 %: 3 -25 .. -20 %: 6 -20 .. -15 %: 6 -15 .. -5 %: 7 -5 .. 0 %: 6 Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250304195024.2478889-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: simple DFA-based live registers analysisEduard Zingerman
Compute may-live registers before each instruction in the program. The register is live before the instruction I if it is read by I or some instruction S following I during program execution and is not overwritten between I and S. This information would be used in the next patch as a hint in func_states_equal(). Use a simple algorithm described in [1] to compute this information: - define the following: - I.use : a set of all registers read by instruction I; - I.def : a set of all registers written by instruction I; - I.in : a set of all registers that may be alive before I execution; - I.out : a set of all registers that may be alive after I execution; - I.successors : a set of instructions S that might immediately follow I for some program execution; - associate separate empty sets 'I.in' and 'I.out' with each instruction; - visit each instruction in a postorder and update corresponding 'I.in' and 'I.out' sets as follows: I.out = U [S.in for S in I.successors] I.in = (I.out / I.def) U I.use (where U stands for set union, / stands for set difference) - repeat the computation while I.{in,out} changes for any instruction. On implementation side keep things as simple, as possible: - check_cfg() already marks instructions EXPLORED in post-order, modify it to save the index of each EXPLORED instruction in a vector; - represent I.{in,out,use,def} as bitmasks; - don't split the program into basic blocks and don't maintain the work queue, instead: - do fixed-point computation by visiting each instruction; - maintain a simple 'changed' flag if I.{in,out} for any instruction change; Measurements show that even such simplistic implementation does not add measurable verification time overhead (for selftests, at-least). Note on check_cfg() ex_insn_beg/ex_done change: To avoid out of bounds access to env->cfg.insn_postorder array, it should be guaranteed that instruction transitions to EXPLORED state only once. Previously this was not the fact for incorrect programs with direct calls to exception callbacks. The 'align' selftest needs adjustment to skip computed insn/live registers printout. Otherwise it matches lines from the live registers printout. [1] https://en.wikipedia.org/wiki/Live-variable_analysis Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250304195024.2478889-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: get_call_summary() utility functionEduard Zingerman
Refactor mark_fastcall_pattern_for_call() to extract a utility function get_call_summary(). For a helper or kfunc call this function fills the following information: {num_params, is_void, fastcall}. This function would be used in the next patch in order to get number of parameters of a helper or kfunc call. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250304195024.2478889-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: jmp_offset() and verbose_insn() utility functionsEduard Zingerman
Extract two utility functions: - One BPF jump instruction uses .imm field to encode jump offset, while the rest use .off. Encapsulate this detail as jmp_offset() function. - Avoid duplicating instruction printing callback definitions by defining a verbose_insn() function, which disassembles an instruction into the verifier log while hiding this detail. These functions will be used in the next patch. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250304195024.2478889-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: Introduce load-acquire and store-release instructionsPeilin Ye
Introduce BPF instructions with load-acquire and store-release semantics, as discussed in [1]. Define 2 new flags: #define BPF_LOAD_ACQ 0x100 #define BPF_STORE_REL 0x110 A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_LOAD_ACQ (0x100). Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_STORE_REL (0x110). Unlike existing atomic read-modify-write operations that only support BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and store-releases also support BPF_B (8-bit) and BPF_H (16-bit). As an exception, however, 64-bit load-acquires/store-releases are not supported on 32-bit architectures (to fix a build error reported by the kernel test robot). An 8- or 16-bit load-acquire zero-extends the value before writing it to a 32-bit register, just like ARM64 instruction LDARH and friends. Similar to existing atomic read-modify-write operations, misaligned load-acquires/store-releases are not allowed (even if BPF_F_ANY_ALIGNMENT is set). As an example, consider the following 64-bit load-acquire BPF instruction (assuming little-endian): db 10 00 00 00 01 00 00 r0 = load_acquire((u64 *)(r1 + 0x0)) opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX imm (0x00000100): BPF_LOAD_ACQ Similarly, a 16-bit BPF store-release: cb 21 00 00 10 01 00 00 store_release((u16 *)(r1 + 0x0), w2) opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX imm (0x00000110): BPF_STORE_REL In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new instructions, until the corresponding JIT compiler supports them in arena. [1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@google.com/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/a217f46f0e445fbd573a1a024be5c6bf1d5fe716.1741049567.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: Add verifier support for timed may_gotoKumar Kartikeya Dwivedi
Implement support in the verifier for replacing may_goto implementation from a counter-based approach to one which samples time on the local CPU to have a bigger loop bound. We implement it by maintaining 16-bytes per-stack frame, and using 8 bytes for maintaining the count for amortizing time sampling, and 8 bytes for the starting timestamp. To minimize overhead, we need to avoid spilling and filling of registers around this sequence, so we push this cost into the time sampling function 'arch_bpf_timed_may_goto'. This is a JIT-specific wrapper around bpf_check_timed_may_goto which returns us the count to store into the stack through BPF_REG_AX. All caller-saved registers (r0-r5) are guaranteed to remain untouched. The loop can be broken by returning count as 0, otherwise we dispatch into the function when the count drops to 0, and the runtime chooses to refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or returning 0 and aborting the loop on next iteration. Since the check for 0 is done right after loading the count from the stack, all subsequent cond_break sequences should immediately break as well, of the same loop or subsequent loops in the program. We pass in the stack_depth of the count (and thus the timestamp, by adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be passed in to bpf_check_timed_may_goto as an argument after r1 is saved, by adding the offset to r10/fp. This adjustment will be arch specific, and the next patch will introduce support for x86. Note that depending on loop complexity, time spent in the loop can be more than the current limit (250 ms), but imposing an upper bound on program runtime is an orthogonal problem which will be addressed when program cancellations are supported. The current time afforded by cond_break may not be enough for cases where BPF programs want to implement locking algorithms inline, and use cond_break as a promise to the verifier that they will eventually terminate. Below are some benchmarking numbers on the time taken per-iteration for an empty loop that counts the number of iterations until cond_break fires. For comparison, we compare it against bpf_for/bpf_repeat which is another way to achieve the same number of spins (BPF_MAX_LOOPS). The hardware used for benchmarking was a Sapphire Rapids Intel server with performance governor enabled, mitigations were enabled. +-----------------------------+--------------+--------------+------------------+ | Loop type | Iterations | Time (ms) | Time/iter (ns) | +-----------------------------|--------------+--------------+------------------+ | may_goto | 8388608 | 3 | 0.36 | | timed_may_goto (count=65535)| 589674932 | 250 | 0.42 | | bpf_for | 8388608 | 10 | 1.19 | +-----------------------------+--------------+--------------+------------------+ This gives a good approximation at low overhead while staying close to the current implementation. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250304003239.2390751-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: Factor out check_load_mem() and check_store_reg()Peilin Ye
Extract BPF_LDX and most non-ATOMIC BPF_STX instruction handling logic in do_check() into helper functions to be used later. While we are here, make that comment about "reserved fields" more specific. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/8b39c94eac2bb7389ff12392ca666f939124ec4f.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: Factor out check_atomic_rmw()Peilin Ye
Currently, check_atomic() only handles atomic read-modify-write (RMW) instructions. Since we are planning to introduce other types of atomic instructions (i.e., atomic load/store), extract the existing RMW handling logic into its own function named check_atomic_rmw(). Remove the @insn_idx parameter as it is not really necessary. Use 'env->insn_idx' instead, as in other places in verifier.c. Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/6323ac8e73a10a1c8ee547c77ed68cf8eb6b90e1.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: Factor out atomic_ptr_type_ok()Peilin Ye
Factor out atomic_ptr_type_ok() as a helper function to be used later. Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/e5ef8b3116f3fffce78117a14060ddce05eba52a.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: no longer acquire map_idr_lock in bpf_map_inc_not_zero()Eric Dumazet
bpf_sk_storage_clone() is the only caller of bpf_map_inc_not_zero() and is holding rcu_read_lock(). map_idr_lock does not add any protection, just remove the cost for passive TCP flows. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kui-Feng Lee <kuifeng@meta.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20250301191315.1532629-1-edumazet@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: Summarize sleepable global subprogsKumar Kartikeya Dwivedi
The verifier currently does not permit global subprog calls when a lock is held, preemption is disabled, or when IRQs are disabled. This is because we don't know whether the global subprog calls sleepable functions or not. In case of locks, there's an additional reason: functions called by the global subprog may hold additional locks etc. The verifier won't know while verifying the global subprog whether it was called in context where a spin lock is already held by the program. Perform summarization of the sleepable nature of a global subprog just like changes_pkt_data and then allow calls to global subprogs for non-sleepable ones from atomic context. While making this change, I noticed that RCU read sections had no protection against sleepable global subprog calls, include it in the checks and fix this while we're at it. Care needs to be taken to not allow global subprog calls when regular bpf_spin_lock is held. When resilient spin locks is held, we want to potentially have this check relaxed, but not for now. Also make sure extensions freplacing global functions cannot do so in case the target is non-sleepable, but the extension is. The other combination is ok. Tests are included in the next patch to handle all special conditions. Fixes: 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250301151846.1552362-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: Allow pre-ordering for bpf cgroup progsYonghong Song
Currently for bpf progs in a cgroup hierarchy, the effective prog array is computed from bottom cgroup to upper cgroups (post-ordering). For example, the following cgroup hierarchy root cgroup: p1, p2 subcgroup: p3, p4 have BPF_F_ALLOW_MULTI for both cgroup levels. The effective cgroup array ordering looks like p3 p4 p1 p2 and at run time, progs will execute based on that order. But in some cases, it is desirable to have root prog executes earlier than children progs (pre-ordering). For example, - prog p1 intends to collect original pkt dest addresses. - prog p3 will modify original pkt dest addresses to a proxy address for security reason. The end result is that prog p1 gets proxy address which is not what it wants. Putting p1 to every child cgroup is not desirable either as it will duplicate itself in many child cgroups. And this is exactly a use case we are encountering in Meta. To fix this issue, let us introduce a flag BPF_F_PREORDER. If the flag is specified at attachment time, the prog has higher priority and the ordering with that flag will be from top to bottom (pre-ordering). For example, in the above example, root cgroup: p1, p2 subcgroup: p3, p4 Let us say p2 and p4 are marked with BPF_F_PREORDER. The final effective array ordering will be p2 p4 p3 p1 Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250224230116.283071-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf/helpers: Introduce bpf_dynptr_copy kfuncMykyta Yatsenko
Introducing bpf_dynptr_copy kfunc allowing copying data from one dynptr to another. This functionality is useful in scenarios such as capturing XDP data to a ring buffer. The implementation consists of 4 branches: * A fast branch for contiguous buffer capacity in both source and destination dynptrs * 3 branches utilizing __bpf_dynptr_read and __bpf_dynptr_write to copy data to/from non-contiguous buffer Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250226183201.332713-3-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf/helpers: Refactor bpf_dynptr_read and bpf_dynptr_writeMykyta Yatsenko
Refactor bpf_dynptr_read and bpf_dynptr_write helpers: extract code into the static functions namely __bpf_dynptr_read and __bpf_dynptr_write, this allows calling these without compiler warnings. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250226183201.332713-2-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-08net: move misc netdev_lock flavors to a separate headerJakub Kicinski
Move the more esoteric helpers for netdev instance lock to a dedicated header. This avoids growing netdevice.h to infinity and makes rebuilding the kernel much faster (after touching the header with the helpers). The main netdev_lock() / netdev_unlock() functions are used in static inlines in netdevice.h and will probably be used most commonly, so keep them in netdevice.h. Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-07bpf: fix a possible NULL deref in bpf_map_offload_map_alloc()Eric Dumazet
Call bpf_dev_offload_check() before netdev_lock_ops(). This is needed if attr->map_ifindex is not valid. Oops: general protection fault, probably for non-canonical address 0xdffffc0000000197: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000cb8-0x0000000000000cbf] RIP: 0010:netdev_need_ops_lock include/linux/netdevice.h:2792 [inline] RIP: 0010:netdev_lock_ops include/linux/netdevice.h:2803 [inline] RIP: 0010:bpf_map_offload_map_alloc+0x19a/0x910 kernel/bpf/offload.c:533 Call Trace: <TASK> map_create+0x946/0x11c0 kernel/bpf/syscall.c:1455 __sys_bpf+0x6d3/0x820 kernel/bpf/syscall.c:5777 __do_sys_bpf kernel/bpf/syscall.c:5902 [inline] __se_sys_bpf kernel/bpf/syscall.c:5900 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5900 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 Fixes: 97246d6d21c2 ("net: hold netdev instance lock during ndo_bpf") Reported-by: syzbot+0c7bfd8cf3aecec92708@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/67caa2b1.050a0220.15b4b9.0077.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250307074303.1497911-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06net: hold netdev instance lock during ndo_bpfStanislav Fomichev
Cover the paths that come via bpf system call and XSK bind. Cc: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250305163732.2766420-10-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-04x86/smp: Move cpu number to percpu hot sectionBrian Gerst
No functional change. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Uros Bizjak <ubizjak@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250303165246.2175811-5-brgerst@gmail.com
2025-03-04Merge branch 'x86/cpu' into x86/asm, to pick up dependent commitsIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-02-27x86/bpf: Fix BPF percpu accessesBrian Gerst
Due to this recent commit in the x86 tree: 9d7de2aa8b41 ("Use relative percpu offsets") percpu addresses went from positive offsets from the GSBASE to negative kernel virtual addresses. The BPF verifier has an optimization for x86-64 that loads the address of cpu_number into a register, but was only doing a 32-bit load which truncates negative addresses. Change it to a 64-bit load so that the address is properly sign-extended. Fixes: 9d7de2aa8b41 ("Use relative percpu offsets") Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250227195302.1667654-1-brgerst@gmail.com
2025-02-27Change inode_operations.mkdir to return struct dentry *NeilBrown
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have complete control of sequencing on the actual filesystem (e.g. on a different server) and may find that the inode created for a mkdir request already exists in the icache and dcache by the time the mkdir request returns. For example, if the filesystem is mounted twice the directory could be visible on the other mount before it is on the original mount, and a pair of name_to_handle_at(), open_by_handle_at() calls could instantiate the directory inode with an IS_ROOT() dentry before the first mkdir returns. This means that the dentry passed to ->mkdir() may not be the one that is associated with the inode after the ->mkdir() completes. Some callers need to interact with the inode after the ->mkdir completes and they currently need to perform a lookup in the (rare) case that the dentry is no longer hashed. This lookup-after-mkdir requires that the directory remains locked to avoid races. Planned future patches to lock the dentry rather than the directory will mean that this lookup cannot be performed atomically with the mkdir. To remove this barrier, this patch changes ->mkdir to return the resulting dentry if it is different from the one passed in. Possible returns are: NULL - the directory was created and no other dentry was used ERR_PTR() - an error occurred non-NULL - this other dentry was spliced in This patch only changes file-systems to return "ERR_PTR(err)" instead of "err" or equivalent transformations. Subsequent patches will make further changes to some file-systems to return a correct dentry. Not all filesystems reliably result in a positive hashed dentry: - NFS, cifs, hostfs will sometimes need to perform a lookup of the name to get inode information. Races could result in this returning something different. Note that this lookup is non-atomic which is what we are trying to avoid. Placing the lookup in filesystem code means it only happens when the filesystem has no other option. - kernfs and tracefs leave the dentry negative and the ->revalidate operation ensures that lookup will be called to correctly populate the dentry. This could be fixed but I don't think it is important to any of the users of vfs_mkdir() which look at the dentry. The recommendation to use d_drop();d_splice_alias() is ugly but fits with current practice. A planned future patch will change this. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: NeilBrown <neilb@suse.de> Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc5). Conflicts: drivers/net/ethernet/cadence/macb_main.c fa52f15c745c ("net: cadence: macb: Synchronize stats calculations") 75696dd0fd72 ("net: cadence: macb: Convert to get_stats64") https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au Adjacent changes: drivers/net/ethernet/intel/ice/ice_sriov.c 79990cf5e7ad ("ice: Fix deinitializing VF in error path") a203163274a4 ("ice: simplify VF MSI-X managing") net/ipv4/tcp.c 18912c520674 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace") 297d389e9e5b ("net: prefix devmem specific helpers") net/mptcp/subflow.c 8668860b0ad3 ("mptcp: reset when MPTCP opts are dropped after join") c3349a22c200 ("mptcp: consolidate subflow cleanup") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27bpf: Use try_alloc_pages() to allocate pages for bpf needs.Alexei Starovoitov
Use try_alloc_pages() and free_pages_nolock() for BPF needs when context doesn't allow using normal alloc_pages. This is a prerequisite for further work. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20250222024427.30294-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-27bpf: cpumap: switch to napi_skb_cache_get_bulk()Alexander Lobakin
Now that cpumap uses GRO, which drops unused skb heads to the NAPI cache, use napi_skb_cache_get_bulk() to try to reuse cached entries and lower MM layer pressure. Always disable the BH before checking and running the cpumap-pinned XDP prog and don't re-enable it in between that and allocating an skb bulk, as we can access the NAPI caches only from the BH context. The better GRO aggregates packets, the less new skbs will be allocated. If an aggregated skb contains 16 frags, this means 15 skbs were returned to the cache, so next 15 skbs will be built without allocating anything. The same trafficgen UDP GRO test now shows: GRO off GRO on threaded GRO 2.3 4 Mpps thr bulk GRO 2.4 4.7 Mpps diff +4 +17 % Comparing to the baseline cpumap: baseline 2.7 N/A Mpps thr bulk GRO 2.4 4.7 Mpps diff -11 +74 % Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27bpf: cpumap: reuse skb array instead of a linked list to chain skbsAlexander Lobakin
cpumap still uses linked lists to store a list of skbs to pass to the stack. Now that we don't use listified Rx in favor of napi_gro_receive(), linked list is now an unneeded overhead. Inside the polling loop, we already have an array of skbs. Let's reuse it for skbs passed to cpumap (generic XDP) and keep there in case of XDP_PASS when a program is installed to the map itself. Don't list regular xdp_frames after converting them to skbs as well; store them in the mentioned array (but *before* generic skbs as the latters have lower priority) and call gro_receive_skb() for each array element after they're done. Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27bpf: cpumap: switch to GRO from netif_receive_skb_list()Alexander Lobakin
cpumap has its own BH context based on kthread. It has a sane batch size of 8 frames per one cycle. GRO can be used here on its own. Adjust cpumap calls to the upper stack to use GRO API instead of netif_receive_skb_list() which processes skbs by batches, but doesn't involve GRO layer at all. In plenty of tests, GRO performs better than listed receiving even given that it has to calculate full frame checksums on the CPU. As GRO passes the skbs to the upper stack in the batches of @gro_normal_batch, i.e. 8 by default, and skb->dev points to the device where the frame comes from, it is enough to disable GRO netdev feature on it to completely restore the original behaviour: untouched frames will be being bulked and passed to the upper stack by 8, as it was with netif_receive_skb_list(). Tested-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-25selftests/bpf: Test gen_pro/epilogue that generate kfuncsAmery Hung
Test gen_prologue and gen_epilogue that generate kfuncs that have not been seen in the main program. The main bpf program and return value checks are identical to pro_epilogue.c introduced in commit 47e69431b57a ("selftests/bpf: Test gen_prologue and gen_epilogue"). However, now when bpf_testmod_st_ops detects a program name with prefix "test_kfunc_", it generates slightly different prologue and epilogue: They still add 1000 to args->a in prologue, add 10000 to args->a and set r0 to 2 * args->a in epilogue, but involve kfuncs. At high level, the alternative version of prologue and epilogue look like this: cgrp = bpf_cgroup_from_id(0); if (cgrp) bpf_cgroup_release(cgrp); else /* Perform what original bpf_testmod_st_ops prologue or * epilogue does */ Since 0 is never a valid cgroup id, the original prologue or epilogue logic will be performed. As a result, the __retval check should expect the exact same return value. Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250225233545.285481-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-25bpf: Search and add kfuncs in struct_ops prologue and epilogueAmery Hung
Currently, add_kfunc_call() is only invoked once before the main verification loop. Therefore, the verifier could not find the bpf_kfunc_btf_tab of a new kfunc call which is not seen in user defined struct_ops operators but introduced in gen_prologue or gen_epilogue during do_misc_fixup(). Fix this by searching kfuncs in the patching instruction buffer and add them to prog->aux->kfunc_tab. Signed-off-by: Amery Hung <amery.hung@bytedance.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250225233545.285481-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-25bpf: abort verification if env->cur_state->loop_entry != NULLEduard Zingerman
In addition to warning abort verification with -EFAULT. If env->cur_state->loop_entry != NULL something is irrecoverably buggy. Fixes: bbbc02b7445e ("bpf: copy_verifier_state() should copy 'loop_entry' field") Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250225003838.135319-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-24bpf: Fix kmemleak warning for percpu hashmapYonghong Song
Vlad Poenaru reported the following kmemleak issue: unreferenced object 0x606fd7c44ac8 (size 32): backtrace (crc 0): pcpu_alloc_noprof+0x730/0xeb0 bpf_map_alloc_percpu+0x69/0xc0 prealloc_init+0x9d/0x1b0 htab_map_alloc+0x363/0x510 map_create+0x215/0x3a0 __sys_bpf+0x16b/0x3e0 __x64_sys_bpf+0x18/0x20 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Further investigation shows the reason is due to not 8-byte aligned store of percpu pointer in htab_elem_set_ptr(): *(void __percpu **)(l->key + key_size) = pptr; Note that the whole htab_elem alignment is 8 (for x86_64). If the key_size is 4, that means pptr is stored in a location which is 4 byte aligned but not 8 byte aligned. In mm/kmemleak.c, scan_block() scans the memory based on 8 byte stride, so it won't detect above pptr, hence reporting the memory leak. In htab_map_alloc(), we already have htab->elem_size = sizeof(struct htab_elem) + round_up(htab->map.key_size, 8); if (percpu) htab->elem_size += sizeof(void *); else htab->elem_size += round_up(htab->map.value_size, 8); So storing pptr with 8-byte alignment won't cause any problem and can fix kmemleak too. The issue can be reproduced with bpf selftest as well: 1. Enable CONFIG_DEBUG_KMEMLEAK config 2. Add a getchar() before skel destroy in test_hash_map() in prog_tests/for_each.c. The purpose is to keep map available so kmemleak can be detected. 3. run './test_progs -t for_each/hash_map &' and a kmemleak should be reported. Reported-by: Vlad Poenaru <thevlad@meta.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250224175514.2207227-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-23bpf: Refactor check_ctx_access()Amery Hung
Reduce the variable passing madness surrounding check_ctx_access(). Currently, check_mem_access() passes many pointers to local variables to check_ctx_access(). They are used to initialize "struct bpf_insn_access_aux info" in check_ctx_access() and then passed to is_valid_access(). Then, check_ctx_access() takes the data our from info and write them back the pointers to pass them back. This can be simpilified by moving info up to check_mem_access(). No functional change. Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20250221175644.1822383-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-20bpf: Do not allow tail call in strcut_ops program with __ref argumentAmery Hung
Reject struct_ops programs with refcounted kptr arguments (arguments tagged with __ref suffix) that tail call. Once a refcounted kptr is passed to a struct_ops program from the kernel, it can be freed or xchged into maps. As there is no guarantee a callee can get the same valid refcounted kptr in the ctx, we cannot allow such usage. Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250220221532.1079331-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf bpf-6.14-rc4Alexei Starovoitov
Cross-merge bpf fixes after downstream PR (bpf-6.14-rc4). Minor conflict: kernel/bpf/btf.c Adjacent changes: kernel/bpf/arena.c kernel/bpf/btf.c kernel/bpf/syscall.c kernel/bpf/verifier.c mm/memory.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-20Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull BPF fixes from Daniel Borkmann: - Fix a soft-lockup in BPF arena_map_free on 64k page size kernels (Alan Maguire) - Fix a missing allocation failure check in BPF verifier's acquire_lock_state (Kumar Kartikeya Dwivedi) - Fix a NULL-pointer dereference in trace_kfree_skb by adding kfree_skb to the raw_tp_null_args set (Kuniyuki Iwashima) - Fix a deadlock when freeing BPF cgroup storage (Abel Wu) - Fix a syzbot-reported deadlock when holding BPF map's freeze_mutex (Andrii Nakryiko) - Fix a use-after-free issue in bpf_test_init when eth_skb_pkt_type is accessing skb data not containing an Ethernet header (Shigeru Yoshida) - Fix skipping non-existing keys in generic_map_lookup_batch (Yan Zhai) - Several BPF sockmap fixes to address incorrect TCP copied_seq calculations, which prevented correct data reads from recv(2) in user space (Jiayuan Chen) - Two fixes for BPF map lookup nullness elision (Daniel Xu) - Fix a NULL-pointer dereference from vmlinux BTF lookup in bpf_sk_storage_tracing_allowed (Jared Kangas) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests: bpf: test batch lookup on array of maps with holes bpf: skip non exist keys in generic_map_lookup_batch bpf: Handle allocation failure in acquire_lock_state bpf: verifier: Disambiguate get_constant_map_key() errors bpf: selftests: Test constant key extraction on irrelevant maps bpf: verifier: Do not extract constant map keys for irrelevant maps bpf: Fix softlockup in arena_map_free on 64k page kernel net: Add rx_skb of kfree_skb to raw_tp_null_args[]. bpf: Fix deadlock when freeing cgroup storage selftests/bpf: Add strparser test for bpf selftests/bpf: Fix invalid flag of recv() bpf: Disable non stream socket for strparser bpf: Fix wrong copied_seq calculation strparser: Add read_sock callback bpf: avoid holding freeze_mutex during mmap operation bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic selftests/bpf: Adjust data size to have ETH_HLEN bpf, test_run: Fix use-after-free issue in eth_skb_pkt_type() bpf: Remove unnecessary BTF lookups in bpf_sk_storage_tracing_allowed
2025-02-20bpf: Support selective sampling for bpf timestampingJason Xing
Add the bpf_sock_ops_enable_tx_tstamp kfunc to allow BPF programs to selectively enable TX timestamping on a skb during tcp_sendmsg(). For example, BPF program will limit tracking X numbers of packets and then will stop there instead of tracing all the sendmsgs of matched flow all along. It would be helpful for users who cannot afford to calculate latencies from every sendmsg call probably due to the performance or storage space consideration. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-12-kerneljasonxing@gmail.com
2025-02-19bpf: Add bpf_copy_from_user_task_str() kfuncJordan Rome
This new kfunc will be able to copy a zero-terminated C strings from another task's address space. This is similar to `bpf_copy_from_user_str()` but reads memory of specified task. Signed-off-by: Jordan Rome <linux@jordanrome.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250213152125.1837400-2-linux@jordanrome.com
2025-02-18bpf: fix env->peak_states computationEduard Zingerman
Compute env->peak_states as a maximum value of sum of env->explored_states and env->free_list size. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250215110411.3236773-11-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18bpf: free verifier states when they are no longer referencedEduard Zingerman
When fixes from patches 1 and 3 are applied, Patrick Somaru reported an increase in memory consumption for sched_ext iterator-based programs hitting 1M instructions limit. For example, 2Gb VMs ran out of memory while verifying a program. Similar behaviour could be reproduced on current bpf-next master. Here is an example of such program: /* verification completes if given 16G or RAM, * final env->free_list size is 369,960 entries. */ SEC("raw_tp") __flag(BPF_F_TEST_STATE_FREQ) __success int free_list_bomb(const void *ctx) { volatile char buf[48] = {}; unsigned i, j; j = 0; bpf_for(i, 0, 10) { /* this forks verifier state: * - verification of current path continues and * creates a checkpoint after 'if'; * - verification of forked path hits the * checkpoint and marks it as loop_entry. */ if (bpf_get_prandom_u32()) asm volatile (""); /* this marks 'j' as precise, thus any checkpoint * created on current iteration would not be matched * on the next iteration. */ buf[j++] = 42; j %= ARRAY_SIZE(buf); } asm volatile (""::"r"(buf)); return 0; } Memory consumption increased due to more states being marked as loop entries and eventually added to env->free_list. This commit introduces logic to free states from env->free_list during verification. A state in env->free_list can be freed if: - it has no child states; - it is not used as a loop_entry. This commit: - updates bpf_verifier_state->used_as_loop_entry to be a counter that tracks how many states use this one as a loop entry; - adds a function maybe_free_verifier_state(), which: - frees a state if its ->branches and ->used_as_loop_entry counters are both zero; - if the state is freed, state->loop_entry->used_as_loop_entry is decremented, and an attempt is made to free state->loop_entry. In the example above, this approach reduces the maximum number of states in the free list from 369,960 to 16,223. However, this approach has its limitations. If the buf size in the example above is modified to 64, state caching overflows: the state for j=0 is evicted from the cache before it can be used to stop traversal. As a result, states in the free list accumulate because their branch counters do not reach zero. The effect of this patch on the selftests looks as follows: File Program Max free list (A) Max free list (B) Max free list (DIFF) -------------------------------- ------------------------------------ ----------------- ----------------- -------------------- arena_list.bpf.o arena_list_add 17 3 -14 (-82.35%) bpf_iter_task_stack.bpf.o dump_task_stack 39 9 -30 (-76.92%) iters.bpf.o checkpoint_states_deletion 265 89 -176 (-66.42%) iters.bpf.o clean_live_states 19 0 -19 (-100.00%) profiler2.bpf.o tracepoint__syscalls__sys_enter_kill 102 1 -101 (-99.02%) profiler3.bpf.o tracepoint__syscalls__sys_enter_kill 144 0 -144 (-100.00%) pyperf600_iter.bpf.o on_event 15 0 -15 (-100.00%) pyperf600_nounroll.bpf.o on_event 1170 1158 -12 (-1.03%) setget_sockopt.bpf.o skops_sockopt 18 0 -18 (-100.00%) strobemeta_nounroll1.bpf.o on_event 147 83 -64 (-43.54%) strobemeta_nounroll2.bpf.o on_event 312 209 -103 (-33.01%) strobemeta_subprogs.bpf.o on_event 124 86 -38 (-30.65%) test_cls_redirect_subprogs.bpf.o cls_redirect 15 0 -15 (-100.00%) timer.bpf.o test1 30 15 -15 (-50.00%) Measured using "do-not-submit" patches from here: https://github.com/eddyz87/bpf/tree/get-loop-entry-hungup Reported-by: Patrick Somaru <patsomaru@meta.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250215110411.3236773-10-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>