AgeCommit message (Collapse)Author
2018-05-24Merge branch 'gretap-mirroring-selftests'David S. Miller
Petr Machata says: ==================== selftests: forwarding: Additions to mirror-to-gretap tests This patchset is for a handful of edge cases in mirror-to-gretap scenarios: removal of mirrored-to netdevice (#1), removal of underlay route for tunnel remote endpoint (#2) and cessation of mirroring upon removal of flower mirroring rule (#3). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24selftests: forwarding: Test removal of mirroringPetr Machata
Test that when flower-based mirror action is removed, mirroring stops. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24selftests: forwarding: Test removal of underlay routePetr Machata
When underlay route is removed, the mirrored traffic should not be forwarded. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24selftests: forwarding: Test mirroring to deleted devicePetr Machata
Tests that the mirroring code catches up with deletion of a mirrored-to device. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24cxgb4: Check for kvzalloc allocation failureYueHaibing
t4_prep_fw() does not check the card_fw pointer before storing the read data in it, which could lead to a NULL pointer dereference if kvzalloc() failed. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
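The pattern behind this fix can be sketched in userspace C. This is only an illustration, not the cxgb4 code: `calloc()` stands in for `kvzalloc()`, and `struct fw_hdr_sketch` and `read_fw_header()` are hypothetical names.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a firmware header; not the cxgb4 layout. */
struct fw_hdr_sketch {
	unsigned int version;
	unsigned char payload[60];
};

/*
 * Copy a firmware header image into a freshly allocated, zeroed buffer.
 * Returns 0 on success or -ENOMEM if the allocation fails. *out is only
 * written on success, so the caller never dereferences a NULL buffer.
 */
int read_fw_header(const void *src, struct fw_hdr_sketch **out)
{
	/* calloc() plays the role of kvzalloc() in this userspace sketch. */
	struct fw_hdr_sketch *card_fw = calloc(1, sizeof(*card_fw));

	if (!card_fw)		/* check before storing any data */
		return -ENOMEM;

	memcpy(card_fw, src, sizeof(*card_fw));
	*out = card_fw;
	return 0;
}
```

The caller owns the returned buffer and releases it with `free()`.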
2018-05-24Merge branch 'xdp_xmit-bulking'Alexei Starovoitov
Jesper Dangaard Brouer says:

====================
This patchset changes the ndo_xdp_xmit API to take a bulk of xdp frames.

When the kernel is compiled with CONFIG_RETPOLINE, every indirect function pointer (branch) call hurts performance. For XDP this has a huge negative performance impact.

This patchset reduces the needed (indirect) calls to ndo_xdp_xmit, but also prepares for further optimizations. The DMA API's use of indirect function pointer calls is the primary source of the regression. It is left for a followup patchset to use bulking calls towards the DMA API (via the scatter-gather calls).

The other advantage of this API change is that drivers can more easily amortize the cost of any sync/locking scheme over the bulk of packets. The assumption of the current API is that the driver implementing the NDO will also allocate a dedicated XDP TX queue for every CPU in the system, which is not always possible or practical to configure. E.g. ixgbe cannot load an XDP program on a machine with more than 96 CPUs, due to limited hardware TX queues. E.g. virtio_net is hard to configure as it requires manually increasing the queues. E.g. the tun driver chooses to use a per XDP frame producer lock modulo smp_processor_id over avail queues.

I considered adding 'flags' to ndo_xdp_xmit, but it's not part of this patchset. This will be a followup patchset, once we know if this will be needed (e.g. for non-map xdp_redirect flush-flag, and if AF_XDP chooses to use ndo_xdp_xmit for TX).

---
V5: Fixed up issues spotted by Daniel and John

V4: Split out the patches from 4 to 8 patches. I cannot split the driver changes from the NDO change, but I've tried to isolate the NDO change together with the driver change as much as possible.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24samples/bpf: xdp_monitor use err code from tracepoint xdp:xdp_devmap_xmitJesper Dangaard Brouer
Update xdp_monitor to use the recently added err code introduced in tracepoint xdp:xdp_devmap_xmit, to show if the drop count is caused by a general driver delivery problem. Other kinds of drops will likely just be more normal TX space issues. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24xdp/trace: extend tracepoint in devmap with an errJesper Dangaard Brouer
Extending tracepoint xdp:xdp_devmap_xmit in devmap with an err code allows people to more easily identify why the ndo_xdp_xmit call to a given driver is failing. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24xdp: change ndo_xdp_xmit API to support bulkingJesper Dangaard Brouer
This patch changes the API for ndo_xdp_xmit to support bulking xdp_frames.

When the kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown. Most of the slowdown is caused by DMA API indirect function calls, but also by the net_device->ndo_xdp_xmit() call.

Benchmarking the patch with CONFIG_RETPOLINE, using xdp_redirect_map with a single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed improved performance:
 for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
 for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps

With frames available as a bulk inside the driver ndo_xdp_xmit call, further optimizations are possible, like bulk DMA-mapping for TX.

Testing without CONFIG_RETPOLINE shows the same performance for physical NIC drivers. The virtual NIC driver tun sees a huge performance boost, as it can avoid doing per-frame producer locking and instead amortize the locking cost over the bulk.

V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
V4: Isolated ndo, driver changes and callers.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
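The locking argument above can be illustrated with a small userspace sketch. It assumes a hypothetical `struct txq_sketch` with a counter in place of a real lock; `xmit_one()` and `xmit_bulk()` are illustrative names and only loosely mirror the shape of the old and new ndo_xdp_xmit calls.

```c
#include <stddef.h>

/* Hypothetical TX queue; the "lock" is just a counter for illustration. */
struct txq_sketch {
	unsigned long lock_acquisitions;
	unsigned long frames_sent;
};

static void txq_lock(struct txq_sketch *q)   { q->lock_acquisitions++; }
static void txq_unlock(struct txq_sketch *q) { (void)q; }

/* Per-frame shape: one call, and one lock round trip, for every frame. */
int xmit_one(struct txq_sketch *q, void *frame)
{
	(void)frame;
	txq_lock(q);
	q->frames_sent++;
	txq_unlock(q);
	return 0;
}

/*
 * Bulk shape: the caller hands over n frames at once, so the lock is
 * taken once and its cost is amortized over the whole bulk.
 * Returns the number of frames queued.
 */
int xmit_bulk(struct txq_sketch *q, int n, void **frames)
{
	int i;

	txq_lock(q);
	for (i = 0; i < n; i++) {
		(void)frames[i];	/* a real driver would post the frame */
		q->frames_sent++;
	}
	txq_unlock(q);
	return n;
}
```

Sending 16 frames through `xmit_one()` takes the lock 16 times, while a single `xmit_bulk()` call with `n = 16` takes it once. The same reasoning applies to the indirect call itself, which is the cost CONFIG_RETPOLINE makes expensive.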
2018-05-24xdp: introduce xdp_return_frame_rx_napiJesper Dangaard Brouer
When sending an xdp_frame through the xdp_do_redirect call, error cases can happen where the xdp_frame needs to be dropped, and returning an -errno code isn't sufficient/possible any longer (e.g. for the cpumap case). This is already fully supported, by simply calling xdp_return_frame. This patch is an optimization, which provides xdp_return_frame_rx_napi, a faster variant for these error cases. It takes advantage of the protection provided by XDP RX running under NAPI protection. This change is mostly relevant for drivers using the page_pool allocator, as it can take advantage of this. (Tested with mlx5). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24samples/bpf: xdp_monitor use tracepoint xdp:xdp_devmap_xmitJesper Dangaard Brouer
The xdp_monitor sample/tool is updated to use the new tracepoint xdp:xdp_devmap_xmit the previous patch just introduced. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24xdp: add tracepoint for devmap like cpumap haveJesper Dangaard Brouer
Notice how this allows us to get XDP statistics without affecting the XDP performance, as the tracepoint is no longer activated on a per-packet basis. V5: Spotted by John Fastabend. Fix 'sent' also counted 'drops' in this patch; a later patch corrected this, but it was a mistake in this intermediate step. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24bpf: devmap prepare xdp frames for bulkingJesper Dangaard Brouer
Like cpumap, create a queue for xdp frames that will be bulked. For now, this patch simply invokes ndo_xdp_xmit for each frame. This happens either when the map flush operation is invoked, or when the limit DEV_MAP_BULK_SIZE is reached. V5: Avoid memleak on error path in dev_map_update_elem() Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
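The two flush triggers described above (explicit flush, or the queue reaching its limit) can be sketched as follows. This is a userspace illustration, not the devmap.c code: `BULK_SIZE`, `struct bulk_queue`, `bq_flush()` and `bq_enqueue()` are names chosen here, and the value 16 is an illustrative limit.

```c
#define BULK_SIZE 16	/* illustrative limit, playing the role of DEV_MAP_BULK_SIZE */

struct bulk_queue {
	void *q[BULK_SIZE];
	unsigned int count;	/* frames currently queued */
	unsigned long flushes;	/* number of non-empty flushes */
	unsigned long sent;	/* frames handed to the "driver" */
};

/* Drain the queue; at this stage each frame is still sent one by one. */
void bq_flush(struct bulk_queue *bq)
{
	unsigned int i;

	if (!bq->count)
		return;
	for (i = 0; i < bq->count; i++)
		bq->sent++;	/* stand-in for one transmit call per frame */
	bq->flushes++;
	bq->count = 0;
}

/* Queue a frame, flushing first if the queue is already full. */
void bq_enqueue(struct bulk_queue *bq, void *frame)
{
	if (bq->count == BULK_SIZE)
		bq_flush(bq);
	bq->q[bq->count++] = frame;
}
```

A caller that enqueues 20 frames triggers one automatic flush of 16 frames and is left with 4 queued, which the explicit flush at the end of the poll cycle then drains.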
2018-05-24bpf: devmap introduce dev_map_enqueueJesper Dangaard Brouer
Functionality is the same, but the ndo_xdp_xmit call is now simply invoked from inside the devmap.c code. V2: Fix compile issue reported by kbuild test robot <lkp@intel.com> V5: Cleanups requested by Daniel - Newlines before func definition - Use BUILD_BUG_ON checks - Remove unnecessary storing of the return value in dev_map_enqueue Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24Merge branch 'bpf-task-fd-query'Alexei Starovoitov
Yonghong Song says:

====================
Currently, suppose a userspace application has loaded a bpf program and attached it to a tracepoint/kprobe/uprobe, and a bpf introspection tool, e.g., bpftool, wants to show which bpf program is attached to which tracepoint/kprobe/uprobe. Such attachment information will be really useful to understand the overall bpf deployment in the system.

There is a name field (16 bytes) for each program, which could be used to encode the attachment point. There are some drawbacks to this approach. First, a bpftool user (e.g., an admin) may not really understand the association between the name and the attachment point. Second, if one program is attached to multiple places, encoding a proper name which can imply all these attachments becomes difficult.

This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY. Given a pid and fd, this command will return bpf related information to user space. Right now it only supports tracepoint/kprobe/uprobe perf event fd's. For such a fd, BPF_TASK_FD_QUERY will return
   . prog_id
   . tracepoint name, or
   . k[ret]probe funcname + offset or kernel addr, or
   . u[ret]probe filename + offset
to the userspace. The user can use "bpftool prog" to find more information about the bpf program itself with prog_id.

Patch #1 adds function perf_get_event() in kernel/events/core.c.
Patch #2 implements the bpf subcommand BPF_TASK_FD_QUERY.
Patch #3 syncs tools bpf.h header and also adds bpf_task_fd_query() in the libbpf library for samples/selftests/bpftool to use.
Patch #4 adds ksym_get_addr() utility function.
Patch #5 adds a test in samples/bpf for querying k[ret]probes and u[ret]probes.
Patch #6 adds a test in tools/testing/selftests/bpf for querying raw_tracepoint and tracepoint.
Patch #7 adds a new subcommand "perf" to bpftool.

Changelogs:
  v4 -> v5:
    . return strlen(buf) instead of strlen(buf) + 1 in the attr.buf_len. As long as the user provides a non-empty buffer, it will be filled with an empty string, a truncated string, or the full string, based on the buffer size and the length of the to-be-copied string.
  v3 -> v4:
    . made attr buf_len input/output. The length of the actual buffer is written to buf_len so user space knows what is actually needed. If the user provides a buffer with length >= 1 but less than required, do partial copy and return -ENOSPC.
    . code simplification with put_user.
    . changed query result attach_info to fd_type.
    . add tests at selftests/bpf to test zero len, null buf and insufficient buf.
  v2 -> v3:
    . made perf_get_event() return perf_event pointer const. this was to ensure that event fields are not meddled.
    . detect whether the new BPF_TASK_FD_QUERY is supported or not in "bpftool perf" and warn users if it is not.
  v1 -> v2:
    . changed bpf subcommand name from BPF_PERF_EVENT_QUERY to BPF_TASK_FD_QUERY.
    . fixed various "bpftool perf" issues and added documentation and auto-completion.
====================

Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
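The buf_len behaviour described in the v4 and v5 changelog entries can be sketched in plain C. This is a userspace model, not the kernel implementation: `copy_name()` is a hypothetical helper, `memcpy()` stands in for the copy to user space, and the handling of a NULL or zero-length buffer (report the length and succeed) is an assumption, since the text above only says those cases are tested.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `name` into the caller's buffer `ubuf`.
 * On entry *buf_len is the buffer size; on return it holds strlen(name),
 * so the caller learns how much room is actually needed.
 */
int copy_name(const char *name, char *ubuf, unsigned int *buf_len)
{
	unsigned int in_len = *buf_len;
	unsigned int len = (unsigned int)strlen(name);

	*buf_len = len;			/* v5: strlen, not strlen + 1 */

	if (!ubuf || in_len == 0)
		return 0;		/* assumption: length-only probe succeeds */

	if (in_len >= len + 1) {	/* room for the full string and NUL */
		memcpy(ubuf, name, len + 1);
		return 0;
	}

	/* Too small: copy what fits, keep it NUL-terminated, signal it. */
	memcpy(ubuf, name, in_len - 1);
	ubuf[in_len - 1] = '\0';
	return -ENOSPC;
}
```

A caller can therefore probe with a small buffer, read the required length back from `*buf_len`, allocate `*buf_len + 1` bytes and retry.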
2018-05-24tools/bpftool: add perf subcommandYonghong Song
The new command "bpftool perf [show | list]" will traverse all processes under /proc, and if any fd is associated with a perf event, it will print out related perf event information. Documentation is also added.

Below is an example to show the results using bcc commands. Running the following 4 bcc commands:
  kprobe:     trace.py '__x64_sys_nanosleep'
  kretprobe:  trace.py 'r::__x64_sys_nanosleep'
  tracepoint: trace.py 't:syscalls:sys_enter_nanosleep'
  uprobe:     trace.py 'p:/home/yhs/a.out:main'

The bpftool command line and result:

  $ bpftool perf
  pid 21711  fd 5: prog_id 5  kprobe  func __x64_sys_write  offset 0
  pid 21765  fd 5: prog_id 7  kretprobe  func __x64_sys_nanosleep  offset 0
  pid 21767  fd 5: prog_id 8  tracepoint  sys_enter_nanosleep
  pid 21800  fd 5: prog_id 9  uprobe  filename /home/yhs/a.out  offset 1159

  $ bpftool -j perf
  [{"pid":21711,"fd":5,"prog_id":5,"fd_type":"kprobe","func":"__x64_sys_write","offset":0}, \
   {"pid":21765,"fd":5,"prog_id":7,"fd_type":"kretprobe","func":"__x64_sys_nanosleep","offset":0}, \
   {"pid":21767,"fd":5,"prog_id":8,"fd_type":"tracepoint","tracepoint":"sys_enter_nanosleep"}, \
   {"pid":21800,"fd":5,"prog_id":9,"fd_type":"uprobe","filename":"/home/yhs/a.out","offset":1159}]

  $ bpftool prog
  5: kprobe  name probe___x64_sys  tag e495a0c82f2c7a8d  gpl
      loaded_at 2018-05-15T04:46:37-0700  uid 0
      xlated 200B  not jited  memlock 4096B  map_ids 4
  7: kprobe  name probe___x64_sys  tag f2fdee479a503abf  gpl
      loaded_at 2018-05-15T04:48:32-0700  uid 0
      xlated 200B  not jited  memlock 4096B  map_ids 7
  8: tracepoint  name tracepoint__sys  tag 5390badef2395fcf  gpl
      loaded_at 2018-05-15T04:48:48-0700  uid 0
      xlated 200B  not jited  memlock 4096B  map_ids 8
  9: kprobe  name probe_main_1  tag 0a87bdc2e2953b6d  gpl
      loaded_at 2018-05-15T04:49:52-0700  uid 0
      xlated 200B  not jited  memlock 4096B  map_ids 9

  $ ps ax | grep "python ./trace.py"
  21711 pts/0    T      0:03 python ./trace.py __x64_sys_write
  21765 pts/0    S+     0:00 python ./trace.py r::__x64_sys_nanosleep
  21767 pts/2    S+     0:00 python ./trace.py t:syscalls:sys_enter_nanosleep
  21800 pts/3    S+     0:00 python ./trace.py p:/home/yhs/a.out:main
  22374 pts/1    S+     0:00 grep --color=auto python ./trace.py

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progsYonghong Song
The new tests are added to query perf_event information for raw_tracepoint and tracepoint attachment. For tracepoint, both syscalls and non-syscalls tracepoints are queried as they are treated slightly differently inside the kernel. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERYYonghong Song
This is mostly to test kprobe/uprobe which needs kernel headers. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24tools/bpf: add ksym_get_addr() in trace_helpersYonghong Song
Given a kernel function name, ksym_get_addr() will return the kernel address for this function, or 0 if it cannot find this function name in /proc/kallsyms. This function will be used later when a kernel address is used to initiate a kprobe perf event. Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
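A lookup of this kind can be sketched as a scan over kallsyms-formatted text ("address type name" per line). This is not the trace_helpers implementation: `ksym_lookup()` is a hypothetical helper that takes an already opened stream, so it can be pointed at /proc/kallsyms or at any file in the same format, and its return type is chosen here for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan a kallsyms-formatted stream for `name`.
 * Returns the symbol's address, or 0 if the name is not found.
 */
unsigned long long ksym_lookup(FILE *f, const char *name)
{
	char line[512], sym[256], type;
	unsigned long long addr;

	while (fgets(line, sizeof(line), f)) {
		/* Each line looks like: "<hex address> <type char> <symbol>" */
		if (sscanf(line, "%llx %c %255s", &addr, &type, sym) != 3)
			continue;
		if (strcmp(sym, name) == 0)
			return addr;
	}
	return 0;
}
```

Note that unprivileged reads of /proc/kallsyms commonly show zeroed addresses, in which case a found symbol and a missing one both yield 0 with this sketch.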
2018-05-24tools/bpf: sync kernel header bpf.h and add bpf_task_fd_query in libbpfYonghong Song
Sync kernel header bpf.h to tools/include/uapi/linux/bpf.h and implement bpf_task_fd_query() in libbpf. The test programs in samples/bpf and tools/testing/selftests/bpf, and later bpftool will use this libbpf function to query kernel. Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24bpf: introduce bpf subcommand BPF_TASK_FD_QUERYYonghong Song
Currently, suppose a userspace application has loaded a bpf program and attached it to a tracepoint/kprobe/uprobe, and a bpf introspection tool, e.g., bpftool, wants to show which bpf program is attached to which tracepoint/kprobe/uprobe. Such attachment information will be really useful to understand the overall bpf deployment in the system.

There is a name field (16 bytes) for each program, which could be used to encode the attachment point. There are some drawbacks to this approach. First, a bpftool user (e.g., an admin) may not really understand the association between the name and the attachment point. Second, if one program is attached to multiple places, encoding a proper name which can imply all these attachments becomes difficult.

This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY. Given a pid and fd, if the <pid, fd> is associated with a tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return
   . prog_id
   . tracepoint name, or
   . k[ret]probe funcname + offset or kernel addr, or
   . u[ret]probe filename + offset
to the userspace. The user can use "bpftool prog" to find more information about the bpf program itself with prog_id.

Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24perf/core: add perf_get_event() to return perf_event given a struct fileYonghong Song
A new extern function, perf_get_event(), is added to return a perf event given a struct file. This function will be used in later patches. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-24net/mlx5e: Receive buffer support for DCBXHuy Nguyen
Add dcbnl's set/get buffer configuration callbacks, which allow the user to set/get the buffer size configuration and the priority to buffer mapping. By default, firmware controls the receive buffer configuration and the priority to buffer mapping based on the changes in pfc settings. When the set buffer callback is triggered, the buffer configuration changes to manual mode. Manual mode means the mlx5 driver will adjust the buffer configuration accordingly based on the changes in pfc settings. The ConnectX buffer stride is 128 bytes. If the buffer size is not a multiple of 128, the buffer size will be rounded down to the nearest multiple of 128. Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
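The rounding rule in the last two sentences can be written out as a one-line helper. This is a sketch only: `BUF_STRIDE` and `round_down_to_stride()` are names chosen here, not identifiers from the mlx5 driver.

```c
#define BUF_STRIDE 128u	/* buffer stride in bytes, as stated in the commit message */

/* Round a requested buffer size down to the nearest multiple of the stride. */
unsigned int round_down_to_stride(unsigned int size)
{
	return size - (size % BUF_STRIDE);
}
```

A request for 1000 bytes is therefore stored as 896 bytes, while a size that is already a multiple of 128 is left unchanged.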
2018-05-24net/mlx5e: Receive buffer configurationHuy Nguyen
Add APIs for buffer configuration based on the changes in pfc configuration, cable len, buffer size configuration, and priority to buffer mapping.

Note that the xoff formula is as below:
  xoff = (301 + 2.16 * len [m]) * speed [Gbps] + 2.72 * MTU [B]
  xoff_threshold = buffer_size - xoff
  xon_threshold = xoff_threshold - MTU

Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
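As a worked example of the formula above, the sketch below evaluates it in fixed-point integer arithmetic (hundredths), reading the unbalanced parenthesis in the message as (301 + 2.16 * len) * speed + 2.72 * MTU. `struct buf_thresholds` and `calc_thresholds()` are hypothetical names, and rejecting a buffer too small to hold xoff plus one MTU is an assumption made here to avoid unsigned underflow, not a rule taken from the commit message.

```c
#include <stdint.h>

struct buf_thresholds {
	uint32_t xoff;
	uint32_t xoff_threshold;
	uint32_t xon_threshold;
};

/*
 * Evaluate the xoff formula with units as given in the commit message:
 * cable length in metres, speed in Gbps, MTU and buffer size in bytes.
 * Returns 0 on success, -1 if the buffer cannot hold xoff plus one MTU.
 */
int calc_thresholds(uint32_t cable_len_m, uint32_t speed_gbps, uint32_t mtu,
		    uint32_t buffer_size, struct buf_thresholds *out)
{
	/* Work in hundredths so 2.16 and 2.72 become the integers 216 and 272. */
	uint64_t xoff = ((30100ULL + 216ULL * cable_len_m) * speed_gbps +
			 272ULL * mtu) / 100;

	if (xoff + mtu > buffer_size)
		return -1;	/* assumption: reject rather than underflow */

	out->xoff = (uint32_t)xoff;
	out->xoff_threshold = buffer_size - out->xoff;
	out->xon_threshold = out->xoff_threshold - mtu;
	return 0;
}
```

For a 7 m cable at 100 Gbps with a 1500-byte MTU, xoff = (301 + 15.12) * 100 + 4080 = 35692. With an 87296-byte buffer this gives xoff_threshold = 51604 and xon_threshold = 50104.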
2018-05-24net/mlx5: PPTB and PBMC register firmware command supportHuy Nguyen
Add firmware command interface to read and write PPTB and PBMC registers. The PPTB register enables mapping a priority to a specific receive buffer. The PBMC register enables changing the receive buffer's configuration, such as buffer size, xon/xoff thresholds, the buffer's lossy property and the buffer's shared property. Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-24net/mlx5: Add pbmc and pptb in the port_access_reg_cap_maskHuy Nguyen
Add pbmc and pptb in the port_access_reg_cap_mask. These two bits determine if device supports receive buffer configuration. Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-24net/mlx5e: Move port speed code from en_ethtool.c to en/port.cHuy Nguyen
Move the four functions below from en_ethtool.c to en/port.c. These functions are used by both en_ethtool.c and en_main.c. Future code can use these functions without the ethtool link mode dependency.
  u32 mlx5e_port_ptys2speed(u32 eth_proto_oper);
  int mlx5e_port_linkspeed(struct mlx5_core_dev *mdev, u32 *speed);
  int mlx5e_port_max_linkspeed(struct mlx5_core_dev *mdev, u32 *speed);
  u32 mlx5e_port_speed2linkmodes(u32 speed);

Delete the speed field from table mlx5e_build_ptys2ethtool_map. This table only keeps the mapping between the mlx5e link mode and the ethtool link mode. Add a new table mlx5e_link_speed for translation from the mlx5e link mode to the actual speed.

Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-24net/dcb: Add dcbnl buffer attributeHuy Nguyen
In this patch, we add the dcbnl buffer attribute to allow the user to change the NIC's buffer configuration, such as the priority to buffer mapping and the buffer size of an individual buffer.

This attribute combined with the pfc attribute allows an advanced user to fine tune the qos setting for a specific priority queue. For example, the user can give a dedicated buffer to one or more priorities, or give a large buffer to certain priorities.

The dcb buffer configuration will be controlled by lldptool.
  lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6
    maps priorities 0,1,2,3,4,5,6,7 to receive buffer 0,2,5,7,1,2,3,6
  lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0
    sets receive buffer size for buffer 0,1,2,3,4,5,6,7 respectively

After discussion on the mailing list with Jakub, Jiri, Ido and John, we agreed to choose dcbnl over the devlink interface, since this feature is intended to set port attributes which are governed by the netdev instance of that port, whereas the devlink API is more suitable for global ASIC configurations.

We present a use case scenario where the dcbnl buffer attribute configured by an advanced user helps reduce the latency of messages of different sizes.

Scenario description:
On ConnectX-5, we run latency sensitive traffic with small/medium message sizes ranging from 64B to 256KB and bandwidth sensitive traffic with large message sizes of 512KB and 1MB. We group small, medium, and large message sizes into their own pfc enabled priorities as follows.
  Priorities 1 & 2 (64B, 256B and 1KB)
  Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
  Priorities 5 & 6 (512KB and 1MB)

By default, ConnectX-5 maps all pfc enabled priorities to a single lossless buffer with a fixed size of 50% of the total available buffer space. The other 50% is assigned to the lossy buffer. Using the dcbnl buffer attribute, we create three equal size lossless buffers. Each buffer has 25% of the total available buffer space. Thus, the lossy buffer size reduces to 25%. Priority to lossless buffer mappings are set as follows.
  Priorities 1 & 2 on lossless buffer #1
  Priorities 3 & 4 on lossless buffer #2
  Priorities 5 & 6 on lossless buffer #3

We observe improvements in latency for small and medium message sizes as follows. Please note that the bandwidth performance for large message sizes is reduced, but the total bandwidth remains the same.
  256B message size (42% latency reduction)
  4K message size (21% latency reduction)
  64K message size (16% latency reduction)

CC: Ido Schimmel <idosch@idosch.org> CC: Jakub Kicinski <jakub.kicinski@netronome.com> CC: Jiri Pirko <jiri@resnulli.us> CC: Or Gerlitz <gerlitz.or@gmail.com> CC: Parav Pandit <parav@mellanox.com> CC: Aron Silverton <aron.silverton@oracle.com> Signed-off-by: Huy Nguyen <huyn@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-24net: phy: replace bool members in struct phy_device with bit-fieldsHeiner Kallweit
In struct phy_device we have a number of flags being defined as type bool. Similar to e.g. struct pci_dev we can save some space by using bit-fields. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
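The space saving can be seen by comparing two small structs. The field names below are illustrative and are not claimed to match struct phy_device; exact sizes are implementation-defined, so the comment describes common ABIs only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Eight flags stored as separate bools: typically one byte each. */
struct flags_as_bool {
	bool f0, f1, f2, f3, f4, f5, f6, f7;
};

/* The same eight flags as 1-bit fields sharing one unsigned int. */
struct flags_as_bits {
	unsigned f0:1, f1:1, f2:1, f3:1, f4:1, f5:1, f6:1, f7:1;
};

/* Bytes saved by the bit-field layout on this compiler/ABI. */
size_t flags_saving(void)
{
	return sizeof(struct flags_as_bool) - sizeof(struct flags_as_bits);
}
```

On common ABIs the first struct occupies 8 bytes and the second 4. One trade-off is that the address of a bit-field cannot be taken, so code that passes a pointer to such a flag has to be adapted.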
2018-05-24Merge tag 'batadv-next-for-davem-20180524' of ↵David S. Miller
git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:
 - bump version strings, by Simon Wunderlich
 - Disable batman-adv debugfs by default, by Sven Eckelmann
 - Improve handling mesh nodes with multicast optimizations disabled, by Linus Luessing
 - Avoid bool in structs, by Sven Eckelmann
 - Allocate less memory when debugfs is disabled, by Sven Eckelmann
 - Fix batadv_interface_tx return data type, by Luc Van Oostenryck
 - improve link speed handling for virtual interfaces, by Marek Lindner
 - Enable BATMAN V algorithm by default, by Marek Lindner
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24bpfilter: don't pass O_CREAT when opening console for debugJakub Kicinski
Passing O_CREAT (00000100) to open means we should also pass file mode as the third parameter. Creating /dev/console as a regular file may not be helpful anyway, so simply drop the flag when opening debug_fd. Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24bpfilter: fix build dependencyAlexei Starovoitov
BPFILTER could have been enabled without INET causing this build error: ERROR: "bpfilter_process_sockopt" [net/bpfilter/bpfilter.ko] undefined! Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module") Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-24Merge branch 'bpf-ipv6-seg6-bpf-action'Daniel Borkmann
Mathieu Xhonneux says:

====================
As of Linux 4.14, it is possible to define advanced local processing for IPv6 packets with a Segment Routing Header through the seg6local LWT infrastructure. This LWT implements the network programming principles defined in the IETF "SRv6 Network Programming" draft.

The implemented operations are generic, and it would be very interesting to be able to implement user-specific seg6local actions, without having to modify the kernel directly. To do so, this patchset adds an End.BPF action to seg6local, powered by some specific Segment Routing-related helpers, which provide SR functionalities that can be applied on the packet. This BPF hook would then allow implementing specific actions at native kernel speed, such as OAM features, advanced SR SDN policies, SRv6 actions like Segment Routing Header (SRH) encapsulation depending on the content of the packet, etc.

This patchset is divided into 6 patches, whose main features are:
 - A new seg6local action End.BPF with the corresponding new BPF program type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such an attached BPF program can be passed to the LWT seg6local through netlink, the same way as the LWT BPF hook operates.
 - 3 new BPF helpers for the seg6local BPF hook, allowing to edit/grow/shrink a SRH and apply on a packet some of the generic SRv6 actions.
 - 1 new BPF helper for the LWT BPF IN hook, allowing to add a SRH through encapsulation (via IPv6 encapsulation or inlining if the packet already contains an IPv6 header).

As this patchset adds a new LWT BPF hook, I took into account the result of the discussions when the LWT BPF infrastructure got merged. Hence, the seg6local BPF hook doesn't allow write access to skb->data directly; only the SRH can be modified through specific helpers, which ensures that the integrity of the packet is maintained. More details are available in the related patch messages.

The performance of this BPF hook has been assessed with the BPF JIT enabled on an Intel Xeon X3440 processor with 4 cores and 8 threads clocked at 2.53 GHz. No throughput losses are noted with the seg6local BPF hook when the BPF program does nothing (440kpps). Adding an 8-byte TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes) drops the throughput to 410kpps, and inlining a SRH via bpf_lwt_seg6_action drops the throughput to 420kpps. All throughputs are stable.

Changelog:
v2: move the SRH integrity state from skb->cb to a per-cpu buffer
v3: - document helpers in man-page style
    - fix kbuild bugs
    - un-break BPF LWT out hook
    - bpf_push_seg6_encap is now static
    - preempt_enable is now called when the packet is dropped in input_action_end_bpf
v4: fix kbuild bugs when CONFIG_IPV6=m
v5: fix kbuild sparse warnings when CONFIG_IPV6=m
v6: fix skb pointers-related bugs in helpers
v7: - fix memory leak in error path of End.BPF setup
    - add freeing of BPF data in seg6_local_destroy_state
    - new enums SEG6_LOCAL_BPF_* instead of re-using ones of lwt bpf for netlink nested bpf attributes
    - SEG6_LOCAL_BPF_PROG attr now contains prog->aux->id when dumping state
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24selftests/bpf: test for seg6local End.BPF actionMathieu Xhonneux
Add a new test for the seg6local End.BPF action. The following helpers are also tested: - bpf_lwt_push_encap within the LWT BPF IN hook - bpf_lwt_seg6_action - bpf_lwt_seg6_adjust_srh - bpf_lwt_seg6_store_bytes A chain of End.BPF actions is built. The SRH is injected through a LWT BPF IN hook before entering this chain. Each End.BPF action validates the previous one, otherwise the packet is dropped. The test succeeds if the last node in the chain receives the packet and the UDP datagram contained can be retrieved from userspace. Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24ipv6: sr: Add seg6local action End.BPFMathieu Xhonneux
This patch adds the End.BPF action to the LWT seg6local infrastructure. This action works like any other seg6local End action, meaning that an IPv6 header with SRH is needed, whose DA has to be equal to the SID of the action. It will also advance the SRH to the next segment; the BPF program does not have to take care of this.

Since the BPF program must not be a source of instability in the kernel, it is important to ensure that the integrity of the packet is maintained before yielding it back to the IPv6 layer. The hook hence keeps track of whether the SRH has been altered through the helpers, and re-validates its content if needed with seg6_validate_srh. The state kept for validation is stored in a per-CPU buffer. The BPF program is not allowed to directly write into the packet, and only some fields of the SRH can be altered through the helper bpf_lwt_seg6_store_bytes.

Performance profiling has shown that the SRH re-validation does not induce a significant overhead. If the altered SRH is deemed invalid, the packet is dropped.

This validation is also done before executing any action through bpf_lwt_seg6_action, and will not be performed again if the SRH is not modified after calling the action.

The BPF program may return 3 types of return codes:
 - BPF_OK: the End.BPF action will look up the next destination through seg6_lookup_nexthop.
 - BPF_REDIRECT: if an action has been executed through the bpf_lwt_seg6_action helper, the BPF program should return this value, as the skb's destination is already set and the default lookup should not be performed.
 - BPF_DROP: the packet will be dropped.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24bpf: Split lwt inout verifier structuresMathieu Xhonneux
The new bpf_lwt_push_encap helper should only be accessible within the LWT BPF IN hook, and not the OUT one, as this may lead to a skb under panic. At the moment, both LWT BPF IN and OUT share the same list of helpers, whose calls are authorized by the verifier. This patch separates the verifier ops for the IN and OUT hooks, and allows the IN hook to call the bpf_lwt_push_encap helper. This patch is also the occasion to put all lwt_*_func_proto functions together for clarity. At the moment, socks_op_func_proto is in the middle of lwt_inout_func_proto and lwt_xmit_func_proto. Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 bpf: Add IPv6 Segment Routing helpers (Mathieu Xhonneux)

The BPF seg6local hook should be powerful enough to let users implement most of the use cases one could think of. We determined that the following actions should be possible on an SRv6 packet, requiring three specific helpers:

- bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
- bpf_lwt_seg6_adjust_srh: Grow or shrink an SRH (to add/delete TLVs)
- bpf_lwt_seg6_action: Apply some SRv6 network programming actions (specifically End.X, End.T, End.B6 and End.B6.Encap)

The specifications of these helpers are provided in the patch (see include/uapi/linux/bpf.h).

The non-sensitive fields of the SRH are the flags, the tag and the TLVs. The other fields cannot be modified, to maintain the SRH integrity. Flags, tag and TLVs can easily be modified because their validity can be checked afterwards via seg6_validate_srh. Modifying the segments directly is not allowed. To add segments on the path, stack a new SRH using the End.B6 action via bpf_lwt_seg6_action.

Growing, shrinking or editing TLVs via the helpers will flag the SRH as invalid, and it will have to be re-validated before re-entering the IPv6 layer. This flag is stored in a per-CPU buffer, along with the current header length in bytes.

Storing the SRH len in bytes in the control block is mandatory when using bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH len rounded to 8 bytes (a padding TLV can be inserted to ensure the 8-byte boundary). When adding/deleting TLVs within the BPF program, the SRH may temporarily be in an invalid state where its length cannot be rounded to 8 bytes without remainder, hence the need to store the length in bytes separately. The caller of the BPF program can then ensure that the SRH's final length is valid using this value. A final SRH modified by a BPF program that does not respect the 8-byte boundary will be considered invalid and discarded.

Finally, a fourth helper is provided, bpf_lwt_push_encap, which is available from the LWT BPF IN hook, but not from the seg6local BPF one. This helper encapsulates a Segment Routing Header (either with a new outer IPv6 header, or by inlining it directly in the existing IPv6 header) into a non-SRv6 packet. This helper is required to make it possible to dynamically encapsulate an SRH for non-SRv6 packets, as the BPF seg6local hook only works on traffic already containing an SRH. This is the BPF equivalent of the seg6 LWT infrastructure, which achieves the same purpose but with a static SRH per route.

These helpers require CONFIG_IPV6=y (and not =m).

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
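The length bookkeeping described in this commit message can be illustrated with a small sketch. It is not the kernel code: the function names are hypothetical, and the only fact relied on is that the IPv6 Hdr Ext Len field counts 8-octet units, not including the first 8 octets of the header.

```python
def srh_padding_needed(len_bytes: int) -> int:
    """Bytes of padding required to reach the next 8-byte boundary."""
    return (-len_bytes) % 8


def srh_hdr_ext_len(len_bytes: int) -> int:
    """Hdr Ext Len value for an SRH that is len_bytes long.

    Raises ValueError if the length is not a positive multiple of 8,
    which is the state in which the SRH would be treated as invalid.
    """
    if len_bytes < 8 or len_bytes % 8 != 0:
        raise ValueError("SRH length must be a positive multiple of 8 bytes")
    # The field excludes the first 8 octets of the extension header.
    return len_bytes // 8 - 1
```

Tracking the length in bytes separately, as the commit message explains, is what lets an intermediate state such as 27 bytes exist until padding brings it back to 32.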
2018-05-24 ipv6: sr: export function lookup_nexthop (Mathieu Xhonneux)

The function lookup_nexthop is essential to implement most of the seg6local actions. As we want to provide a BPF helper that applies some of these actions to the packet being processed, the helper should be able to call this function, hence the need to make it public.

Moreover, if an argument is incorrect or if the next hop cannot be found, the BPF helper should return an error so that the BPF program can adapt its processing of the packet (return an error, properly force the drop, ...). This patch hence makes this function return dst->error to indicate a possible error.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 ipv6: sr: make seg6.h includable without IPv6 (Mathieu Xhonneux)

include/net/seg6.h cannot be included in a source file if CONFIG_IPV6 is not enabled:

   include/net/seg6.h: In function 'seg6_pernet':
>> include/net/seg6.h:52:14: error: 'struct net' has no member named 'ipv6'; did you mean 'ipv4'?
     return net->ipv6.seg6_data;
                 ^~~~
                 ipv4

This commit makes seg6_pernet return NULL if IPv6 is not compiled, hence allowing seg6.h to be included regardless of the configuration.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 Merge branch 'bpf-multi-prog-improvements' (Daniel Borkmann)

Sandipan Das says:

====================
[1] Support for bpf-to-bpf function calls in the powerpc64 JIT compiler.
[2] Provide a way for resolving function calls because of the way JITed images are allocated in powerpc64.
[3] Fix to get JITed instruction dumps for multi-function programs from the bpf system call.
[4] Fix for bpftool to show delimited multi-function JITed image dumps.

v4:
- Incorporate review comments from Jakub.
- Fix JSON output for bpftool.

v3:
- Change base tree tag to bpf-next.
- Incorporate review comments from Alexei, Daniel and Jakub.
- Make sure that the JITed image does not grow or shrink after the last pass due to the way the instruction sequence used to load a callee's address may be optimized.
- Make additional changes to the bpf system call and bpftool to make multi-function JITed dumps easier to correlate.

v2:
- Incorporate review comments from Jakub.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 tools: bpftool: add delimiters to multi-function JITed dumps (Sandipan Das)

This splits up the contiguous JITed dump obtained via the bpf system call into more relatable chunks for each function in the program. If the kernel symbols corresponding to these are known, they are printed in the header for each JIT image dump; otherwise, the masked start address is printed.

Before applying this patch:

  # bpftool prog dump jited id 1
     0: push %rbp
     1: mov %rsp,%rbp
    ...
    70: leaveq
    71: retq
    72: push %rbp
    73: mov %rsp,%rbp
    ...
    dd: leaveq
    de: retq

  # bpftool -p prog dump jited id 1
  [{ "pc": "0x0", "operation": "push", "operands": ["%rbp" ] },{
   ...
   },{ "pc": "0x71", "operation": "retq", "operands": [null ] },{
   "pc": "0x72", "operation": "push", "operands": ["%rbp" ] },{
   ...
   },{ "pc": "0xde", "operation": "retq", "operands": [null ] } ]

After applying this patch:

  # echo 0 > /proc/sys/net/core/bpf_jit_kallsyms
  # bpftool prog dump jited id 1
  0xffffffffc02c7000:
     0: push %rbp
     1: mov %rsp,%rbp
    ...
    70: leaveq
    71: retq

  0xffffffffc02cf000:
     0: push %rbp
     1: mov %rsp,%rbp
    ...
    6b: leaveq
    6c: retq

  # bpftool -p prog dump jited id 1
  [{ "name": "0xffffffffc02c7000", "insns": [{
     "pc": "0x0", "operation": "push", "operands": ["%rbp" ] },{
     ...
     },{ "pc": "0x71", "operation": "retq", "operands": [null ] } ] },{
   "name": "0xffffffffc02cf000", "insns": [{
     "pc": "0x0", "operation": "push", "operands": ["%rbp" ] },{
     ...
     },{ "pc": "0x6c", "operation": "retq", "operands": [null ] } ] } ]

  # echo 1 > /proc/sys/net/core/bpf_jit_kallsyms
  # bpftool prog dump jited id 1
  bpf_prog_b811aab41a39ad3d_foo:
     0: push %rbp
     1: mov %rsp,%rbp
    ...
    70: leaveq
    71: retq

  bpf_prog_cf418ac8b67bebd9_F:
     0: push %rbp
     1: mov %rsp,%rbp
    ...
    6b: leaveq
    6c: retq

  # bpftool -p prog dump jited id 1
  [{ "name": "bpf_prog_b811aab41a39ad3d_foo", "insns": [{
     "pc": "0x0", "operation": "push", "operands": ["%rbp" ] },{
     ...
     },{ "pc": "0x71", "operation": "retq", "operands": [null ] } ] },{
   "name": "bpf_prog_cf418ac8b67bebd9_F", "insns": [{
     "pc": "0x0", "operation": "push", "operands": ["%rbp" ] },{
     ...
     },{ "pc": "0x6c", "operation": "retq", "operands": [null ] } ] } ]

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
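The splitting step can be sketched roughly as below. This is an illustration in Python rather than bpftool's C code; the function name and the dictionary layout are hypothetical, and the per-function lengths and header names are assumed to have been obtained already from the program info.

```python
from typing import Dict, List


def split_jited_dump(image: bytes, func_lens: List[int],
                     names: List[str]) -> List[Dict[str, object]]:
    """Cut one contiguous JITed image into one named chunk per function."""
    if len(func_lens) != len(names):
        raise ValueError("need exactly one name per function length")
    if sum(func_lens) != len(image):
        raise ValueError("function lengths do not cover the whole image")
    chunks = []
    start = 0
    for name, length in zip(names, func_lens):
        # Each chunk is disassembled on its own, so its offsets restart at 0.
        chunks.append({"name": name, "image": image[start:start + length]})
        start += length
    return chunks
```

For the x86-64 dump shown in the commit message, lengths of 0x72 and 0x6d bytes would cover the 0xdf-byte image, which matches the second function's offsets restarting at 0 and ending at 6c.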
2018-05-24 tools: bpf: sync bpf uapi header (Sandipan Das)

Sync the bpf.h uapi header with tools so that struct bpf_prog_info has the two new fields for passing on the JITed image lengths of each function in a multi-function program.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 bpf: get JITed image lengths of functions via syscall (Sandipan Das)

This adds two new fields to struct bpf_prog_info. For multi-function programs, these fields can be used to pass a list of the JITed image lengths of each function for a given program to userspace using the bpf system call with the BPF_OBJ_GET_INFO_BY_FD command.

This can be used by userspace applications like bpftool to split up the contiguous JITed dump, also obtained via the system call, into more relatable chunks corresponding to each function.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 bpf: fix multi-function JITed dump obtained via syscall (Sandipan Das)

Currently, for multi-function programs, we cannot get the JITed instructions using the bpf system call's BPF_OBJ_GET_INFO_BY_FD command. Because of this, userspace tools such as bpftool fail to identify a multi-function program as being JITed or not.

With the JIT enabled and the test program running, this can be verified as follows:

  # cat /proc/sys/net/core/bpf_jit_enable
  1

Before applying this patch:

  # bpftool prog list
  1: kprobe name foo tag b811aab41a39ad3d gpl
          loaded_at 2018-05-16T11:43:38+0530 uid 0
          xlated 216B not jited memlock 65536B
  ...

  # bpftool prog dump jited id 1
  no instructions returned

After applying this patch:

  # bpftool prog list
  1: kprobe name foo tag b811aab41a39ad3d gpl
          loaded_at 2018-05-16T12:13:01+0530 uid 0
          xlated 216B jited 308B memlock 65536B
  ...

  # bpftool prog dump jited id 1
     0: nop
     4: nop
     8: mflr r0
     c: std r0,16(r1)
    10: stdu r1,-112(r1)
    14: std r31,104(r1)
    18: addi r31,r1,48
    1c: li r3,10
  ...

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 tools: bpftool: resolve calls without using imm field (Sandipan Das)

Currently, we resolve the callee's address for a JITed function call by using the imm field of the call instruction as an offset from __bpf_call_base. If bpf_jit_kallsyms is enabled, we further use this address to get the callee's kernel symbol's name.

For some architectures, such as powerpc64, the imm field is not large enough to hold this offset. So, instead of assigning this offset to the imm field, the verifier now assigns the subprog id. Also, a list of kernel symbol addresses for all the JITed functions is provided in the program info. We now use the imm field as an index into this list to look up a callee's symbol's address and resolve its name.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
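The two resolution schemes can be contrasted in a short sketch. It is written in Python for illustration only and does not mirror bpftool's C implementation; all function names are hypothetical, and the kallsyms table is assumed to be a plain address-to-name mapping.

```python
from typing import Dict, List, Optional

U64_MASK = (1 << 64) - 1


def callee_addr_from_offset(call_base: int, imm: int) -> int:
    """Old scheme: imm is a signed offset from __bpf_call_base."""
    return (call_base + imm) & U64_MASK


def callee_addr_from_index(ksym_addrs: List[int], imm: int) -> int:
    """New scheme: imm is a subprog id indexing the address list."""
    if not 0 <= imm < len(ksym_addrs):
        raise IndexError("imm is not a valid subprog id")
    return ksym_addrs[imm]


def symbol_name(addr: int, kallsyms: Dict[int, str]) -> Optional[str]:
    """Return the symbol name for addr, or None when it is unknown."""
    return kallsyms.get(addr)
```

The index-based lookup does not depend on how far the callee lies from __bpf_call_base, which is why it also works when the offset would not fit in the imm field.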
2018-05-24 tools: bpf: sync bpf uapi header (Sandipan Das)

Sync the bpf.h uapi header with tools so that struct bpf_prog_info has the two new fields for passing on the addresses of the kernel symbols corresponding to each function in a program.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 bpf: get kernel symbol addresses via syscall (Sandipan Das)

This adds two new fields to struct bpf_prog_info. For multi-function programs, these fields can be used to pass a list of kernel symbol addresses for all functions in a given program to userspace using the bpf system call with the BPF_OBJ_GET_INFO_BY_FD command.

When bpf_jit_kallsyms is enabled, we can get the address of the corresponding kernel symbol for a callee function and resolve the symbol's name. The address is determined by adding the value of the call instruction's imm field to __bpf_call_base. This offset gets assigned to the imm field by the verifier.

For some architectures, such as powerpc64, the imm field is not large enough to hold this offset. We resolve this by:

[1] Assigning the subprog id to the imm field of a call instruction in the verifier instead of the offset of the callee's symbol's address from __bpf_call_base.

[2] Determining the address of a callee's corresponding symbol by using the imm field as an index into the list of kernel symbol addresses now available from the program info.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 bpf: powerpc64: add JIT support for multi-function programs (Sandipan Das)

This adds support for bpf-to-bpf function calls in the powerpc64 JIT compiler.

The JIT compiler converts the bpf call instructions to native branch instructions. After a round of the usual passes, the start addresses of the JITed images for the callee functions are known. Finally, to fix up the branch target addresses, we need to perform an extra pass.

Because of the address range in which JITed images are allocated on powerpc64, the offsets of the start addresses of these images from __bpf_call_base are as large as 64 bits. So, for a function call, we cannot use the imm field of the instruction to determine the callee's address. Instead, we use the alternative method of getting it from the list of function addresses in the auxiliary data of the caller by using the off field as an index.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-24 bpf: powerpc64: pad function address loads with NOPs (Sandipan Das)

For multi-function programs, loading the address of a callee function into a register requires emitting instructions whose count varies from one to five depending on the nature of the address.

Since we learn the callee's address only before the extra pass, the number of instructions required to load this address may differ from what was previously generated. This can make the JITed image grow or shrink.

To avoid this, we should generate a constant five-instruction sequence when loading function addresses by padding the optimized load sequence with NOPs.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
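The padding idea can be sketched as follows. This is a simplified Python illustration, not the JIT's C code: the function name is hypothetical and instructions are modelled as 32-bit integers. The NOP encoding 0x60000000 is the PowerPC `ori 0,0,0` form.

```python
from typing import List

PPC_NOP = 0x60000000  # ori 0,0,0
LOAD_SEQ_LEN = 5      # fixed length of the address load sequence


def pad_load_sequence(insns: List[int]) -> List[int]:
    """Pad an optimized address load (1 to 5 instructions) to 5 with NOPs."""
    if not 1 <= len(insns) <= LOAD_SEQ_LEN:
        raise ValueError("load sequence must have between 1 and 5 instructions")
    return insns + [PPC_NOP] * (LOAD_SEQ_LEN - len(insns))
```

Because every load occupies the same number of instruction slots regardless of the address value, the image size computed in the earlier passes stays valid in the extra pass.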
2018-05-24 bpf: support 64-bit offsets for bpf function calls (Sandipan Das)

The imm field of a bpf instruction is a signed 32-bit integer. For JITed bpf-to-bpf function calls, it holds the offset of the start address of the callee's JITed image from __bpf_call_base.

For some architectures, such as powerpc64, this offset may be as large as 64 bits and cannot be accommodated in the imm field without truncation.

We resolve this by:

[1] Additionally using the auxiliary data of each function to keep a list of start addresses of the JITed images for all functions determined by the verifier.

[2] Retaining the subprog id inside the off field of the call instructions and using it to index into the list mentioned above and look up the callee's address.

To make sure that the existing JIT compilers continue to work without requiring changes, we keep the imm field as it is.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
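The truncation problem and the index-based alternative can be shown with a small sketch. It is a Python illustration under stated assumptions, not kernel code: the function names are hypothetical, and the address list stands in for the per-function list kept in the auxiliary data.

```python
from typing import List

S32_MIN = -(1 << 31)
S32_MAX = (1 << 31) - 1


def fits_in_imm(offset: int) -> bool:
    """True if the offset is representable as a signed 32-bit imm."""
    return S32_MIN <= offset <= S32_MAX


def truncate_to_s32(offset: int) -> int:
    """What a signed 32-bit field would hold after truncation."""
    low = offset & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low


def callee_addr_from_aux(func_addrs: List[int], off: int) -> int:
    """Look up the callee's start address using off as a subprog index."""
    if not 0 <= off < len(func_addrs):
        raise IndexError("off is not a valid subprog id")
    return func_addrs[off]
```

An offset such as 1 << 40 fails `fits_in_imm` and truncates to 0, so the callee could not be recovered from imm alone, while the lookup through the off field returns the full address.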