summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2019-07-28tcp: fix tcp_set_congestion_control() use from bpf hookEric Dumazet
[ Upstream commit 8d650cdedaabb33e85e9b7c517c0c71fcecc1de9 ] Neal reported incorrect use of ns_capable() from bpf hook. bpf_setsockopt(...TCP_CONGESTION...) -> tcp_set_congestion_control() -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) -> ns_capable_common() -> current_cred() -> rcu_dereference_protected(current->cred, 1) Accessing 'current' in bpf context makes no sense, since packets are processed from softirq context. As Neal stated : The capability check in tcp_set_congestion_control() was written assuming a system call context, and then was reused from a BPF call site. The fix is to add a new parameter to tcp_set_congestion_control(), so that the ns_capable() call is only performed under the right context. Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lawrence Brakmo <brakmo@fb.com> Reported-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-28net: neigh: fix multiple neigh timer schedulingLorenzo Bianconi
[ Upstream commit 071c37983d99da07797294ea78e9da1a6e287144 ] Neigh timer can be scheduled multiple times from userspace adding multiple neigh entries and forcing the neigh timer scheduling passing NTF_USE in the netlink requests. This will result in a refcount leak and in the following dump stack: [ 32.465295] NEIGH: BUG, double timer add, state is 8 [ 32.465308] CPU: 0 PID: 416 Comm: double_timer_ad Not tainted 5.2.0+ #65 [ 32.465311] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014 [ 32.465313] Call Trace: [ 32.465318] dump_stack+0x7c/0xc0 [ 32.465323] __neigh_event_send+0x20c/0x880 [ 32.465326] ? ___neigh_create+0x846/0xfb0 [ 32.465329] ? neigh_lookup+0x2a9/0x410 [ 32.465332] ? neightbl_fill_info.constprop.0+0x800/0x800 [ 32.465334] neigh_add+0x4f8/0x5e0 [ 32.465337] ? neigh_xmit+0x620/0x620 [ 32.465341] ? find_held_lock+0x85/0xa0 [ 32.465345] rtnetlink_rcv_msg+0x204/0x570 [ 32.465348] ? rtnl_dellink+0x450/0x450 [ 32.465351] ? mark_held_locks+0x90/0x90 [ 32.465354] ? match_held_lock+0x1b/0x230 [ 32.465357] netlink_rcv_skb+0xc4/0x1d0 [ 32.465360] ? rtnl_dellink+0x450/0x450 [ 32.465363] ? netlink_ack+0x420/0x420 [ 32.465366] ? netlink_deliver_tap+0x115/0x560 [ 32.465369] ? __alloc_skb+0xc9/0x2f0 [ 32.465372] netlink_unicast+0x270/0x330 [ 32.465375] ? netlink_attachskb+0x2f0/0x2f0 [ 32.465378] netlink_sendmsg+0x34f/0x5a0 [ 32.465381] ? netlink_unicast+0x330/0x330 [ 32.465385] ? move_addr_to_kernel.part.0+0x20/0x20 [ 32.465388] ? netlink_unicast+0x330/0x330 [ 32.465391] sock_sendmsg+0x91/0xa0 [ 32.465394] ___sys_sendmsg+0x407/0x480 [ 32.465397] ? copy_msghdr_from_user+0x200/0x200 [ 32.465401] ? _raw_spin_unlock_irqrestore+0x37/0x40 [ 32.465404] ? lockdep_hardirqs_on+0x17d/0x250 [ 32.465407] ? __wake_up_common_lock+0xcb/0x110 [ 32.465410] ? __wake_up_common+0x230/0x230 [ 32.465413] ? netlink_bind+0x3e1/0x490 [ 32.465416] ? netlink_setsockopt+0x540/0x540 [ 32.465420] ? __fget_light+0x9c/0xf0 [ 32.465423] ? sockfd_lookup_light+0x8c/0xb0 [ 32.465426] __sys_sendmsg+0xa5/0x110 [ 32.465429] ? __ia32_sys_shutdown+0x30/0x30 [ 32.465432] ? __fd_install+0xe1/0x2c0 [ 32.465435] ? lockdep_hardirqs_off+0xb5/0x100 [ 32.465438] ? mark_held_locks+0x24/0x90 [ 32.465441] ? do_syscall_64+0xf/0x270 [ 32.465444] do_syscall_64+0x63/0x270 [ 32.465448] entry_SYSCALL_64_after_hwframe+0x49/0xbe Fix the issue unscheduling neigh_timer if selected entry is in 'IN_TIMER' receiving a netlink request with NTF_USE flag set Reported-by: Marek Majkowski <marek@cloudflare.com> Fixes: 0c5c2d308906 ("neigh: Allow for user space users of the neighbour table") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-14bpf: sockmap, fix use after free from sleep in psock backlog workqueueJohn Fastabend
[ Upstream commit bd95e678e0f6e18351ecdc147ca819145db9ed7b ] Backlog work for psock (sk_psock_backlog) might sleep while waiting for memory to free up when sending packets. However, while sleeping the socket may be closed and removed from the map by the user space side. This breaks an assumption in sk_stream_wait_memory, which expects the wait queue to be still there when it wakes up resulting in a use-after-free shown below. To fix his mark sendmsg as MSG_DONTWAIT to avoid the sleep altogether. We already set the flag for the sendpage case but we missed the case were sendmsg is used. Sockmap is currently the only user of skb_send_sock_locked() so only the sockmap paths should be impacted. ================================================================== BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70 Write of size 8 at addr ffff888069a0c4e8 by task kworker/0:2/110 CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 Workqueue: events sk_psock_backlog Call Trace: print_address_description+0x6e/0x2b0 ? remove_wait_queue+0x31/0x70 kasan_report+0xfd/0x177 ? remove_wait_queue+0x31/0x70 ? remove_wait_queue+0x31/0x70 remove_wait_queue+0x31/0x70 sk_stream_wait_memory+0x4dd/0x5f0 ? sk_stream_wait_close+0x1b0/0x1b0 ? wait_woken+0xc0/0xc0 ? tcp_current_mss+0xc5/0x110 tcp_sendmsg_locked+0x634/0x15d0 ? tcp_set_state+0x2e0/0x2e0 ? __kasan_slab_free+0x1d1/0x230 ? kmem_cache_free+0x70/0x140 ? sk_psock_backlog+0x40c/0x4b0 ? process_one_work+0x40b/0x660 ? worker_thread+0x82/0x680 ? kthread+0x1b9/0x1e0 ? ret_from_fork+0x1f/0x30 ? check_preempt_curr+0xaf/0x130 ? iov_iter_kvec+0x5f/0x70 ? kernel_sendmsg_locked+0xa0/0xe0 skb_send_sock_locked+0x273/0x3c0 ? skb_splice_bits+0x180/0x180 ? start_thread+0xe0/0xe0 ? update_min_vruntime.constprop.27+0x88/0xc0 sk_psock_backlog+0xb3/0x4b0 ? strscpy+0xbf/0x1e0 process_one_work+0x40b/0x660 worker_thread+0x82/0x680 ? process_one_work+0x660/0x660 kthread+0x1b9/0x1e0 ? __kthread_create_on_node+0x250/0x250 ret_from_fork+0x1f/0x30 Fixes: 20bf50de3028c ("skbuff: Function to send an skbuf on a socket") Reported-by: Jakub Sitnicki <jakub@cloudflare.com> Tested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-03bpf: fix unconnected udp hooksDaniel Borkmann
commit 983695fa676568fc0fe5ddd995c7267aabc24632 upstream. Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently to applications as also stated in original motivation in 7828f20e3779 ("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter two hooks into Cilium to enable host based load-balancing with Kubernetes, I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes typically sets up DNS as a service and is thus subject to load-balancing. Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API is currently insufficient and thus not usable as-is for standard applications shipped with most distros. To break down the issue we ran into with a simple example: # cat /etc/resolv.conf nameserver 147.75.207.207 nameserver 147.75.207.208 For the purpose of a simple test, we set up above IPs as service IPs and transparently redirect traffic to a different DNS backend server for that node: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 The attached BPF program is basically selecting one of the backends if the service IP/port matches on the cgroup hook. DNS breaks here, because the hooks are not transparent enough to applications which have built-in msg_name address checks: # nslookup 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ;; connection timed out; no servers could be reached # dig 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; connection timed out; no servers could be reached For comparison, if none of the service IPs is used, and we tell nslookup to use 8.8.8.8 directly it works just fine, of course: # nslookup 1.1.1.1 8.8.8.8 1.1.1.1.in-addr.arpa name = one.one.one.one. In order to fix this and thus act more transparent to the application, this needs reverse translation on recvmsg() side. A minimal fix for this API is to add similar recvmsg() hooks behind the BPF cgroups static key such that the program can track state and replace the current sockaddr_in{,6} with the original service IP. From BPF side, this basically tracks the service tuple plus socket cookie in an LRU map where the reverse NAT can then be retrieved via map value as one example. Side-note: the BPF cgroups static key should be converted to a per-hook static key in future. Same example after this fix: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 Lookups work fine now: # nslookup 1.1.1.1 1.1.1.1.in-addr.arpa name = one.one.one.one. Authoritative answers can be found from: # dig 1.1.1.1 ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550 ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;1.1.1.1. IN A ;; AUTHORITY SECTION: . 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400 ;; Query time: 17 msec ;; SERVER: 147.75.207.207#53(147.75.207.207) ;; WHEN: Tue May 21 12:59:38 UTC 2019 ;; MSG SIZE rcvd: 111 And from an actual packet level it shows that we're using the back end server when talking via 147.75.207.20{7,8} front end: # tcpdump -i any udp [...] 12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) [...] In order to be flexible and to have same semantics as in sendmsg BPF programs, we only allow return codes in [1,1] range. In the sendmsg case the program is called if msg->msg_name is present which can be the case in both, connected and unconnected UDP. The former only relies on the sockaddr_in{,6} passed via connect(2) if passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar way to call into the BPF program whenever a non-NULL msg->msg_name was passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note that for TCP case, the msg->msg_name is ignored in the regular recvmsg path and therefore not relevant. For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE, the hook is not called. This is intentional as it aligns with the same semantics as in case of TCP cgroup BPF hooks right now. This might be better addressed in future through a different bpf_attach_type such that this case can be distinguished from the regular recvmsg paths, for example. Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03net: remove duplicate fetch in sock_getsockoptJingYi Hou
[ Upstream commit d0bae4a0e3d8c5690a885204d7eb2341a5b4884d ] In sock_getsockopt(), 'optlen' is fetched the first time from userspace. 'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is fetched the second time from userspace. If change it between two fetches may cause security problems or unexpected behaivor, and there is no reason to fetch it a second time. To fix this, we need to remove the second fetch. Signed-off-by: JingYi Hou <houjingyi647@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-22net: ethtool: Allow matching on vlan DEI bitMaxime Chevallier
[ Upstream commit f0d2ca1531377e7da888913e277eefac05a59b6f ] Using ethtool, users can specify a classification action matching on the full vlan tag, which includes the DEI bit (also previously called CFI). However, when converting the ethool_flow_spec to a flow_rule, we use dissector keys to represent the matching patterns. Since the vlan dissector key doesn't include the DEI bit, this information was silently discarded when translating the ethtool flow spec in to a flow_rule. This commit adds the DEI bit into the vlan dissector key, and allows propagating the information to the driver when parsing the ethtool flow spec. Fixes: eca4205f9ec3 ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator") Reported-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-22neigh: fix use-after-free read in pneigh_get_nextEric Dumazet
[ Upstream commit f3e92cb8e2eb8c27d109e6fd73d3a69a8c09e288 ] Nine years ago, I added RCU handling to neighbours, not pneighbours. (pneigh are not commonly used) Unfortunately I missed that /proc dump operations would use a common entry and exit point : neigh_seq_start() and neigh_seq_stop() We need to read_lock(tbl->lock) or risk use-after-free while iterating the pneigh structures. We might later convert pneigh to RCU and revert this patch. sysbot reported : BUG: KASAN: use-after-free in pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158 Read of size 8 at addr ffff888097f2a700 by task syz-executor.0/9825 CPU: 1 PID: 9825 Comm: syz-executor.0 Not tainted 5.2.0-rc4+ #32 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 pneigh_get_next.isra.0+0x24b/0x280 net/core/neighbour.c:3158 neigh_seq_next+0xdb/0x210 net/core/neighbour.c:3240 seq_read+0x9cf/0x1110 fs/seq_file.c:258 proc_reg_read+0x1fc/0x2c0 fs/proc/inode.c:221 do_loop_readv_writev fs/read_write.c:714 [inline] do_loop_readv_writev fs/read_write.c:701 [inline] do_iter_read+0x4a4/0x660 fs/read_write.c:935 vfs_readv+0xf0/0x160 fs/read_write.c:997 kernel_readv fs/splice.c:359 [inline] default_file_splice_read+0x475/0x890 fs/splice.c:414 do_splice_to+0x127/0x180 fs/splice.c:877 splice_direct_to_actor+0x2d2/0x970 fs/splice.c:954 do_splice_direct+0x1da/0x2a0 fs/splice.c:1063 do_sendfile+0x597/0xd00 fs/read_write.c:1464 __do_sys_sendfile64 fs/read_write.c:1525 [inline] __se_sys_sendfile64 fs/read_write.c:1511 [inline] __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4592c9 Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f4aab51dc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028 RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000004592c9 RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000005 RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000080000000 R11: 0000000000000246 R12: 00007f4aab51e6d4 R13: 00000000004c689d R14: 00000000004db828 R15: 00000000ffffffff Allocated by task 9827: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503 __do_kmalloc mm/slab.c:3660 [inline] __kmalloc+0x15c/0x740 mm/slab.c:3669 kmalloc include/linux/slab.h:552 [inline] pneigh_lookup+0x19c/0x4a0 net/core/neighbour.c:731 arp_req_set_public net/ipv4/arp.c:1010 [inline] arp_req_set+0x613/0x720 net/ipv4/arp.c:1026 arp_ioctl+0x652/0x7f0 net/ipv4/arp.c:1226 inet_ioctl+0x2a0/0x340 net/ipv4/af_inet.c:926 sock_do_ioctl+0xd8/0x2f0 net/socket.c:1043 sock_ioctl+0x3ed/0x780 net/socket.c:1194 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 9824: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3432 [inline] kfree+0xcf/0x220 mm/slab.c:3755 pneigh_ifdown_and_unlock net/core/neighbour.c:812 [inline] __neigh_ifdown+0x236/0x2f0 net/core/neighbour.c:356 neigh_ifdown+0x20/0x30 net/core/neighbour.c:372 arp_ifdown+0x1d/0x21 net/ipv4/arp.c:1274 inetdev_destroy net/ipv4/devinet.c:319 [inline] inetdev_event+0xa14/0x11f0 net/ipv4/devinet.c:1544 notifier_call_chain+0xc2/0x230 kernel/notifier.c:95 __raw_notifier_call_chain kernel/notifier.c:396 [inline] raw_notifier_call_chain+0x2e/0x40 kernel/notifier.c:403 call_netdevice_notifiers_info+0x3f/0x90 net/core/dev.c:1749 call_netdevice_notifiers_extack net/core/dev.c:1761 [inline] call_netdevice_notifiers net/core/dev.c:1775 [inline] rollback_registered_many+0x9b9/0xfc0 net/core/dev.c:8178 rollback_registered+0x109/0x1d0 net/core/dev.c:8220 unregister_netdevice_queue net/core/dev.c:9267 [inline] unregister_netdevice_queue+0x1ee/0x2c0 net/core/dev.c:9260 unregister_netdevice include/linux/netdevice.h:2631 [inline] __tun_detach+0xd8a/0x1040 drivers/net/tun.c:724 tun_detach drivers/net/tun.c:741 [inline] tun_chr_close+0xe0/0x180 drivers/net/tun.c:3451 __fput+0x2ff/0x890 fs/file_table.c:280 ____fput+0x16/0x20 fs/file_table.c:313 task_work_run+0x145/0x1c0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop+0x273/0x2c0 arch/x86/entry/common.c:168 prepare_exit_to_usermode arch/x86/entry/common.c:199 [inline] syscall_return_slowpath arch/x86/entry/common.c:279 [inline] do_syscall_64+0x58e/0x680 arch/x86/entry/common.c:304 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff888097f2a700 which belongs to the cache kmalloc-64 of size 64 The buggy address is located 0 bytes inside of 64-byte region [ffff888097f2a700, ffff888097f2a740) The buggy address belongs to the page: page:ffffea00025fca80 refcount:1 mapcount:0 mapping:ffff8880aa400340 index:0x0 flags: 0x1fffc0000000200(slab) raw: 01fffc0000000200 ffffea000250d548 ffffea00025726c8 ffff8880aa400340 raw: 0000000000000000 ffff888097f2a000 0000000100000020 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888097f2a600: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc ffff888097f2a680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc >ffff888097f2a700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ^ ffff888097f2a780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ffff888097f2a800: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc Fixes: 767e97e1e0db ("neigh: RCU conversion of struct neighbour") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19bpf: sockmap fix msg->sg.size account on ingress skbJohn Fastabend
[ Upstream commit cabede8b4f2b746232aa25730a0b752de1cb82ca ] When converting a skb to msg->sg we forget to set the size after the latest ktls/tls code conversion. This patch can be reached by doing a redir into ingress path from BPF skb sock recv hook. Then trying to read the size fails. Fix this by setting the size. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-19bpf: sockmap, only stop/flush strp if it was enabled at some pointJohn Fastabend
[ Upstream commit 014894360ec95abe868e94416b3dd6569f6e2c0c ] If we try to call strp_done on a parser that has never been initialized, because the sockmap user is only using TX side for example we get the following error. [ 883.422081] WARNING: CPU: 1 PID: 208 at kernel/workqueue.c:3030 __flush_work+0x1ca/0x1e0 ... [ 883.422095] Workqueue: events sk_psock_destroy_deferred [ 883.422097] RIP: 0010:__flush_work+0x1ca/0x1e0 This had been wrapped in a 'if (psock->parser.enabled)' logic which was broken because the strp_done() was never actually being called because we do a strp_stop() earlier in the tear down logic will set parser.enabled to false. This could result in a use after free if work was still in the queue and was resolved by the patch here, 1d79895aef18f ("sk_msg: Always cancel strp work before freeing the psock"). However, calling strp_stop(), done by the patch marked in the fixes tag, only is useful if we never initialized a strp parser program and never initialized the strp to start with. Because if we had initialized a stream parser strp_stop() would have been called by sk_psock_drop() earlier in the tear down process. By forcing the strp to stop we get past the WARNING in strp_done that checks the stopped flag but calling cancel_work_sync on work that has never been initialized is also wrong and generates the warning above. To fix check if the parser program exists. If the program exists then the strp work has been initialized and must be sync'd and cancelled before free'ing any structures. If no program exists we never initialized the stream parser in the first place so skip the sync/cancel logic implemented by strp_done. Finally, remove the strp_done its not needed and in the case where we are using the stream parser has already been called. Fixes: e8e3437762ad9 ("bpf: Stop the psock parser before canceling its work") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-06-11neighbor: Call __ipv4_neigh_lookup_noref in neigh_xmitDavid Ahern
[ Upstream commit 4b2a2bfeb3f056461a90bd621e8bd7d03fa47f60 ] Commit cd9ff4de0107 changed the key for IFF_POINTOPOINT devices to INADDR_ANY but neigh_xmit which is used for MPLS encapsulations was not updated to use the altered key. The result is that every packet Tx does a lookup on the gateway address which does not find an entry, a new one is created only to find the existing one in the table right before the insert since arp_constructor was updated to reset the primary key. This is seen in the allocs and destroys counters: ip -s -4 ntable show | head -10 | grep alloc which increase for each packet showing the unnecessary overhread. Fix by having neigh_xmit use __ipv4_neigh_lookup_noref for NEIGH_ARP_TABLE. Fixes: cd9ff4de0107 ("ipv4: Make neigh lookup keys for loopback/point-to-point devices be INADDR_ANY") Reported-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: David Ahern <dsahern@gmail.com> Tested-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11neighbor: Reset gc_entries counter if new entry is released before insertDavid Ahern
[ Upstream commit 64c6f4bbca748c3b2101469a76d88b7cd1c00476 ] Ian and Alan both reported seeing overflows after upgrades to 5.x kernels: neighbour: arp_cache: neighbor table overflow! Alan's mpls script helped get to the bottom of this bug. When a new entry is created the gc_entries counter is bumped in neigh_alloc to check if a new one is allowed to be created. ___neigh_create then searches for an existing entry before inserting the just allocated one. If an entry already exists, the new one is dropped in favor of the existing one. In this case the cleanup path needs to drop the gc_entries counter. There is no memory leak, only a counter leak. Fixes: 58956317c8d ("neighbor: Improve garbage collection") Reported-by: Ian Kumlien <ian.kumlien@gmail.com> Reported-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: David Ahern <dsahern@gmail.com> Tested-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11Revert "fib_rules: return 0 directly if an exactly same rule exists when ↵Hangbin Liu
NLM_F_EXCL not supplied" [ Upstream commit 4970b42d5c362bf873982db7d93245c5281e58f4 ] This reverts commit e9919a24d3022f72bcadc407e73a6ef17093a849. Nathan reported the new behaviour breaks Android, as Android just add new rules and delete old ones. If we return 0 without adding dup rules, Android will remove the new added rules and causing system to soft-reboot. Fixes: e9919a24d302 ("fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied") Reported-by: Nathan Chancellor <natechancellor@gmail.com> Reported-by: Yaro Slav <yaro330@gmail.com> Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Tested-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11pktgen: do not sleep with the thread lock held.Paolo Abeni
[ Upstream commit 720f1de4021f09898b8c8443f3b3e995991b6e3a ] Currently, the process issuing a "start" command on the pktgen procfs interface, acquires the pktgen thread lock and never release it, until all pktgen threads are completed. The above can blocks indefinitely any other pktgen command and any (even unrelated) netdevice removal - as the pktgen netdev notifier acquires the same lock. The issue is demonstrated by the following script, reported by Matteo: ip -b - <<'EOF' link add type dummy link add type veth link set dummy0 up EOF modprobe pktgen echo reset >/proc/net/pktgen/pgctrl { echo rem_device_all echo add_device dummy0 } >/proc/net/pktgen/kpktgend_0 echo count 0 >/proc/net/pktgen/dummy0 echo start >/proc/net/pktgen/pgctrl & sleep 1 rmmod veth Fix the above releasing the thread lock around the sleep call. Additionally we must prevent racing with forcefull rmmod - as the thread lock no more protects from them. Instead, acquire a self-reference before waiting for any thread. As a side effect, running rmmod pktgen while some thread is running now fails with "module in use" error, before this patch such command hanged indefinitely. Note: the issue predates the commit reported in the fixes tag, but this fix can't be applied before the mentioned commit. v1 -> v2: - no need to check for thread existence after flipping the lock, pktgen threads are freed only at net exit time - Fixes: 6146e6a43b35 ("[PKTGEN]: Removes thread_{un,}lock() macros.") Reported-and-tested-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-11ethtool: fix potential userspace buffer overflowVivien Didelot
[ Upstream commit 0ee4e76937d69128a6a66861ba393ebdc2ffc8a2 ] ethtool_get_regs() allocates a buffer of size ops->get_regs_len(), and pass it to the kernel driver via ops->get_regs() for filling. There is no restriction about what the kernel drivers can or cannot do with the open ethtool_regs structure. They usually set regs->version and ignore regs->len or set it to the same size as ops->get_regs_len(). But if userspace allocates a smaller buffer for the registers dump, we would cause a userspace buffer overflow in the final copy_to_user() call, which uses the regs.len value potentially reset by the driver. To fix this, make this case obvious and store regs.len before calling ops->get_regs(), to only copy as much data as requested by userspace, up to the value returned by ops->get_regs_len(). While at it, remove the redundant check for non-null regbuf. Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-04net: correct zerocopy refcnt with udp MSG_MOREWillem de Bruijn
[ Upstream commit 100f6d8e09905c59be45b6316f8f369c0be1b2d8 ] TCP zerocopy takes a uarg reference for every skb, plus one for the tcp_sendmsg_locked datapath temporarily, to avoid reaching refcnt zero as it builds, sends and frees skbs inside its inner loop. UDP and RAW zerocopy do not send inside the inner loop so do not need the extra sock_zerocopy_get + sock_zerocopy_put pair. Commit 52900d22288ed ("udp: elide zerocopy operation in hot path") introduced extra_uref to pass the initial reference taken in sock_zerocopy_alloc to the first generated skb. But, sock_zerocopy_realloc takes this extra reference at the start of every call. With MSG_MORE, no new skb may be generated to attach the extra_uref to, so refcnt is incorrectly 2 with only one skb. Do not take the extra ref if uarg && !tcp, which implies MSG_MORE. Update extra_uref accordingly. This conditional assignment triggers a false positive may be used uninitialized warning, so have to initialize extra_uref at define. Changes v1->v2: fix typo in Fixes SHA1 Fixes: 52900d22288e7 ("udp: elide zerocopy operation in hot path") Reported-by: syzbot <syzkaller@googlegroups.com> Diagnosed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-04net-gro: fix use-after-free read in napi_gro_frags()Eric Dumazet
[ Upstream commit a4270d6795b0580287453ea55974d948393e66ef ] If a network driver provides to napi_gro_frags() an skb with a page fragment of exactly 14 bytes, the call to gro_pull_from_frag0() will 'consume' the fragment by calling skb_frag_unref(skb, 0), and the page might be freed and reused. Reading eth->h_proto at the end of napi_frags_skb() might read mangled data, or crash under specific debugging features. BUG: KASAN: use-after-free in napi_frags_skb net/core/dev.c:5833 [inline] BUG: KASAN: use-after-free in napi_gro_frags+0xc6f/0xd10 net/core/dev.c:5841 Read of size 2 at addr ffff88809366840c by task syz-executor599/8957 CPU: 1 PID: 8957 Comm: syz-executor599 Not tainted 5.2.0-rc1+ #32 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load_n_noabort+0xf/0x20 mm/kasan/generic_report.c:142 napi_frags_skb net/core/dev.c:5833 [inline] napi_gro_frags+0xc6f/0xd10 net/core/dev.c:5841 tun_get_user+0x2f3c/0x3ff0 drivers/net/tun.c:1991 tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2037 call_write_iter include/linux/fs.h:1872 [inline] do_iter_readv_writev+0x5f8/0x8f0 fs/read_write.c:693 do_iter_write fs/read_write.c:970 [inline] do_iter_write+0x184/0x610 fs/read_write.c:951 vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015 do_writev+0x15b/0x330 fs/read_write.c:1058 Fixes: a50e233c50db ("net-gro: restore frag0 optimization") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-04ethtool: Check for vlan etype or vlan tci when parsing flow_ruleMaxime Chevallier
[ Upstream commit b73484b2fc0d0ba84a13e9d86eb4adcae718163b ] When parsing an ethtool flow spec to build a flow_rule, the code checks if both the vlan etype and the vlan tci are specified by the user to add a FLOW_DISSECTOR_KEY_VLAN match. However, when the user only specified a vlan etype or a vlan tci, this check silently ignores these parameters. For example, the following rule : ethtool -N eth0 flow-type udp4 vlan 0x0010 action -1 loc 0 will result in no error being issued, but the equivalent rule will be created and passed to the NIC driver : ethtool -N eth0 flow-type udp4 action -1 loc 0 In the end, neither the NIC driver using the rule nor the end user have a way to know that these keys were dropped along the way, or that incorrect parameters were entered. This kind of check should be left to either the driver, or the ethtool flow spec layer. This commit makes so that ethtool parameters are forwarded as-is to the NIC driver. Since none of the users of ethtool_rx_flow_rule_create are using the VLAN dissector, I don't think this qualifies as a regression. Fixes: eca4205f9ec3 ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator") Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Acked-by: Pablo Neira Ayuso <pablo@gnumonks.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25flow_offload: support CVLAN matchEdward Cree
[ Upstream commit bae9ed69029c7d499c57485593b2faae475fd704 ] Plumb it through from the flow_dissector. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25rtnetlink: always put IFLA_LINK for links with a link-netnsidSabrina Dubroca
[ Upstream commit feadc4b6cf42a53a8a93c918a569a0b7e62bd350 ] Currently, nla_put_iflink() doesn't put the IFLA_LINK attribute when iflink == ifindex. In some cases, a device can be created in a different netns with the same ifindex as its parent. That device will not dump its IFLA_LINK attribute, which can confuse some userspace software that expects it. For example, if the last ifindex created in init_net and foo are both 8, these commands will trigger the issue: ip link add parent type dummy # ifindex 9 ip link add link parent netns foo type macvlan # ifindex 9 in ns foo So, in case a device puts the IFLA_LINK_NETNSID attribute in a dump, always put the IFLA_LINK attribute as well. Thanks to Dan Winship for analyzing the original OpenShift bug down to the missing netlink attribute. v2: change Fixes tag, it's been here forever, as Nicolas Dichtel said add Nicolas' ack v3: change Fixes tag fix subject typo, spotted by Edward Cree Analyzed-by: Dan Winship <danw@redhat.com> Fixes: d8a5ec672768 ("[NET]: netlink support for moving devices between network namespaces.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25net: avoid weird emergency messageEric Dumazet
[ Upstream commit d7c04b05c9ca14c55309eb139430283a45c4c25f ] When host is under high stress, it is very possible thread running netdev_wait_allrefs() returns from msleep(250) 10 seconds late. This leads to these messages in the syslog : [...] unregister_netdevice: waiting for syz_tun to become free. Usage count = 0 If the device refcount is zero, the wait is over. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-16flow_dissector: disable preemption around BPF callsEric Dumazet
[ Upstream commit b1c17a9a353878602fd5bfe9103e4afe5e9a3f96 ] Various things in eBPF really require us to disable preemption before running an eBPF program. syzbot reported : BUG: assuming atomic context at net/core/flow_dissector.c:737 in_atomic(): 0, irqs_disabled(): 0, pid: 24710, name: syz-executor.3 2 locks held by syz-executor.3/24710: #0: 00000000e81a4bf1 (&tfile->napi_mutex){+.+.}, at: tun_get_user+0x168e/0x3ff0 drivers/net/tun.c:1850 #1: 00000000254afebd (rcu_read_lock){....}, at: __skb_flow_dissect+0x1e1/0x4bb0 net/core/flow_dissector.c:822 CPU: 1 PID: 24710 Comm: syz-executor.3 Not tainted 5.1.0+ #6 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 __cant_sleep kernel/sched/core.c:6165 [inline] __cant_sleep.cold+0xa3/0xbb kernel/sched/core.c:6142 bpf_flow_dissect+0xfe/0x390 net/core/flow_dissector.c:737 __skb_flow_dissect+0x362/0x4bb0 net/core/flow_dissector.c:853 skb_flow_dissect_flow_keys_basic include/linux/skbuff.h:1322 [inline] skb_probe_transport_header include/linux/skbuff.h:2500 [inline] skb_probe_transport_header include/linux/skbuff.h:2493 [inline] tun_get_user+0x2cfe/0x3ff0 drivers/net/tun.c:1940 tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2037 call_write_iter include/linux/fs.h:1872 [inline] do_iter_readv_writev+0x5fd/0x900 fs/read_write.c:693 do_iter_write fs/read_write.c:970 [inline] do_iter_write+0x184/0x610 fs/read_write.c:951 vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015 do_writev+0x15b/0x330 fs/read_write.c:1058 __do_sys_writev fs/read_write.c:1131 [inline] __se_sys_writev fs/read_write.c:1128 [inline] __x64_sys_writev+0x75/0xb0 fs/read_write.c:1128 do_syscall_64+0x103/0x670 arch/x86/entry/common.c:298 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Petar Penkov <ppenkov@google.com> Cc: Stanislav Fomichev <sdf@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-16fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL ↵Hangbin Liu
not supplied [ Upstream commit e9919a24d3022f72bcadc407e73a6ef17093a849 ] With commit 153380ec4b9 ("fib_rules: Added NLM_F_EXCL support to fib_nl_newrule") we now able to check if a rule already exists. But this only works with iproute2. For other tools like libnl, NetworkManager, it still could add duplicate rules with only NLM_F_CREATE flag, like [localhost ~ ]# ip rule 0: from all lookup local 32766: from all lookup main 32767: from all lookup default 100000: from 192.168.7.5 lookup 5 100000: from 192.168.7.5 lookup 5 As it doesn't make sense to create two duplicate rules, let's just return 0 if the rule exists. Fixes: 153380ec4b9 ("fib_rules: Added NLM_F_EXCL support to fib_nl_newrule") Reported-by: Thomas Haller <thaller@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-16socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEWArnd Bergmann
It looks like the new socket options only work correctly for native execution, but in case of compat mode fall back to the old behavior as we ignore the 'old_timeval' flag. Rework so we treat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW the same way in compat and native 32-bit mode. Cc: Deepa Dinamani <deepa.kernel@gmail.com> Fixes: a9beb86ae6e5 ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16net: Fix missing meta data in skb with vlan packetYuya Kusakabe
skb_reorder_vlan_header() should move XDP meta data with ethernet header if XDP meta data exists. Fixes: de8f3a83b0a0 ("bpf: add meta pointer for direct access") Signed-off-by: Yuya Kusakabe <yuya.kusakabe@gmail.com> Signed-off-by: Takeru Hayasaka <taketarou2@gmail.com> Co-developed-by: Takeru Hayasaka <taketarou2@gmail.com> Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-16net/core: work around section mismatch warning for ptp_classifierArd Biesheuvel
The routine ptp_classifier_init() uses an initializer for an automatic struct type variable which refers to an __initdata symbol. This is perfectly legal, but may trigger a section mismatch warning when running the compiler in -fpic mode, due to the fact that the initializer may be emitted into an anonymous .data section thats lack the __init annotation. So work around it by using assignments instead. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15Revert "net-sysfs: Fix memory leak in netdev_register_kobject"Wang Hai
This reverts commit 6b70fc94afd165342876e53fc4b2f7d085009945. The reverted bugfix will cause another issue. Reported by syzbot+6024817a931b2830bc93@syzkaller.appspotmail.com. See https://syzkaller.appspot.com/x/log.txt?x=1737671b200000 for details. Signed-off-by: Wang Hai <wanghai26@huawei.com> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-14rtnetlink: fix rtnl_valid_stats_req() nlmsg_len checkEric Dumazet
Jakub forgot to either use nlmsg_len() or nlmsg_msg_size(), allowing KMSAN to detect a possible uninit-value in rtnl_stats_get BUG: KMSAN: uninit-value in rtnl_stats_get+0x6d9/0x11d0 net/core/rtnetlink.c:4997 CPU: 0 PID: 10428 Comm: syz-executor034 Not tainted 5.1.0-rc2+ #24 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x173/0x1d0 lib/dump_stack.c:113 kmsan_report+0x131/0x2a0 mm/kmsan/kmsan.c:619 __msan_warning+0x7a/0xf0 mm/kmsan/kmsan_instr.c:310 rtnl_stats_get+0x6d9/0x11d0 net/core/rtnetlink.c:4997 rtnetlink_rcv_msg+0x115b/0x1550 net/core/rtnetlink.c:5192 netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2485 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5210 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline] netlink_unicast+0xf3e/0x1020 net/netlink/af_netlink.c:1336 netlink_sendmsg+0x127f/0x1300 net/netlink/af_netlink.c:1925 sock_sendmsg_nosec net/socket.c:622 [inline] sock_sendmsg net/socket.c:632 [inline] ___sys_sendmsg+0xdb3/0x1220 net/socket.c:2137 __sys_sendmsg net/socket.c:2175 [inline] __do_sys_sendmsg net/socket.c:2184 [inline] __se_sys_sendmsg+0x305/0x460 net/socket.c:2182 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2182 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291 entry_SYSCALL_64_after_hwframe+0x63/0xe7 Fixes: 51bc860d4a99 ("rtnetlink: stats: validate attributes in get as well as dumps") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12bpf: Check address length before reading address familyTetsuo Handa
KMSAN will complain if valid address length passed to bpf_bind() is shorter than sizeof("struct sockaddr"->sa_family) bytes. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10failover: allow name change on IFF_UP slave interfacesSi-Wei Liu
When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. It's less risky to lift up the rename restriction on failover slave which is already UP. Although it's possible this change may potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to listen for the rename events on failover slaves. Userspace component interacting with slaves is expected to be changed to operate on failover master interface instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and userspace components should only deal with failover master in the long run. Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module") Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-04-04 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Batch of fixes to the existing BPF flow dissector API to support calling BPF programs from the eth_get_headlen context (support for latter is planned to be added in bpf-next), from Stanislav. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03net-gro: Fix GRO flush when receiving a GSO packet.Steffen Klassert
Currently we may merge incorrectly a received GSO packet or a packet with frag_list into a packet sitting in the gro_hash list. skb_segment() may crash case because the assumptions on the skb layout are not met. The correct behaviour would be to flush the packet in the gro_hash list and send the received GSO packet directly afterwards. Commit d61d072e87c8e ("net-gro: avoid reorders") sets NAPI_GRO_CB(skb)->flush in this case, but this is not checked before merging. This patch makes sure to check this flag and to not merge in that case. Fixes: d61d072e87c8e ("net-gro: avoid reorders") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03flow_dissector: allow access only to a subset of __sk_buff fieldsStanislav Fomichev
Use whitelist instead of a blacklist and allow only a small set of fields that might be relevant in the context of flow dissector: * data * data_end * flow_keys This is required for the eth_get_headlen case where we have only a chunk of data to dissect (i.e. trying to read the other skb fields doesn't make sense). Note, that it is a breaking API change! However, we've provided flow_keys->n_proto as a substitute for skb->protocol; and there is no need to manually handle skb->vlan_present. So even if we break somebody, the migration is trivial. Unfortunately, we can't support eth_get_headlen use-case without those breaking changes. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-03flow_dissector: fix clamping of BPF flow_keys for non-zero nhoffStanislav Fomichev
Don't allow BPF program to set flow_keys->nhoff to less than initial value. We currently don't read the value afterwards in anything but the tests, but it's still a good practice to return consistent values to the test programs. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-03net/flow_dissector: pass flow_keys->n_proto to BPF programsStanislav Fomichev
This is a preparation for the next commit that would prohibit access to the most fields of __sk_buff from the BPF programs. Instead of requiring BPF flow dissector programs to look into skb, pass all input data in the flow_keys. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-03-29net: ethtool: not call vzalloc for zero sized memory requestLi RongQing
NULL or ZERO_SIZE_PTR will be returned for zero sized memory request, and derefencing them will lead to a segfault so it is unnecessory to call vzalloc for zero sized memory request and not call functions which maybe derefence the NULL allocated memory this also fixes a possible memory leak if phy_ethtool_get_stats returns error, memory should be freed before exit Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Wang Li <wangli39@baidu.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28net: core: netif_receive_skb_list: unlist skb before passing to pt->funcAlexander Lobakin
__netif_receive_skb_list_ptype() leaves skb->next poisoned before passing it to pt_prev->func handler, what may produce (in certain cases, e.g. DSA setup) crashes like: [ 88.606777] CPU 0 Unable to handle kernel paging request at virtual address 0000000e, epc == 80687078, ra == 8052cc7c [ 88.618666] Oops[#1]: [ 88.621196] CPU: 0 PID: 0 Comm: swapper Not tainted 5.1.0-rc2-dlink-00206-g4192a172-dirty #1473 [ 88.630885] $ 0 : 00000000 10000400 00000002 864d7850 [ 88.636709] $ 4 : 87c0ddf0 864d7800 87c0ddf0 00000000 [ 88.642526] $ 8 : 00000000 49600000 00000001 00000001 [ 88.648342] $12 : 00000000 c288617b dadbee27 25d17c41 [ 88.654159] $16 : 87c0ddf0 85cff080 80790000 fffffffd [ 88.659975] $20 : 80797b20 ffffffff 00000001 864d7800 [ 88.665793] $24 : 00000000 8011e658 [ 88.671609] $28 : 80790000 87c0dbc0 87cabf00 8052cc7c [ 88.677427] Hi : 00000003 [ 88.680622] Lo : 7b5b4220 [ 88.683840] epc : 80687078 vlan_dev_hard_start_xmit+0x1c/0x1a0 [ 88.690532] ra : 8052cc7c dev_hard_start_xmit+0xac/0x188 [ 88.696734] Status: 10000404 IEp [ 88.700422] Cause : 50000008 (ExcCode 02) [ 88.704874] BadVA : 0000000e [ 88.708069] PrId : 0001a120 (MIPS interAptiv (multi)) [ 88.713005] Modules linked in: [ 88.716407] Process swapper (pid: 0, threadinfo=(ptrval), task=(ptrval), tls=00000000) [ 88.725219] Stack : 85f61c28 00000000 0000000e 80780000 87c0ddf0 85cff080 80790000 8052cc7c [ 88.734529] 87cabf00 00000000 00000001 85f5fb40 807b0000 864d7850 87cabf00 807d0000 [ 88.743839] 864d7800 8655f600 00000000 85cff080 87c1c000 0000006a 00000000 8052d96c [ 88.753149] 807a0000 8057adb8 87c0dcc8 87c0dc50 85cfff08 00000558 87cabf00 85f58c50 [ 88.762460] 00000002 85f58c00 864d7800 80543308 fffffff4 00000001 85f58c00 864d7800 [ 88.771770] ... [ 88.774483] Call Trace: [ 88.777199] [<80687078>] vlan_dev_hard_start_xmit+0x1c/0x1a0 [ 88.783504] [<8052cc7c>] dev_hard_start_xmit+0xac/0x188 [ 88.789326] [<8052d96c>] __dev_queue_xmit+0x6e8/0x7d4 [ 88.794955] [<805a8640>] ip_finish_output2+0x238/0x4d0 [ 88.800677] [<805ab6a0>] ip_output+0xc8/0x140 [ 88.805526] [<805a68f4>] ip_forward+0x364/0x560 [ 88.810567] [<805a4ff8>] ip_rcv+0x48/0xe4 [ 88.815030] [<80528d44>] __netif_receive_skb_one_core+0x44/0x58 [ 88.821635] [<8067f220>] dsa_switch_rcv+0x108/0x1ac [ 88.827067] [<80528f80>] __netif_receive_skb_list_core+0x228/0x26c [ 88.833951] [<8052ed84>] netif_receive_skb_list+0x1d4/0x394 [ 88.840160] [<80355a88>] lunar_rx_poll+0x38c/0x828 [ 88.845496] [<8052fa78>] net_rx_action+0x14c/0x3cc [ 88.850835] [<806ad300>] __do_softirq+0x178/0x338 [ 88.856077] [<8012a2d4>] irq_exit+0xbc/0x100 [ 88.860846] [<802f8b70>] plat_irq_dispatch+0xc0/0x144 [ 88.866477] [<80105974>] handle_int+0x14c/0x158 [ 88.871516] [<806acfb0>] r4k_wait+0x30/0x40 [ 88.876462] Code: afb10014 8c8200a0 00803025 <9443000c> 94a20468 00000000 10620042 00a08025 9605046a [ 88.887332] [ 88.888982] ---[ end trace eb863d007da11cf1 ]--- [ 88.894122] Kernel panic - not syncing: Fatal exception in interrupt [ 88.901202] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Fix this by pulling skb off the sublist and zeroing skb->next pointer before calling ptype callback. Fixes: 88eb1944e18c ("net: core: propagate SKB lists through packet_type lookup") Reviewed-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexander Lobakin <alobakin@dlink.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28netns: provide pure entropy for net_hash_mix()Eric Dumazet
net_hash_mix() currently uses kernel address of a struct net, and is used in many places that could be used to reveal this address to a patient attacker, thus defeating KASLR, for the typical case (initial net namespace, &init_net is not dynamically allocated) I believe the original implementation tried to avoid spending too many cycles in this function, but security comes first. Also provide entropy regardless of CONFIG_NET_NS. Fixes: 0b4419162aa6 ("netns: introduce the net_hash_mix "salt" for hashes") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27net: datagram: fix unbounded loop in __skb_try_recv_datagram()Paolo Abeni
Christoph reported a stall while peeking datagram with an offset when busy polling is enabled. __skb_try_recv_datagram() uses as the loop termination condition 'queue empty'. When peeking, the socket queue can be not empty, even when no additional packets are received. Address the issue explicitly checking for receive queue changes, as currently done by __skb_wait_for_more_packets(). Fixes: 2b5cd0dfa384 ("net: Change return type of sk_busy_loop from bool to void") Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-26net: devlink: skip info_get op call if it is not defined in dumpitJiri Pirko
In dumpit, unlike doit, the check for info_get op being defined is missing. Add it and avoid null pointer dereference in case driver does not define this op. Fixes: f9cf22882c60 ("devlink: add device information API") Reported-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net-sysfs: Fix memory leak in netdev_register_kobjectWang Hai
When registering struct net_device, it will call register_netdevice -> netdev_register_kobject -> device_initialize(dev); dev_set_name(dev, "%s", ndev->name) device_add(dev) register_queue_kobjects(ndev) In netdev_register_kobject(), if device_add(dev) or register_queue_kobjects(ndev) failed. Register_netdevice() will return error, causing netdev_freemem(ndev) to be called to free net_device, however put_device(&dev->dev)->..-> kobject_cleanup() won't be called, resulting in a memory leak. syzkaller report this: BUG: memory leak unreferenced object 0xffff8881f4fad168 (size 8): comm "syz-executor.0", pid 3575, jiffies 4294778002 (age 20.134s) hex dump (first 8 bytes): 77 70 61 6e 30 00 ff ff wpan0... backtrace: [<000000006d2d91d7>] kstrdup_const+0x3d/0x50 mm/util.c:73 [<00000000ba9ff953>] kvasprintf_const+0x112/0x170 lib/kasprintf.c:48 [<000000005555ec09>] kobject_set_name_vargs+0x55/0x130 lib/kobject.c:281 [<0000000098d28ec3>] dev_set_name+0xbb/0xf0 drivers/base/core.c:1915 [<00000000b7553017>] netdev_register_kobject+0xc0/0x410 net/core/net-sysfs.c:1727 [<00000000c826a797>] register_netdevice+0xa51/0xeb0 net/core/dev.c:8711 [<00000000857bfcfd>] cfg802154_update_iface_num.isra.2+0x13/0x90 [ieee802154] [<000000003126e453>] ieee802154_llsec_fill_key_id+0x1d5/0x570 [ieee802154] [<00000000e4b3df51>] 0xffffffffc1500e0e [<00000000b4319776>] platform_drv_probe+0xc6/0x180 drivers/base/platform.c:614 [<0000000037669347>] really_probe+0x491/0x7c0 drivers/base/dd.c:509 [<000000008fed8862>] driver_probe_device+0xdc/0x240 drivers/base/dd.c:671 [<00000000baf52041>] device_driver_attach+0xf2/0x130 drivers/base/dd.c:945 [<00000000c7cc8dec>] __driver_attach+0x10e/0x210 drivers/base/dd.c:1022 [<0000000057a757c2>] bus_for_each_dev+0x154/0x1e0 drivers/base/bus.c:304 [<000000005f5ae04b>] bus_add_driver+0x427/0x5e0 drivers/base/bus.c:645 Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: 1fa5ae857bb1 ("driver core: get rid of struct device's bus_id string array") Signed-off-by: Wang Hai <wanghai26@huawei.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19net-sysfs: call dev_hold if kobject_init_and_add successYueHaibing
In netdev_queue_add_kobject and rx_queue_add_kobject, if sysfs_create_group failed, kobject_put will call netdev_queue_release to decrease dev refcont, however dev_hold has not be called. So we will see this while unregistering dev: unregister_netdevice: waiting for bcsh0 to become free. Usage count = -1 Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: d0d668371679 ("net: don't decrement kobj reference count on init failure") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-03-16 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix a umem memory leak on cleanup in AF_XDP, from Björn. 2) Fix BTF to properly resolve forward-declared enums into their corresponding full enum definition types during deduplication, from Andrii. 3) Fix libbpf to reject invalid flags in xsk_socket__create(), from Magnus. 4) Fix accessing invalid pointer returned from bpf_tcp_sock() and bpf_sk_fullsock() after bpf_sk_release() was called, from Martin. 5) Fix generation of load/store DW instructions in PPC JIT, from Naveen. 6) Various fixes in BPF helper function documentation in bpf.h UAPI header used to bpf-helpers(7) man page, from Quentin. 7) Fix segfault in BPF test_progs when prog loading failed, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-13bpf: Add bpf_get_listener_sock(struct bpf_sock *sk) helperMartin KaFai Lau
Add a new helper "struct bpf_sock *bpf_get_listener_sock(struct bpf_sock *sk)" which returns a bpf_sock in TCP_LISTEN state. It will trace back to the listener sk from a request_sock if possible. It returns NULL for all other cases. No reference is taken because the helper ensures the sk is in SOCK_RCU_FREE (where the TCP_LISTEN sock should be in). Hence, bpf_sk_release() is unnecessary and the verifier does not allow bpf_sk_release(listen_sk) to be called either. The following is also allowed because the bpf_prog is run under rcu_read_lock(): sk = bpf_sk_lookup_tcp(); /* if (!sk) { ... } */ listen_sk = bpf_get_listener_sock(sk); /* if (!listen_sk) { ... } */ bpf_sk_release(sk); src_port = listen_sk->src_port; /* Allowed */ Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-13bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_releaseMartin KaFai Lau
Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk) can still be accessed after bpf_sk_release(sk). Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue. This patch addresses them together. A simple reproducer looks like this: sk = bpf_sk_lookup_tcp(); /* if (!sk) ... */ tp = bpf_tcp_sock(sk); /* if (!tp) ... */ bpf_sk_release(sk); snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */ The problem is the verifier did not scrub the register's states of the tcp_sock ptr (tp) after bpf_sk_release(sk). [ Note that when calling bpf_tcp_sock(sk), the sk is not always refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works fine for this case. ] Currently, the verifier does not track if a helper's return ptr (in REG_0) is "carry"-ing one of its argument's refcount status. To carry this info, the reg1->id needs to be stored in reg0. One approach was tried, like "reg0->id = reg1->id", when calling "bpf_tcp_sock()". The main idea was to avoid adding another "ref_obj_id" for the same reg. However, overlapping the NULL marking and ref tracking purpose in one "id" does not work well: ref_sk = bpf_sk_lookup_tcp(); fullsock = bpf_sk_fullsock(ref_sk); tp = bpf_tcp_sock(ref_sk); if (!fullsock) { bpf_sk_release(ref_sk); return 0; } /* fullsock_reg->id is marked for NOT-NULL. * Same for tp_reg->id because they have the same id. */ /* oops. verifier did not complain about the missing !tp check */ snd_cwnd = tp->snd_cwnd; Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state". With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can scrub all reg states which has a ref_obj_id match. It is done with the changes in release_reg_references() in this patch. While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and bpf_sk_fullsock() to avoid these helpers from returning another ptr. It will make bpf_sk_release(tp) possible: sk = bpf_sk_lookup_tcp(); /* if (!sk) ... */ tp = bpf_tcp_sock(sk); /* if (!tp) ... */ bpf_sk_release(tp); A separate helper "bpf_get_listener_sock()" will be added in a later patch to do sk_to_full_sk(). Misc change notes: - To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON. ARG_PTR_TO_SOCKET is removed from bpf.h since no helper is using it. - arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted() because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not refcounted. All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock() take ARG_PTR_TO_SOCK_COMMON. - check_refcount_ok() ensures is_acquire_function() cannot take arg_type_may_be_refcounted() as its argument. - The check_func_arg() can only allow one refcount-ed arg. It is guaranteed by check_refcount_ok() which ensures at most one arg can be refcounted. Hence, it is a verifier internal error if >1 refcount arg found in check_func_arg(). - In release_reference(), release_reference_state() is called first to ensure a match on "reg->ref_obj_id" can be found before scrubbing the reg states with release_reg_references(). - reg_is_refcounted() is no longer needed. 1. In mark_ptr_or_null_regs(), its usage is replaced by "ref_obj_id && ref_obj_id == id" because, when is_null == true, release_reference_state() should only be called on the ref_obj_id obtained by a acquire helper (i.e. is_acquire_function() == true). Otherwise, the following would happen: sk = bpf_sk_lookup_tcp(); /* if (!sk) { ... } */ fullsock = bpf_sk_fullsock(sk); if (!fullsock) { /* * release_reference_state(fullsock_reg->ref_obj_id) * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id. * * Hence, the following bpf_sk_release(sk) will fail * because the ref state has already been released in the * earlier release_reference_state(fullsock_reg->ref_obj_id). */ bpf_sk_release(sk); } 2. In release_reg_references(), the current reg_is_refcounted() call is unnecessary because the id check is enough. - The type_is_refcounted() and type_is_refcounted_or_null() are no longer needed also because reg_is_refcounted() is removed. Fixes: 655a51e536c0 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock") Reported-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-03-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "First batch of fixes in the new merge window: 1) Double dst_cache free in act_tunnel_key, from Wenxu. 2) Avoid NULL deref in IN_DEV_MFORWARD() by failing early in the ip_route_input_rcu() path, from Paolo Abeni. 3) Fix appletalk compile regression, from Arnd Bergmann. 4) If SLAB objects reach the TCP sendpage method we are in serious trouble, so put a debugging check there. From Vasily Averin. 5) Memory leak in hsr layer, from Mao Wenan. 6) Only test GSO type on GSO packets, from Willem de Bruijn. 7) Fix crash in xsk_diag_put_umem(), from Eric Dumazet. 8) Fix VNIC mailbox length in nfp, from Dirk van der Merwe. 9) Fix race in ipv4 route exception handling, from Xin Long. 10) Missing DMA memory barrier in hns3 driver, from Jian Shen. 11) Use after free in __tcf_chain_put(), from Vlad Buslov. 12) Handle inet_csk_reqsk_queue_add() failures, from Guillaume Nault. 13) Return value correction when ip_mc_may_pull() fails, from Eric Dumazet. 14) Use after free in x25_device_event(), also from Eric" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits) gro_cells: make sure device is up in gro_cells_receive() vxlan: test dev->flags & IFF_UP before calling gro_cells_receive() net/x25: fix use-after-free in x25_device_event() isdn: mISDNinfineon: fix potential NULL pointer dereference net: hns3: fix to stop multiple HNS reset due to the AER changes ip: fix ip_mc_may_pull() return value net: keep refcount warning in reqsk_free() net: stmmac: Avoid one more sometimes uninitialized Clang warning net: dsa: mv88e6xxx: Set correct interface mode for CPU/DSA ports rxrpc: Fix client call queueing, waiting for channel tcp: handle inet_csk_reqsk_queue_add() failures net: ethernet: sun: Zero initialize class in default case in niu_add_ethtool_tcam_entry 8139too : Add support for U.S. Robotics USR997901A 10/100 Cardbus NIC fou, fou6: avoid uninit-value in gue_err() and gue6_err() net: sched: fix potential use-after-free in __tcf_chain_put() vhost: silence an unused-variable warning vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock connector: fix unsafe usage of ->real_parent vxlan: do not need BH again in vxlan_cleanup() net: hns3: add dma_rmb() for rx description ...
2019-03-10gro_cells: make sure device is up in gro_cells_receive()Eric Dumazet
We keep receiving syzbot reports [1] that show that tunnels do not play the rcu/IFF_UP rules properly. At device dismantle phase, gro_cells_destroy() will be called only after a full rcu grace period is observed after IFF_UP has been cleared. This means that IFF_UP needs to be tested before queueing packets into netif_rx() or gro_cells. This patch implements the test in gro_cells_receive() because too many callers do not seem to bother enough. [1] BUG: unable to handle kernel paging request at fffff4ca0b9ffffe PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 21 Comm: kworker/u4:1 Not tainted 5.0.0+ #97 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net RIP: 0010:__skb_unlink include/linux/skbuff.h:1929 [inline] RIP: 0010:__skb_dequeue include/linux/skbuff.h:1945 [inline] RIP: 0010:__skb_queue_purge include/linux/skbuff.h:2656 [inline] RIP: 0010:gro_cells_destroy net/core/gro_cells.c:89 [inline] RIP: 0010:gro_cells_destroy+0x19d/0x360 net/core/gro_cells.c:78 Code: 03 42 80 3c 20 00 0f 85 53 01 00 00 48 8d 7a 08 49 8b 47 08 49 c7 07 00 00 00 00 48 89 f9 49 c7 47 08 00 00 00 00 48 c1 e9 03 <42> 80 3c 21 00 0f 85 10 01 00 00 48 89 c1 48 89 42 08 48 c1 e9 03 RSP: 0018:ffff8880aa3f79a8 EFLAGS: 00010a02 RAX: 00ffffffffffffe8 RBX: ffffe8ffffc64b70 RCX: 1ffff8ca0b9ffffe RDX: ffffc6505cffffe8 RSI: ffffffff858410ca RDI: ffffc6505cfffff0 RBP: ffff8880aa3f7a08 R08: ffff8880aa3e8580 R09: fffffbfff1263645 R10: fffffbfff1263644 R11: ffffffff8931b223 R12: dffffc0000000000 R13: 0000000000000000 R14: ffffe8ffffc64b80 R15: ffffe8ffffc64b75 kobject: 'loop2' (000000004bd7d84a): kobject_uevent_env FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffff4ca0b9ffffe CR3: 0000000094941000 CR4: 00000000001406f0 Call Trace: kobject: 'loop2' (000000004bd7d84a): fill_kobj_path: path = '/devices/virtual/block/loop2' ip_tunnel_dev_free+0x19/0x60 net/ipv4/ip_tunnel.c:1010 netdev_run_todo+0x51c/0x7d0 net/core/dev.c:8970 rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:116 ip_tunnel_delete_nets+0x423/0x5f0 net/ipv4/ip_tunnel.c:1124 vti_exit_batch_net+0x23/0x30 net/ipv4/ip_vti.c:495 ops_exit_list.isra.0+0x105/0x160 net/core/net_namespace.c:156 cleanup_net+0x3fb/0x960 net/core/net_namespace.c:551 process_one_work+0x98e/0x1790 kernel/workqueue.c:2173 worker_thread+0x98/0xe40 kernel/workqueue.c:2319 kthread+0x357/0x430 kernel/kthread.c:246 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352 Modules linked in: CR2: fffff4ca0b9ffffe [ end trace 513fc9c1338d1cb3 ] RIP: 0010:__skb_unlink include/linux/skbuff.h:1929 [inline] RIP: 0010:__skb_dequeue include/linux/skbuff.h:1945 [inline] RIP: 0010:__skb_queue_purge include/linux/skbuff.h:2656 [inline] RIP: 0010:gro_cells_destroy net/core/gro_cells.c:89 [inline] RIP: 0010:gro_cells_destroy+0x19d/0x360 net/core/gro_cells.c:78 Code: 03 42 80 3c 20 00 0f 85 53 01 00 00 48 8d 7a 08 49 8b 47 08 49 c7 07 00 00 00 00 48 89 f9 49 c7 47 08 00 00 00 00 48 c1 e9 03 <42> 80 3c 21 00 0f 85 10 01 00 00 48 89 c1 48 89 42 08 48 c1 e9 03 RSP: 0018:ffff8880aa3f79a8 EFLAGS: 00010a02 RAX: 00ffffffffffffe8 RBX: ffffe8ffffc64b70 RCX: 1ffff8ca0b9ffffe RDX: ffffc6505cffffe8 RSI: ffffffff858410ca RDI: ffffc6505cfffff0 RBP: ffff8880aa3f7a08 R08: ffff8880aa3e8580 R09: fffffbfff1263645 R10: fffffbfff1263644 R11: ffffffff8931b223 R12: dffffc0000000000 kobject: 'loop3' (00000000e4ee57a6): kobject_uevent_env R13: 0000000000000000 R14: ffffe8ffffc64b80 R15: ffffe8ffffc64b75 FS: 0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffff4ca0b9ffffe CR3: 0000000094941000 CR4: 00000000001406f0 Fixes: c9e6bc644e55 ("net: add gro_cells infrastructure") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-03-09 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix a crash in AF_XDP's xsk_diag_put_ring() which was passing wrong queue argument, from Eric. 2) Fix a regression due to wrong test for TCP GSO packets used in various BPF helpers like NAT64, from Willem. 3) Fix a sk_msg strparser warning which asserts that strparser must be stopped first, from Jakub. 4) Fix rejection of invalid options/bind flags in AF_XDP, from Björn. 5) Fix GSO in bpf_lwt_push_ip_encap() which must properly set inner headers and inner protocol, from Peter. 6) Fix a libbpf leak when kernel does not support BTF, from Nikita. 7) Various BPF selftest and libbpf build fixes to make out-of-tree compilation work and to properly resolve dependencies via fixdep target, from Stanislav. 8) Fix rejection of invalid ldimm64 imm field, from Daniel. 9) Fix bpf stats sysctl compile warning of unused helper function proc_dointvec_minmax_bpf_stats() under some configs, from Arnd. 10) Fix couple of warnings about using plain integer as NULL, from Bo. 11) Fix some BPF sample spelling mistakes, from Colin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-07ethtool: reduce stack usage with clangArnd Bergmann
clang inlines the dev_ethtool() more aggressively than gcc does, leading to a larger amount of used stack space: net/core/ethtool.c:2536:24: error: stack frame size of 1216 bytes in function 'dev_ethtool' [-Werror,-Wframe-larger-than=] Marking the sub-functions that require the most stack space as noinline_for_stack gives us reasonable behavior on all compilers. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-07bpf: Stop the psock parser before canceling its workJakub Sitnicki
We might have never enabled (started) the psock's parser, in which case it will not get stopped when destroying the psock. This leads to a warning when trying to cancel parser's work from psock's deferred destructor: [ 405.325769] WARNING: CPU: 1 PID: 3216 at net/strparser/strparser.c:526 strp_done+0x3c/0x40 [ 405.326712] Modules linked in: [last unloaded: test_bpf] [ 405.327359] CPU: 1 PID: 3216 Comm: kworker/1:164 Tainted: G W 5.0.0 #42 [ 405.328294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014 [ 405.329712] Workqueue: events sk_psock_destroy_deferred [ 405.330254] RIP: 0010:strp_done+0x3c/0x40 [ 405.330706] Code: 28 e8 b8 d5 6b ff 48 8d bb 80 00 00 00 e8 9c d5 6b ff 48 8b 7b 18 48 85 ff 74 0d e8 1e a5 e8 ff 48 c7 43 18 00 00 00 00 5b c3 <0f> 0b eb cf 66 66 66 66 90 55 89 f5 53 48 89 fb 48 83 c7 28 e8 0b [ 405.332862] RSP: 0018:ffffc900026bbe50 EFLAGS: 00010246 [ 405.333482] RAX: ffffffff819323e0 RBX: ffff88812cb83640 RCX: ffff88812cb829e8 [ 405.334228] RDX: 0000000000000001 RSI: ffff88812cb837e8 RDI: ffff88812cb83640 [ 405.335366] RBP: ffff88813fd22680 R08: 0000000000000000 R09: 000073746e657665 [ 405.336472] R10: 8080808080808080 R11: 0000000000000001 R12: ffff88812cb83600 [ 405.337760] R13: 0000000000000000 R14: ffff88811f401780 R15: ffff88812cb837e8 [ 405.338777] FS: 0000000000000000(0000) GS:ffff88813fd00000(0000) knlGS:0000000000000000 [ 405.339903] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 405.340821] CR2: 00007fb11489a6b8 CR3: 000000012d4d6000 CR4: 00000000000406e0 [ 405.341981] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 405.343131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 405.344415] Call Trace: [ 405.344821] sk_psock_destroy_deferred+0x23/0x1b0 [ 405.345585] process_one_work+0x1ae/0x3e0 [ 405.346110] worker_thread+0x3c/0x3b0 [ 405.346576] ? pwq_unbound_release_workfn+0xd0/0xd0 [ 405.347187] kthread+0x11d/0x140 [ 405.347601] ? __kthread_parkme+0x80/0x80 [ 405.348108] ret_from_fork+0x35/0x40 [ 405.348566] ---[ end trace a4a3af4026a327d4 ]--- Stop psock's parser just before canceling its work. Fixes: 1d79895aef18 ("sk_msg: Always cancel strp work before freeing the psock") Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-03-07net: fix GSO in bpf_lwt_push_ip_encapPeter Oskolkov
GSO needs inner headers and inner protocol set properly to work. skb->inner_mac_header: skb_reset_inner_headers() assigns the current mac header value to inner_mac_header; but it is not set at the point, so we need to call skb_reset_inner_mac_header, otherwise gre_gso_segment fails: it does int tnl_hlen = skb_inner_mac_header(skb) - skb_transport_header(skb); ... if (unlikely(!pskb_may_pull(skb, tnl_hlen))) ... skb->inner_protocol should also be correctly set. Fixes: ca78801a81e0 ("bpf: handle GSO in bpf_lwt_push_encap") Signed-off-by: Peter Oskolkov <posk@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>