summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-01-30net: stmmac: Send TSO packets always from Queue 0Jose Abreu
The number of TSO enabled channels in HW can be different than the number of total channels. There is no way to determined, at runtime, the number of TSO capable channels and its safe to assume that if TSO is enabled then at least channel 0 will be TSO capable. Lets always send TSO packets from Queue 0. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Cc: Joao Pinto <jpinto@synopsys.com> Cc: David S. Miller <davem@davemloft.net> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com> Cc: Alexandre Torgue <alexandre.torgue@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30net: stmmac: Fallback to Platform Data clock in Watchdog conversionJose Abreu
If we don't have DT then stmmac_clk will not be available. Let's add a new Platform Data field so that we can specify the refclk by this mean. This way we can still use the coalesce command in PCI based setups. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Cc: Joao Pinto <jpinto@synopsys.com> Cc: David S. Miller <davem@davemloft.net> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com> Cc: Alexandre Torgue <alexandre.torgue@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30ipvlan, l3mdev: fix broken l3s mode wrt local routesDaniel Borkmann
While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin, I ran into the issue that while l3 mode is working fine, l3s mode does not have any connectivity to kube-apiserver and hence all pods end up in Error state as well. The ipvlan master device sits on top of a bond device and hostns traffic to kube-apiserver (also running in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573 where the latter is the address of the bond0. While in l3 mode, a curl to https://10.152.183.1:443 or to https://139.178.29.207:37573 works fine from hostns, neither of them do in case of l3s. In the latter only a curl to https://127.0.0.1:37573 appeared to work where for local addresses of bond0 I saw kernel suddenly starting to emit ARP requests to query HW address of bond0 which remained unanswered and neighbor entries in INCOMPLETE state. These ARP requests only happen while in l3s. Debugging this further, I found the issue is that l3s mode is piggy- backing on l3 master device, and in this case local routes are using l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback"). I found that reverting them back into using the net->loopback_dev fixed ipvlan l3s connectivity and got everything working for the CNI. Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the l3mdev paper in [0] the only sole reason why ipvlan l3s is relying on l3 master device is to get the l3mdev_ip_rcv() receive hook for setting the dst entry of the input route without adding its own ipvlan specific hacks into the receive path, however, any l3 domain semantics beyond just that are breaking l3s operation. Note that ipvlan also has the ability to dynamically switch its internal operation from l3 to l3s for all ports via ipvlan_set_port_mode() at runtime. In any case, l3 vs l3s soley distinguishes itself by 'de-confusing' netfilter through switching skb->dev to ipvlan slave device late in NF_INET_LOCAL_IN before handing the skb to L4. Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which, if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook without any additional l3mdev semantics on top. This should also have minimal impact since dev->priv_flags is already hot in cache. With this set, l3s mode is working fine and I also get things like masquerading pod traffic on the ipvlan master properly working. [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback") Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Mahesh Bandewar <maheshb@google.com> Cc: David Ahern <dsa@cumulusnetworks.com> Cc: Florian Westphal <fw@strlen.de> Cc: Martynas Pumputis <m@lambda.lt> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30l2tp: fix reading optional fields of L2TPv3Jacob Wen
Use pskb_may_pull() to make sure the optional fields are in skb linear parts, so we can safely read them later. It's easy to reproduce the issue with a net driver that supports paged skb data. Just create a L2TPv3 over IP tunnel and then generates some network traffic. Once reproduced, rx err in /sys/kernel/debug/l2tp/tunnels will increase. Changes in v4: 1. s/l2tp_v3_pull_opt/l2tp_v3_ensure_opt_in_linear/ 2. s/tunnel->version != L2TP_HDR_VER_2/tunnel->version == L2TP_HDR_VER_3/ 3. Add 'Fixes' in commit messages. Changes in v3: 1. To keep consistency, move the code out of l2tp_recv_common. 2. Use "net" instead of "net-next", since this is a bug fix. Changes in v2: 1. Only fix L2TPv3 to make code simple. To fix both L2TPv3 and L2TPv2, we'd better refactor l2tp_recv_common. It's complicated to do so. 2. Reloading pointers after pskb_may_pull Fixes: f7faffa3ff8e ("l2tp: Add L2TPv3 protocol support") Fixes: 0d76751fad77 ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support") Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6") Signed-off-by: Jacob Wen <jian.w.wen@oracle.com> Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30tun: move the call to tun_set_real_num_queuesGeorge Amanakis
Call tun_set_real_num_queues() after the increment of tun->numqueues since the former depends on it. Otherwise, the number of queues is not correctly accounted for, which results to warnings similar to: "vnet0 selects TX queue 11, but real number of TX queues is 11". Fixes: 0b7959b62573 ("tun: publish tfile after it's fully initialized") Reported-and-tested-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30ipv6: sr: clear IP6CB(skb) on SRH ip4ip6 encapsulationYohei Kanemaru
skb->cb may contain data from previous layers (in an observed case IPv4 with L3 Master Device). In the observed scenario, the data in IPCB(skb)->frags was misinterpreted as IP6CB(skb)->frag_max_size, eventually caused an unexpected IPv6 fragmentation in ip6_fragment() through ip6_finish_output(). This patch clears IP6CB(skb), which potentially contains garbage data, on the SRH ip4ip6 encapsulation. Fixes: 32d99d0b6702 ("ipv6: sr: add support for ip4ip6 encapsulation") Signed-off-by: Yohei Kanemaru <yohei.kanemaru@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30Merge branch 'virtio_net-Fix-problems-around-XDP-tx-and-napi_tx'David S. Miller
Toshiaki Makita says: ==================== virtio_net: Fix problems around XDP tx and napi_tx While I'm looking into how to account standard tx counters on XDP tx processing, I found several bugs around XDP tx and napi_tx. Patch1: Fix oops on error path. Patch2 depends on this. Patch2: Fix memory corruption on freeing xdp_frames with napi_tx enabled. Patch3: Minor fix patch5 depends on. Patch4: Fix memory corruption on processing xdp_frames when XDP is disabled. Also patch5 depends on this. Patch5: Fix memory corruption on processing xdp_frames while XDP is being disabled. Patch6: Minor fix patch7 depends on. Patch7: Fix memory corruption on freeing sk_buff or xdp_frames when a normal queue is reused for XDP and vise versa. v2: - patch5: Make rcu_assign_pointer/synchronize_net conditional instead of _virtnet_set_queues. - patch7: Use napi_consume_skb() instead of dev_consume_skb_any() ==================== Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Differentiate sk_buff and xdp_frame on freeingToshiaki Makita
We do not reset or free up unused buffers when enabling/disabling XDP, so it can happen that xdp_frames are freed after disabling XDP or sk_buffs are freed after enabling XDP on xdp tx queues. Thus we need to handle both forms (xdp_frames and sk_buffs) regardless of XDP setting. One way to trigger this problem is to disable XDP when napi_tx is enabled. In that case, virtnet_xdp_set() calls virtnet_napi_enable() which kicks NAPI. The NAPI handler will call virtnet_poll_cleantx() which invokes free_old_xmit_skbs() for queues which have been used by XDP. Note that even with this change we need to keep skipping free_old_xmit_skbs() from NAPI handlers when XDP is enabled, because XDP tx queues do not aquire queue locks. - v2: Use napi_consume_skb() instead of dev_consume_skb_any() Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Use xdp_return_frame to free xdp_frames on destroying vqsToshiaki Makita
put_page() can work as a fallback for freeing xdp_frames, but the appropriate way is to use xdp_return_frame(). Fixes: cac320c850ef ("virtio_net: convert to use generic xdp_frame and xdp_return_frame API") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Don't process redirected XDP frames when XDP is disabledToshiaki Makita
Commit 8dcc5b0ab0ec ("virtio_net: fix ndo_xdp_xmit crash towards dev not ready for XDP") tried to avoid access to unexpected sq while XDP is disabled, but was not complete. There was a small window which causes out of bounds sq access in virtnet_xdp_xmit() while disabling XDP. An example case of - curr_queue_pairs = 6 (2 for SKB and 4 for XDP) - online_cpu_num = xdp_queue_paris = 4 when XDP is enabled: CPU 0 CPU 1 (Disabling XDP) (Processing redirected XDP frames) virtnet_xdp_xmit() virtnet_xdp_set() _virtnet_set_queues() set curr_queue_pairs (2) check if rq->xdp_prog is not NULL virtnet_xdp_sq(vi) qp = curr_queue_pairs - xdp_queue_pairs + smp_processor_id() = 2 - 4 + 1 = -1 sq = &vi->sq[qp] // out of bounds access set xdp_queue_pairs (0) rq->xdp_prog = NULL Basically we should not change curr_queue_pairs and xdp_queue_pairs while someone can read the values. Thus, when disabling XDP, assign NULL to rq->xdp_prog first, and wait for RCU grace period, then change xxx_queue_pairs. Note that we need to keep the current order when enabling XDP though. - v2: Make rcu_assign_pointer/synchronize_net conditional instead of _virtnet_set_queues. Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Fix out of bounds access of sqToshiaki Makita
When XDP is disabled, curr_queue_pairs + smp_processor_id() can be larger than max_queue_pairs. There is no guarantee that we have enough XDP send queues dedicated for each cpu when XDP is disabled, so do not count drops on sq in that case. Fixes: 5b8f3c8d30a6 ("virtio_net: Add XDP related stats") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Fix not restoring real_num_rx_queuesToshiaki Makita
When _virtnet_set_queues() failed we did not restore real_num_rx_queues. Fix this by placing the change of real_num_rx_queues after _virtnet_set_queues(). This order is also in line with virtnet_set_channels(). Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Don't call free_old_xmit_skbs for xdp_framesToshiaki Makita
When napi_tx is enabled, virtnet_poll_cleantx() called free_old_xmit_skbs() even for xdp send queue. This is bogus since the queue has xdp_frames, not sk_buffs, thus mangled device tx bytes counters because skb->len is meaningless value, and even triggered oops due to general protection fault on freeing them. Since xdp send queues do not aquire locks, old xdp_frames should be freed only in virtnet_xdp_xmit(), so just skip free_old_xmit_skbs() for xdp send queues. Similarly virtnet_poll_tx() called free_old_xmit_skbs(). This NAPI handler is called even without calling start_xmit() because cb for tx is by default enabled. Once the handler is called, it enabled the cb again, and then the handler would be called again. We don't need this handler for XDP, so don't enable cb as well as not calling free_old_xmit_skbs(). Also, we need to disable tx NAPI when disabling XDP, so virtnet_poll_tx() can safely access curr_queue_pairs and xdp_queue_pairs, which are not atomically updated while disabling XDP. Fixes: b92f1e6751a6 ("virtio-net: transmit napi") Fixes: 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30virtio_net: Don't enable NAPI when interface is downToshiaki Makita
Commit 4e09ff536284 ("virtio-net: disable NAPI only when enabled during XDP set") tried to fix inappropriate NAPI enabling/disabling when !netif_running(), but was not complete. On error path virtio_net could enable NAPI even when !netif_running(). This can cause enabling NAPI twice on virtnet_open(), which would trigger BUG_ON() in napi_enable(). Fixes: 4941d472bf95b ("virtio-net: do not reset during XDP set") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30Merge branch 'erspan-always-reports-output-key-to-userspace'David S. Miller
Lorenzo Bianconi says: ==================== erspan: always reports output key to userspace Erspan protocol relies on output key to set session id header field. However TUNNEL_KEY bit is cleared in order to not add key field to the external GRE header and so the configured o_key is not reported to userspace. Fix the issue adding TUNNEL_KEY bit to the o_flags parameter dumping device info ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30net: ip6_gre: always reports o_key to userspaceLorenzo Bianconi
As Erspan_v4, Erspan_v6 protocol relies on o_key to configure session id header field. However TUNNEL_KEY bit is cleared in ip6erspan_tunnel_xmit since ERSPAN protocol does not set the key field of the external GRE header and so the configured o_key is not reported to userspace. The issue can be triggered with the following reproducer: $ip link add ip6erspan1 type ip6erspan local 2000::1 remote 2000::2 \ key 1 seq erspan_ver 1 $ip link set ip6erspan1 up ip -d link sh ip6erspan1 ip6erspan1@NONE: <BROADCAST,MULTICAST> mtu 1422 qdisc noop state DOWN mode DEFAULT link/ether ba:ff:09:24:c3:0e brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500 ip6erspan remote 2000::2 local 2000::1 encaplimit 4 flowlabel 0x00000 ikey 0.0.0.1 iseq oseq Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in ip6gre_fill_info Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30net: ip_gre: always reports o_key to userspaceLorenzo Bianconi
Erspan protocol (version 1 and 2) relies on o_key to configure session id header field. However TUNNEL_KEY bit is cleared in erspan_xmit since ERSPAN protocol does not set the key field of the external GRE header and so the configured o_key is not reported to userspace. The issue can be triggered with the following reproducer: $ip link add erspan1 type erspan local 192.168.0.1 remote 192.168.0.2 \ key 1 seq erspan_ver 1 $ip link set erspan1 up $ip -d link sh erspan1 erspan1@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UNKNOWN mode DEFAULT link/ether 52:aa:99:95:9a:b5 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500 erspan remote 192.168.0.2 local 192.168.0.1 ttl inherit ikey 0.0.0.1 iseq oseq erspan_index 0 Fix the issue adding TUNNEL_KEY bit to the o_flags parameter in ipgre_fill_info Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30ucc_geth: Reset BQL queue when stopping deviceMathias Thore
After a timeout event caused by for example a broadcast storm, when the MAC and PHY are reset, the BQL TX queue needs to be reset as well. Otherwise, the device will exhibit severe performance issues even after the storm has ended. Co-authored-by: David Gounaris <david.gounaris@infinera.com> Signed-off-by: Mathias Thore <mathias.thore@infinera.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30Merge branch 'net-various-compat-ioctl-fixes'David S. Miller
Johannes Berg says: ==================== various compat ioctl fixes Back a long time ago, I already fixed a few of these by passing the size of the struct ifreq to do_sock_ioctl(). However, Robert found more cases, and now it won't be as simple because we'd have to pass that down all the way to e.g. bond_do_ioctl() which isn't really feasible. Therefore, restore the old code. While looking at why SIOCGIFNAME was broken, I realized that Al had removed that case - which had been handled in an explicit separate function - as well, and looking through his work at the time I saw that bond ioctls were also affected by the erroneous removal. I've restored SIOCGIFNAME and bond ioctls by going through the (now renamed) dev_ifsioc() instead of reintroducing their own helper functions, which I hope is correct but have only tested with SIOCGIFNAME. ==================== Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30net: socket: make bond ioctls go through compat_ifreq_ioctl()Johannes Berg
Same story as before, these use struct ifreq and thus need to be read with the shorter version to not cause faults. Cc: stable@vger.kernel.org Fixes: f92d4fc95341 ("kill bond_ioctl()") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30net: socket: fix SIOCGIFNAME in compatJohannes Berg
As reported by Robert O'Callahan in https://bugzilla.kernel.org/show_bug.cgi?id=202273 reverting the previous changes in this area broke the SIOCGIFNAME ioctl in compat again (I'd previously fixed it after his previous report of breakage in https://bugzilla.kernel.org/show_bug.cgi?id=199469). This is obviously because I fixed SIOCGIFNAME more or less by accident. Fix it explicitly now by making it pass through the restored compat translation code. Cc: stable@vger.kernel.org Fixes: 4cf808e7ac32 ("kill dev_ifname32()") Reported-by: Robert O'Callahan <robert@ocallahan.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30Revert "kill dev_ifsioc()"Johannes Berg
This reverts commit bf4405737f9f ("kill dev_ifsioc()"). This wasn't really unused as implied by the original commit, it still handles the copy to/from user differently, and the commit thus caused issues such as https://bugzilla.kernel.org/show_bug.cgi?id=199469 and https://bugzilla.kernel.org/show_bug.cgi?id=202273 However, deviating from a strict revert, rename dev_ifsioc() to compat_ifreq_ioctl() to be clearer as to its purpose and add a comment. Cc: stable@vger.kernel.org Fixes: bf4405737f9f ("kill dev_ifsioc()") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30Revert "socket: fix struct ifreq size in compat ioctl"Johannes Berg
This reverts commit 1cebf8f143c2 ("socket: fix struct ifreq size in compat ioctl"), it's a bugfix for another commit that I'll revert next. This is not a 'perfect' revert, I'm keeping some coding style intact rather than revert to the state with indentation errors. Cc: stable@vger.kernel.org Fixes: 1cebf8f143c2 ("socket: fix struct ifreq size in compat ioctl") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Need to save away the IV across tls async operations, from Dave Watson. 2) Upon successful packet processing, we should liberate the SKB with dev_consume_skb{_irq}(). From Yang Wei. 3) Only apply RX hang workaround on effected macb chips, from Harini Katakam. 4) Dummy netdev need a proper namespace assigned to them, from Josh Elsasser. 5) Some paths of nft_compat run lockless now, and thus we need to use a proper refcnt_t. From Florian Westphal. 6) Avoid deadlock in mlx5 by doing IRQ locking, from Moni Shoua. 7) netrom does not refcount sockets properly wrt. timers, fix that by using the sock timer API. From Cong Wang. 8) Fix locking of inexact inserts of xfrm policies, from Florian Westphal. 9) Missing xfrm hash generation bump, also from Florian. 10) Missing of_node_put() in hns driver, from Yonglong Liu. 11) Fix DN_IFREQ_SIZE, from Johannes Berg. 12) ip6mr notifier is invoked during traversal of wrong table, from Nir Dotan. 13) TX promisc settings not performed correctly in qed, from Manish Chopra. 14) Fix OOB access in vhost, from Jason Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits) MAINTAINERS: Add entry for XDP (eXpress Data Path) net: set default network namespace in init_dummy_netdev() net: b44: replace dev_kfree_skb_xxx by dev_consume_skb_xxx for drop profiles net: caif: call dev_consume_skb_any when skb xmit done net: 8139cp: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles net: macb: Apply RXUBR workaround only to versions with errata net: ti: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles net: apple: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles net: amd8111e: replace dev_kfree_skb_irq by dev_consume_skb_irq net: alteon: replace dev_kfree_skb_irq by dev_consume_skb_irq net: tls: Fix deadlock in free_resources tx net: tls: Save iv in tls_rec for async crypto requests vhost: fix OOB in get_rx_bufs() qed: Fix stack out of bounds bug qed: Fix system crash in ll2 xmit qed: Fix VF probe failure while FLR qed: Fix LACP pdu drops for VFs qed: Fix bug in tx promiscuous mode settings net: i825xx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles netfilter: ipt_CLUSTERIP: fix warning unused variable cn ...
2019-01-29MAINTAINERS: Add entry for XDP (eXpress Data Path)Jesper Dangaard Brouer
Add multiple people as maintainers for XDP, sorted alphabetically. XDP is also tied to driver level support and code, but we cannot add all drivers to the list. Instead K: and N: match on 'xdp' in hope to catch some of those changes in drivers. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29net: set default network namespace in init_dummy_netdev()Josh Elsasser
Assign a default net namespace to netdevs created by init_dummy_netdev(). Fixes a NULL pointer dereference caused by busy-polling a socket bound to an iwlwifi wireless device, which bumps the per-net BUSYPOLLRXPACKETS stat if napi_poll() received packets: BUG: unable to handle kernel NULL pointer dereference at 0000000000000190 IP: napi_busy_loop+0xd6/0x200 Call Trace: sock_poll+0x5e/0x80 do_sys_poll+0x324/0x5a0 SyS_poll+0x6c/0xf0 do_syscall_64+0x6b/0x1f0 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: 7db6b048da3b ("net: Commonize busy polling code to focus on napi_id instead of socket") Signed-off-by: Josh Elsasser <jelsasser@appneta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29net: b44: replace dev_kfree_skb_xxx by dev_consume_skb_xxx for drop profilesYang Wei
The skb should be freed by dev_consume_skb_any() in b44_start_xmit() when bounce_skb is used. The skb is be replaced by bounce_skb, so the original skb should be consumed(not drop). dev_consume_skb_irq() should be called in b44_tx() when skb xmit done. It makes drop profiles(dropwatch, perf) more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29net: caif: call dev_consume_skb_any when skb xmit doneYang Wei
The skb shouled be consumed when xmit done, it makes drop profiles (dropwatch, perf) more friendly. dev_kfree_skb_irq()/kfree_skb() shouled be replaced by dev_consume_skb_any(), it makes code cleaner. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29net: 8139cp: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profilesYang Wei
dev_consume_skb_irq() should be called in cp_tx() when skb xmit done. It makes drop profiles(dropwatch, perf) more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29net: macb: Apply RXUBR workaround only to versions with errataHarini Katakam
The interrupt handler contains a workaround for RX hang applicable to Zynq and AT91RM9200 only. Subsequent versions do not need this workaround. This workaround unnecessarily resets RX whenever RX used bit read is observed, which can be often under heavy traffic. There is no other action performed on RX UBR interrupt. Hence introduce a CAPS mask; enable this interrupt and workaround only on affected versions. Signed-off-by: Harini Katakam <harini.katakam@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: ti: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profilesYang Wei
dev_consume_skb_irq() should be called in cpmac_end_xmit() when xmit done. It makes drop profiles more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: apple: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profilesYang Wei
dev_consume_skb_irq() should be called in bmac_txdma_intr() when xmit done. It makes drop profiles more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: amd8111e: replace dev_kfree_skb_irq by dev_consume_skb_irqYang Wei
dev_consume_skb_irq() should be called in amd8111e_tx() when xmit done. It makes drop profiles more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: alteon: replace dev_kfree_skb_irq by dev_consume_skb_irqYang Wei
dev_consume_skb_irq() should be called in ace_tx_int() when xmit done. It makes drop profiles more friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: tls: Fix deadlock in free_resources txDave Watson
If there are outstanding async tx requests (when crypto returns EINPROGRESS), there is a potential deadlock: the tx work acquires the lock, while we cancel_delayed_work_sync() while holding the lock. Drop the lock while waiting for the work to complete. Fixes: a42055e8d2c30 ("Add support for async encryption of records...") Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: tls: Save iv in tls_rec for async crypto requestsDave Watson
aead_request_set_crypt takes an iv pointer, and we change the iv soon after setting it. Some async crypto algorithms don't save the iv, so we need to save it in the tls_rec for async requests. Found by hardcoding x64 aesni to use async crypto manager (to test the async codepath), however I don't think this combination can happen in the wild. Presumably other hardware offloads will need this fix, but there have been no user reports. Fixes: a42055e8d2c30 ("Add support for async encryption of records...") Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28vhost: fix OOB in get_rx_bufs()Jason Wang
After batched used ring updating was introduced in commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"). We tend to batch heads in vq->heads for more than one packet. But the quota passed to get_rx_bufs() was not correctly limited, which can result a OOB write in vq->heads. headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx, vhost_len, &in, vq_log, &log, likely(mergeable) ? UIO_MAXIOV : 1); UIO_MAXIOV was still used which is wrong since we could have batched used in vq->heads, this will cause OOB if the next buffer needs more than 960 (1024 (UIO_MAXIOV) - 64 (VHOST_NET_BATCH)) heads after we've batched 64 (VHOST_NET_BATCH) heads: Acked-by: Stefan Hajnoczi <stefanha@redhat.com> ============================================================================= BUG kmalloc-8k (Tainted: G B ): Redzone overwritten ----------------------------------------------------------------------------- INFO: 0x00000000fd93b7a2-0x00000000f0713384. First byte 0xa9 instead of 0xcc INFO: Allocated in alloc_pd+0x22/0x60 age=3933677 cpu=2 pid=2674 kmem_cache_alloc_trace+0xbb/0x140 alloc_pd+0x22/0x60 gen8_ppgtt_create+0x11d/0x5f0 i915_ppgtt_create+0x16/0x80 i915_gem_create_context+0x248/0x390 i915_gem_context_create_ioctl+0x4b/0xe0 drm_ioctl_kernel+0xa5/0xf0 drm_ioctl+0x2ed/0x3a0 do_vfs_ioctl+0x9f/0x620 ksys_ioctl+0x6b/0x80 __x64_sys_ioctl+0x11/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Slab 0x00000000d13e87af objects=3 used=3 fp=0x (null) flags=0x200000000010201 INFO: Object 0x0000000003278802 @offset=17064 fp=0x00000000e2e6652b Fixing this by allocating UIO_MAXIOV + VHOST_NET_BATCH iovs for vhost-net. This is done through set the limitation through vhost_dev_init(), then set_owner can allocate the number of iov in a per device manner. This fixes CVE-2018-16880. Fixes: e2b3b35eb989 ("vhost_net: batch used ring update in rx") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28Merge branch 'qed-Bug-fixes'David S. Miller
Manish Chopra says: ==================== qed: Bug fixes This series have SR-IOV and some general fixes. Please consider applying it to "net" ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28qed: Fix stack out of bounds bugManish Chopra
KASAN reported following bug in qed_init_qm_get_idx_from_flags due to inappropriate casting of "pq_flags". Fix the type of "pq_flags". [ 196.624707] BUG: KASAN: stack-out-of-bounds in qed_init_qm_get_idx_from_flags+0x1a4/0x1b8 [qed] [ 196.624712] Read of size 8 at addr ffff809b00bc7360 by task kworker/0:9/1712 [ 196.624714] [ 196.624720] CPU: 0 PID: 1712 Comm: kworker/0:9 Not tainted 4.18.0-60.el8.aarch64+debug #1 [ 196.624723] Hardware name: To be filled by O.E.M. Saber/Saber, BIOS 0ACKL024 09/26/2018 [ 196.624733] Workqueue: events work_for_cpu_fn [ 196.624738] Call trace: [ 196.624742] dump_backtrace+0x0/0x2f8 [ 196.624745] show_stack+0x24/0x30 [ 196.624749] dump_stack+0xe0/0x11c [ 196.624755] print_address_description+0x68/0x260 [ 196.624759] kasan_report+0x178/0x340 [ 196.624762] __asan_report_load_n_noabort+0x38/0x48 [ 196.624786] qed_init_qm_get_idx_from_flags+0x1a4/0x1b8 [qed] [ 196.624808] qed_init_qm_info+0xec0/0x2200 [qed] [ 196.624830] qed_resc_alloc+0x284/0x7e8 [qed] [ 196.624853] qed_slowpath_start+0x6cc/0x1ae8 [qed] [ 196.624864] __qede_probe.isra.10+0x1cc/0x12c0 [qede] [ 196.624874] qede_probe+0x78/0xf0 [qede] [ 196.624879] local_pci_probe+0xc4/0x180 [ 196.624882] work_for_cpu_fn+0x54/0x98 [ 196.624885] process_one_work+0x758/0x1900 [ 196.624888] worker_thread+0x4e0/0xd18 [ 196.624892] kthread+0x2c8/0x350 [ 196.624897] ret_from_fork+0x10/0x18 [ 196.624899] [ 196.624902] Allocated by task 2: [ 196.624906] kasan_kmalloc.part.1+0x40/0x108 [ 196.624909] kasan_kmalloc+0xb4/0xc8 [ 196.624913] kasan_slab_alloc+0x14/0x20 [ 196.624916] kmem_cache_alloc_node+0x1dc/0x480 [ 196.624921] copy_process.isra.1.part.2+0x1d8/0x4a98 [ 196.624924] _do_fork+0x150/0xfa0 [ 196.624926] kernel_thread+0x48/0x58 [ 196.624930] kthreadd+0x3a4/0x5a0 [ 196.624932] ret_from_fork+0x10/0x18 [ 196.624934] [ 196.624937] Freed by task 0: [ 196.624938] (stack is not available) [ 196.624940] [ 196.624943] The buggy address belongs to the object at ffff809b00bc0000 [ 196.624943] which belongs to the cache thread_stack of size 32768 [ 196.624946] The buggy address is located 29536 bytes inside of [ 196.624946] 32768-byte region [ffff809b00bc0000, ffff809b00bc8000) [ 196.624948] The buggy address belongs to the page: [ 196.624952] page:ffff7fe026c02e00 count:1 mapcount:0 mapping:ffff809b4001c000 index:0x0 compound_mapcount: 0 [ 196.624960] flags: 0xfffff8000008100(slab|head) [ 196.624967] raw: 0fffff8000008100 dead000000000100 dead000000000200 ffff809b4001c000 [ 196.624970] raw: 0000000000000000 0000000000080008 00000001ffffffff 0000000000000000 [ 196.624973] page dumped because: kasan: bad access detected [ 196.624974] [ 196.624976] Memory state around the buggy address: [ 196.624980] ffff809b00bc7200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 196.624983] ffff809b00bc7280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 196.624985] >ffff809b00bc7300: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 04 f2 f2 f2 [ 196.624988] ^ [ 196.624990] ffff809b00bc7380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 196.624993] ffff809b00bc7400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 196.624995] ================================================================== Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28qed: Fix system crash in ll2 xmitManish Chopra
Cache number of fragments in the skb locally as in case of linear skb (with zero fragments), tx completion (or freeing of skb) may happen before driver tries to get number of frgaments from the skb which could lead to stale access to an already freed skb. Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28qed: Fix VF probe failure while FLRManish Chopra
VFs may hit VF-PF channel timeout while probing, as in some cases it was observed that VF FLR and VF "acquire" message transaction (i.e first message from VF to PF in VF's probe flow) could occur simultaneously which could lead VF to fail sending "acquire" message to PF as VF is marked disabled from HW perspective due to FLR, which will result into channel timeout and VF probe failure. In such cases, try retrying VF "acquire" message so that in later attempts it could be successful to pass message to PF after the VF FLR is completed and can be probed successfully. Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28qed: Fix LACP pdu drops for VFsManish Chopra
VF is always configured to drop control frames (with reserved mac addresses) but to work LACP on the VFs, it would require LACP control frames to be forwarded or transmitted successfully. This patch fixes this in such a way that trusted VFs (marked through ndo_set_vf_trust) would be allowed to pass the control frames such as LACP pdus. Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28qed: Fix bug in tx promiscuous mode settingsManish Chopra
When running tx switched traffic between VNICs created via a bridge(to which VFs are added), adapter drops the unicast packets in tx flow due to VNIC's ucast mac being unknown to it. But VF interfaces being in promiscuous mode should have caused adapter to accept all the unknown ucast packets. Later, it was found that driver doesn't really configure tx promiscuous mode settings to accept all unknown unicast macs. This patch fixes tx promiscuous mode settings to accept all unknown/unmatched unicast macs and works out the scenario. Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28net: i825xx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profilesYang Wei
dev_consume_skb_irq() should be called in i596_interrupt() when skb xmit done. It makes drop profiles(dropwatch, perf) more friendly. Signed-off-by: Yang Wei <albin_yang@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree: 1) The nftnl mutex is now per-netns, therefore use reference counter for matches and targets to deal with concurrent updates from netns. Moreover, place extensions in a pernet list. Patches from Florian Westphal. 2) Bail out with EINVAL in case of negative timeouts via setsockopt() through ip_vs_set_timeout(), from ZhangXiaoxu. 3) Spurious EINVAL on ebtables 32bit binary with 64bit kernel, also from Florian. 4) Reset TCP option header parser in case of fingerprint mismatch, otherwise follow up overlapping fingerprint definitions including TCP options do not work, from Fernando Fernandez Mancera. 5) Compilation warning in ipt_CLUSTER with CONFIG_PROC_FS unset. From Anders Roxell. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28Revert "mm, memory_hotplug: initialize struct pages for the full memory section"Michal Hocko
This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684. The underlying assumption that one sparse section belongs into a single numa node doesn't hold really. Robert Shteynfeld has reported a boot failure. The boot log was not captured but his memory layout is as follows: Early memory node ranges node 1: [mem 0x0000000000001000-0x0000000000090fff] node 1: [mem 0x0000000000100000-0x00000000dbdf8fff] node 1: [mem 0x0000000100000000-0x0000001423ffffff] node 0: [mem 0x0000001424000000-0x0000002023ffffff] This means that node0 starts in the middle of a memory section which is also in node1. memmap_init_zone tries to initialize padding of a section even when it is outside of the given pfn range because there are code paths (e.g. memory hotplug) which assume that the full worth of memory section is always initialized. In this particular case, though, such a range is already intialized and most likely already managed by the page allocator. Scribbling over those pages corrupts the internal state and likely blows up when any of those pages gets used. Reported-by: Robert Shteynfeld <robert.shteynfeld@gmail.com> Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section") Cc: stable@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-28netfilter: ipt_CLUSTERIP: fix warning unused variable cnAnders Roxell
When CONFIG_PROC_FS isn't set the variable cn isn't used. net/ipv4/netfilter/ipt_CLUSTERIP.c: In function ‘clusterip_net_exit’: net/ipv4/netfilter/ipt_CLUSTERIP.c:849:24: warning: unused variable ‘cn’ [-Wunused-variable] struct clusterip_net *cn = clusterip_pernet(net); ^~ Rework so the variable 'cn' is declared inside "#ifdef CONFIG_PROC_FS". Fixes: b12f7bad5ad3 ("netfilter: ipt_CLUSTERIP: remove wrong WARN_ON_ONCE in netns exit routine") Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28netfilter: nfnetlink_osf: add missing fmatch checkFernando Fernandez Mancera
When we check the tcp options of a packet and it doesn't match the current fingerprint, the tcp packet option pointer must be restored to its initial value in order to do the proper tcp options check for the next fingerprint. Here we can see an example. Assumming the following fingerprint base with two lines: S10:64:1:60:M*,S,T,N,W6: Linux:3.0::Linux 3.0 S20:64:1:60:M*,S,T,N,W7: Linux:4.19:arch:Linux 4.1 Where TCP options are the last field in the OS signature, all of them overlap except by the last one, ie. 'W6' versus 'W7'. In case a packet for Linux 4.19 kicks in, the osf finds no matching because the TCP options pointer is updated after checking for the TCP options in the first line. Therefore, reset pointer back to where it should be. Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28netfilter: ebtables: compat: un-break 32bit setsockopt when no rules are presentFlorian Westphal
Unlike ip(6)tables ebtables only counts user-defined chains. The effect is that a 32bit ebtables binary on a 64bit kernel can do 'ebtables -N FOO' only after adding at least one rule, else the request fails with -EINVAL. This is a similar fix as done in 3f1e53abff84 ("netfilter: ebtables: don't attempt to allocate 0-sized compat array"). Fixes: 7d7d7e02111e9 ("netfilter: compat: reject huge allocation requests") Reported-by: Francesco Ruggeri <fruggeri@arista.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-27net: dsa: mv88e6xxx: Fix serdes irq setup going recursiveAndrew Lunn
Duec to a typo, mv88e6390_serdes_irq_setup() calls itself, rather than mv88e6390x_serdes_irq_setup(). It then blows the stack, and shortly after the machine blows up. Fixes: 2defda1f4b91 ("net: dsa: mv88e6xxx: Add support for SERDES on ports 2-8 for 6390X") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>