summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2016-02-16ipv4: Namespaceify ip_default_ttl sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16tcp: add tcpi_min_rtt and tcpi_notsent_bytes to tcp_infoEric Dumazet
tcpi_min_rtt reports the minimal rtt observed by TCP stack for the flow, in usec unit. Might be ~0U if not yet known. tcpi_notsent_bytes reports the amount of bytes in the write queue that were not yet sent. This is done in a single patch to not add a temporary 32bit padding hole in tcp_info. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16tcp: md5: release request socket instead of listenerEric Dumazet
If tcp_v4_inbound_md5_hash() returns an error, we must release the refcount on the request socket, not on the listener. The bug was added for IPv4 only. Fixes: 079096f103fac ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16net/ipv4: add dst cache support for gre lwtunnelsPaolo Abeni
In case of UDP traffic with datagram length below MTU this gives about 4% performance increase Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16net: add dst_cache to ovs vxlan lwtunnelPaolo Abeni
In case of UDP traffic with datagram length below MTU this give about 2% performance increase when tunneling over ipv4 and about 60% when tunneling over ipv6 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16ip_tunnel: replace dst_cache with generic implementationPaolo Abeni
The current ip_tunnel cache implementation is prone to a race that will cause the wrong dst to be cached on cuncurrent dst cache miss and ip tunnel update via netlink. Replacing with the generic implementation fix the issue. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16net: replace dst_cache ip6_tunnel implementation with the generic onePaolo Abeni
This also fix a potential race into the existing tunnel code, which could lead to the wrong dst to be permanenty cached: CPU1: CPU2: <xmit on ip6_tunnel> <cache lookup fails> dst = ip6_route_output(...) <tunnel params are changed via nl> dst_cache_reset() // no effect, // the cache is empty dst_cache_set() // the wrong dst // is permanenty stored // into the cache With the new dst implementation the above race is not possible since the first cache lookup after dst_cache_reset will fail due to the timestamp check Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16net: add dst_cache supportPaolo Abeni
This patch add a generic, lockless dst cache implementation. The need for lock is avoided updating the dst cache fields only in per cpu scope, and requiring that the cache manipulation functions are invoked with the local bh disabled. The refresh_ts and reset_ts fields are used to ensure the cache consistency in case of cuncurrent cache update (dst_cache_set*) and reset operation (dst_cache_reset). Consider the following scenario: CPU1: CPU2: <cache lookup with emtpy cache: it fails> <get dst via uncached route lookup> <related configuration changes> dst_cache_reset() dst_cache_set() The dst entry set passed to dst_cache_set() should not be used for later dst cache lookup, because it's obtained using old configuration values. Since the refresh_ts is updated only on dst_cache lookup, the cached value in the above scenario will be discarded on the next lookup. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Suggested-and-acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16tcp: do not set rtt_min to 1Eric Dumazet
There are some cases where rtt_us derives from deltas of jiffies, instead of using usec timestamps. Since we want to track minimal rtt, better to assume a delta of 0 jiffie might be in fact be very close to 1 jiffie. It is kind of sad jiffies_to_usecs(1) calls a function instead of simply using a constant. Fixes: f672258391b42 ("tcp: track min RTT using windowed min-filter") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16net: dsa: remove phy_disconnect from error pathSascha Hauer
The phy has not been initialized, disconnecting it in the error path results in a NULL pointer exception. Drop the phy_disconnect from the error path. Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Neil Armstrong <narmstrong@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16tipc: refactor node xmit and fix memory leaksRichard Alpe
Refactor tipc_node_xmit() to fail fast and fail early. Fix several potential memory leaks in unexpected error paths. Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16tipc: fix premature addition of node to lookup tableJon Paul Maloy
In commit 5266698661401a ("tipc: let broadcast packet reception use new link receive function") we introduced a new per-node broadcast reception link instance. This link is created at the moment the node itself is created. Unfortunately, the allocation is done after the node instance has already been added to the node lookup hash table. This creates a potential race condition, where arriving broadcast packets are able to find and access the node before it has been fully initialized, and before the above mentioned link has been created. The result is occasional crashes in the function tipc_bcast_rcv(), which is trying to access the not-yet existing link. We fix this by deferring the addition of the node instance until after it has been fully initialized in the function tipc_node_create(). Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16bridge: mdb: avoid uninitialized variable warningArnd Bergmann
A recent change to the mdb code confused the compiler to the point where it did not realize that the port-group returned from br_mdb_add_group() is always valid when the function returns a nonzero return value, so we get a spurious warning: net/bridge/br_mdb.c: In function 'br_mdb_add': net/bridge/br_mdb.c:542:4: error: 'pg' may be used uninitialized in this function [-Werror=maybe-uninitialized] __br_mdb_notify(dev, entry, RTM_NEWMDB, pg); Slightly rearranging the code in br_mdb_add_group() makes the problem go away, as gcc is clever enough to see that both functions check for 'ret != 0'. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 9e8430f8d60d ("bridge: mdb: Passing the port-group pointer to br_mdb module") Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16net: Copy inner L3 and L4 headers as unaligned on GRE TEBAlexander Duyck
This patch corrects the unaligned accesses seen on GRE TEB tunnels when generating hash keys. Specifically what this patch does is make it so that we force the use of skb_copy_bits when the GRE inner headers will be unaligned due to NET_IP_ALIGNED being a non-zero value. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16ethtool: ensure channel counts are within bounds during SCHANNELSKeller, Jacob E
Add a sanity check to ensure that all requested channel sizes are within bounds, which should reduce errors in driver implementation. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16ethtool: correctly ensure {GS}CHANNELS doesn't conflict with GS{RXFH}Keller, Jacob E
Ethernet drivers implementing both {GS}RXFH and {GS}CHANNELS ethtool ops incorrectly allow SCHANNELS when it would conflict with the settings from SRXFH. This occurs because it is not possible for drivers to understand whether their Rx flow indirection table has been configured or is in the default state. In addition, drivers currently behave in various ways when increasing the number of Rx channels. Some drivers will always destroy the Rx flow indirection table when this occurs, whether it has been set by the user or not. Other drivers will attempt to preserve the table even if the user has never modified it from the default driver settings. Neither of these situation is desirable because it leads to unexpected behavior or loss of user configuration. The correct behavior is to simply return -EINVAL when SCHANNELS would conflict with the current Rx flow table settings. However, it should only do so if the current settings were modified by the user. If we required that the new settings never conflict with the current (default) Rx flow settings, we would force users to first reduce their Rx flow settings and then reduce the number of Rx channels. This patch proposes a solution implemented in net/core/ethtool.c which ensures that all drivers behave correctly. It checks whether the RXFH table has been configured to non-default settings, and stores this information in a private netdev flag. When the number of channels is requested to change, it first ensures that the current Rx flow table is not going to assign flows to now disabled channels. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contain a rather large batch for your net that includes accumulated bugfixes, they are: 1) Run conntrack cleanup from workqueue process context to avoid hitting soft lockup via watchdog for large tables. This is required by the IPv6 masquerading extension. From Florian Westphal. 2) Use original skbuff from nfnetlink batch when calling netlink_ack() on error since this needs to access the skb->sk pointer. 3) Incremental fix on top of recent Sasha Levin's lock fix for conntrack resizing. 4) Fix several problems in nfnetlink batch message header sanitization and error handling, from Phil Turnbull. 5) Select NF_DUP_IPV6 based on CONFIG_IPV6, from Arnd Bergmann. 6) Fix wrong signess in return values on nf_tables counter expression, from Anton Protopopov. Due to the NetDev 1.1 organization burden, I had no chance to pass up this to you any sooner in this release cycle, sorry about that. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16af_unix: Guard against other == sk in unix_dgram_sendmsgRainer Weikusat
The unix_dgram_sendmsg routine use the following test if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) { to determine if sk and other are in an n:1 association (either established via connect or by using sendto to send messages to an unrelated socket identified by address). This isn't correct as the specified address could have been bound to the sending socket itself or because this socket could have been connected to itself by the time of the unix_peer_get but disconnected before the unix_state_lock(other). In both cases, the if-block would be entered despite other == sk which might either block the sender unintentionally or lead to trying to unlock the same spin lock twice for a non-blocking send. Add a other != sk check to guard against this. Fixes: 7d267278a9ec ("unix: avoid use-after-free in ep_remove_wait_queue") Reported-By: Philipp Hahn <pmhahn@pmhahn.de> Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Tested-by: Philipp Hahn <pmhahn@pmhahn.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16af_unix: Don't set err in unix_stream_read_generic unless there was an errorRainer Weikusat
The present unix_stream_read_generic contains various code sequences of the form err = -EDISASTER; if (<test>) goto out; This has the unfortunate side effect of possibly causing the error code to bleed through to the final out: return copied ? : err; and then to be wrongly returned if no data was copied because the caller didn't supply a data buffer, as demonstrated by the program available at http://pad.lv/1540731 Change it such that err is only set if an error condition was detected. Fixes: 3822b5c2fc62 ("af_unix: Revert 'lock_interruptible' in stream receive code") Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com> Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16batman-adv: Avoid endless loop in bat-on-bat netdevice checkAndrew Lunn
batman-adv checks in different situation if a new device is already on top of a different batman-adv device. This is done by getting the iflink of a device and all its parent. It assumes that this iflink is always a parent device in an acyclic graph. But this assumption is broken by devices like veth which are actually a pair of two devices linked to each other. The recursive check would therefore get veth0 when calling dev_get_iflink on veth1. And it gets veth0 when calling dev_get_iflink with veth1. Creating a veth pair and loading batman-adv freezes parts of the system ip link add veth0 type veth peer name veth1 modprobe batman-adv An RCU stall will be detected on the system which cannot be fixed. INFO: rcu_sched self-detected stall on CPU 1: (5264 ticks this GP) idle=3e9/140000000000001/0 softirq=144683/144686 fqs=5249 (t=5250 jiffies g=46 c=45 q=43) Task dump for CPU 1: insmod R running task 0 247 245 0x00000008 ffffffff8151f140 ffffffff8107888e ffff88000fd141c0 ffffffff8151f140 0000000000000000 ffffffff81552df0 ffffffff8107b420 0000000000000001 ffff88000e3fa700 ffffffff81540b00 ffffffff8107d667 0000000000000001 Call Trace: <IRQ> [<ffffffff8107888e>] ? rcu_dump_cpu_stacks+0x7e/0xd0 [<ffffffff8107b420>] ? rcu_check_callbacks+0x3f0/0x6b0 [<ffffffff8107d667>] ? hrtimer_run_queues+0x47/0x180 [<ffffffff8107cf9d>] ? update_process_times+0x2d/0x50 [<ffffffff810873fb>] ? tick_handle_periodic+0x1b/0x60 [<ffffffff810290ae>] ? smp_trace_apic_timer_interrupt+0x5e/0x90 [<ffffffff813bbae2>] ? apic_timer_interrupt+0x82/0x90 <EOI> [<ffffffff812c3fd7>] ? __dev_get_by_index+0x37/0x40 [<ffffffffa0031f3e>] ? batadv_hard_if_event+0xee/0x3a0 [batman_adv] [<ffffffff812c5801>] ? register_netdevice_notifier+0x81/0x1a0 [...] This can be avoided by checking if two devices are each others parent and stopping the check in this situation. Fixes: b7eddd0b3950 ("batman-adv: prevent using any virtual device created on batman-adv as hard-interface") Signed-off-by: Andrew Lunn <andrew@lunn.ch> [sven@narfation.org: rewritten description, extracted fix] Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-16batman-adv: Only put orig_node_vlan list reference when removedSven Eckelmann
The batadv_orig_node_vlan reference counter in batadv_tt_global_size_mod can only be reduced when the list entry was actually removed. Otherwise the reference counter may reach zero when batadv_tt_global_size_mod is called from two different contexts for the same orig_node_vlan but only one context is actually removing the entry from the list. The release function for this orig_node_vlan is not called inside the vlan_list_lock spinlock protected region because the function batadv_tt_global_size_mod still holds a orig_node_vlan reference for the object pointer on the stack. Thus the actual release function (when required) will be called only at the end of the function. Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-16batman-adv: Only put gw_node list reference when removedSven Eckelmann
The batadv_gw_node reference counter in batadv_gw_node_update can only be reduced when the list entry was actually removed. Otherwise the reference counter may reach zero when batadv_gw_node_update is called from two different contexts for the same gw_node but only one context is actually removing the entry from the list. The release function for this gw_node is not called inside the list_lock spinlock protected region because the function batadv_gw_node_update still holds a gw_node reference for the object pointer on the stack. Thus the actual release function (when required) will be called only at the end of the function. Fixes: bd3524c14bd0 ("batman-adv: remove obsolete deleted attribute for gateway node") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-02-13vsock: Fix blocking ops call in prepare_to_waitLaura Abbott
We receoved a bug report from someone using vmware: WARNING: CPU: 3 PID: 660 at kernel/sched/core.c:7389 __might_sleep+0x7d/0x90() do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff810fa68d>] prepare_to_wait+0x2d/0x90 Modules linked in: vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event snd_ens1371 iosf_mbi gameport snd_rawmidi snd_ac97_codec ac97_bus snd_seq coretemp snd_seq_device snd_pcm snd_timer snd soundcore ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vmw_vmci vmw_balloon i2c_piix4 shpchp parport_pc parport acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc btrfs xor raid6_pq 8021q garp stp llc mrp crc32c_intel serio_raw mptspi vmwgfx drm_kms_helper ttm drm scsi_transport_spi mptscsih e1000 ata_generic mptbase pata_acpi CPU: 3 PID: 660 Comm: vmtoolsd Not tainted 4.2.0-0.rc1.git3.1.fc23.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014 0000000000000000 0000000049e617f3 ffff88006ac37ac8 ffffffff818641f5 0000000000000000 ffff88006ac37b20 ffff88006ac37b08 ffffffff810ab446 ffff880068009f40 ffffffff81c63bc0 0000000000000061 0000000000000000 Call Trace: [<ffffffff818641f5>] dump_stack+0x4c/0x65 [<ffffffff810ab446>] warn_slowpath_common+0x86/0xc0 [<ffffffff810ab4d5>] warn_slowpath_fmt+0x55/0x70 [<ffffffff8112551d>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [<ffffffff810fa68d>] ? prepare_to_wait+0x2d/0x90 [<ffffffff810fa68d>] ? prepare_to_wait+0x2d/0x90 [<ffffffff810da2bd>] __might_sleep+0x7d/0x90 [<ffffffff812163b3>] __might_fault+0x43/0xa0 [<ffffffff81430477>] copy_from_iter+0x87/0x2a0 [<ffffffffa039460a>] __qp_memcpy_to_queue+0x9a/0x1b0 [vmw_vmci] [<ffffffffa0394740>] ? qp_memcpy_to_queue+0x20/0x20 [vmw_vmci] [<ffffffffa0394757>] qp_memcpy_to_queue_iov+0x17/0x20 [vmw_vmci] [<ffffffffa0394d50>] qp_enqueue_locked+0xa0/0x140 [vmw_vmci] [<ffffffffa039593f>] vmci_qpair_enquev+0x4f/0xd0 [vmw_vmci] [<ffffffffa04847bb>] vmci_transport_stream_enqueue+0x1b/0x20 [vmw_vsock_vmci_transport] [<ffffffffa047ae05>] vsock_stream_sendmsg+0x2c5/0x320 [vsock] [<ffffffff810fabd0>] ? wake_atomic_t_function+0x70/0x70 [<ffffffff81702af8>] sock_sendmsg+0x38/0x50 [<ffffffff81702ff4>] SYSC_sendto+0x104/0x190 [<ffffffff8126e25a>] ? vfs_read+0x8a/0x140 [<ffffffff817042ee>] SyS_sendto+0xe/0x10 [<ffffffff8186d9ae>] entry_SYSCALL_64_fastpath+0x12/0x76 transport->stream_enqueue may call copy_to_user so it should not be called inside a prepare_to_wait. Narrow the scope of the prepare_to_wait to avoid the bad call. This also applies to vsock_stream_recvmsg as well. Reported-by: Vinson Lee <vlee@freedesktop.org> Tested-by: Vinson Lee <vlee@freedesktop.org> Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-13ipv4: fix memory leaks in ip_cmsg_send() callersEric Dumazet
Dmitry reported memory leaks of IP options allocated in ip_cmsg_send() when/if this function returns an error. Callers are responsible for the freeing. Many thanks to Dmitry for the report and diagnostic. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: ip_tunnel: remove 'csum_help' argument to iptunnel_handle_offloadsEdward Cree
All users now pass false, so we can remove it, and remove the code that was conditional upon it. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: gre: Implement LCO for GRE over IPv4Edward Cree
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12fou: enable LCO in FOU and GUEEdward Cree
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: udp: always set up for CHECKSUM_PARTIAL offloadEdward Cree
If the dst device doesn't support it, it'll get fixed up later anyway by validate_xmit_skb(). Also, this allows us to take advantage of LCO to avoid summing the payload multiple times. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: local checksum offload for encapsulationEdward Cree
The arithmetic properties of the ones-complement checksum mean that a correctly checksummed inner packet, including its checksum, has a ones complement sum depending only on whatever value was used to initialise the checksum field before checksumming (in the case of TCP and UDP, this is the ones complement sum of the pseudo header, complemented). Consequently, if we are going to offload the inner checksum with CHECKSUM_PARTIAL, we can compute the outer checksum based only on the packed data not covered by the inner checksum, and the initial value of the inner checksum field. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12tcp/dccp: better use of ephemeral ports in bind()Eric Dumazet
Implement strategy used in __inet_hash_connect() in opposite way : Try to find a candidate using odd ports, then fallback to even ports. We no longer disable BH for whole traversal, but one bucket at a time. We also use cond_resched() to yield cpu to other tasks if needed. I removed one indentation level and tried to mirror the loop we have in __inet_hash_connect() and variable names to ease code maintenance. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12tcp/dccp: better use of ephemeral ports in connect()Eric Dumazet
In commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range in connect()"), I added a very simple heuristic, so that we got better chances to use even ports, and allow bind() users to have more available slots. It gave nice results, but with more than 200,000 TCP sessions on a typical server, the ~30,000 ephemeral ports are still a rare resource. I chose to go a step further, by looking at all even ports, and if none was available, fallback to odd ports. The companion patch does the same in bind(), but in opposite way. I've seen exec times of up to 30ms on busy servers, so I no longer disable BH for the whole traversal, but only for each hash bucket. I also call cond_resched() to be gentle to other tasks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix BPF handling of branch offset adjustmnets on backjumps, from Daniel Borkmann. 2) Make sure selinux knows about SOCK_DESTROY netlink messages, from Lorenzo Colitti. 3) Fix openvswitch tunnel mtu regression, from David Wragg. 4) Fix ICMP handling of TCP sockets in syn_recv state, from Eric Dumazet. 5) Fix SCTP user hmacid byte ordering bug, from Xin Long. 6) Fix recursive locking in ipv6 addrconf, from Subash Abhinov Kasiviswanathan. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: bpf: fix branch offset adjustment on backjumps after patching ctx expansion vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices geneve: Relax MTU constraints vxlan: Relax MTU constraints flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen of: of_mdio: Add marvell, 88e1145 to whitelist of PHY compatibilities. selinux: nlmsgtab: add SOCK_DESTROY to the netlink mapping tables sctp: translate network order to host order when users get a hmacid enic: increment devcmd2 result ring in case of timeout tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs net:Add sysctl_max_skb_frags tcp: do not drop syn_recv on all icmp reports ipv6: fix a lockdep splat unix: correctly track in-flight fds in sending process user_struct update be2net maintainers' email addresses dwc_eth_qos: Reset hardware before PHY start ipv6: addrconf: Fix recursive spin lock call
2016-02-11net: bulk free SKBs that were delay free'ed due to IRQ contextJesper Dangaard Brouer
The network stack defers SKBs free, in-case free happens in IRQ or when IRQs are disabled. This happens in __dev_kfree_skb_irq() that writes SKBs that were free'ed during IRQ to the softirq completion queue (softnet_data.completion_queue). These SKBs are naturally delayed, and cleaned up during NET_TX_SOFTIRQ in function net_tx_action(). Take advantage of this a use the skb defer and flush API, as we are already in softirq context. For modern drivers this rarely happens. Although most drivers do call dev_kfree_skb_any(), which detects the situation and calls __dev_kfree_skb_irq() when needed. This due to netpoll can call from IRQ context. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11net: bulk free infrastructure for NAPI context, use napi_consume_skbJesper Dangaard Brouer
Discovered that network stack were hitting the kmem_cache/SLUB slowpath when freeing SKBs. Doing bulk free with kmem_cache_free_bulk can speedup this slowpath. NAPI context is a bit special, lets take advantage of that for bulk free'ing SKBs. In NAPI context we are running in softirq, which gives us certain protection. A softirq can run on several CPUs at once. BUT the important part is a softirq will never preempt another softirq running on the same CPU. This gives us the opportunity to access per-cpu variables in softirq context. Extend napi_alloc_cache (before only contained page_frag_cache) to be a struct with a small array based stack for holding SKBs. Introduce a SKB defer and flush API for accessing this. Introduce napi_consume_skb() as replacement for e.g. dev_consume_skb_any() when running in NAPI context. A small trick to handle/detect if we are called from netpoll is to see if budget is 0. In that case, we need to invoke dev_consume_skb_irq(). Joint work with Alexander Duyck. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11igmp: Namespacify igmp_qrv sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11igmp: Namespaceify igmp_llm_reports sysctl knobNikolay Borisov
This was initially introduced in df2cf4a78e488d26 ("IGMP: Inhibit reports for local multicast groups") by defining the sysctl in the ipv4_net_table array, however it was never implemented to be namespace aware. Fix this by changing the code accordingly. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11igmp: Namespaceify igmp_max_msf sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11igmp: Namespaceify igmp_max_memberships sysctl knobNikolay Borisov
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11openvswitch: allow management from inside user namespacesTycho Andersen
Operations with the GENL_ADMIN_PERM flag fail permissions checks because this flag means we call netlink_capable, which uses the init user ns. Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations which should be allowed inside a user namespace. The motivation for this is to be able to run openvswitch in unprivileged containers. I've tested this and it seems to work, but I really have no idea about the security consequences of this patch, so thoughts would be much appreciated. v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function v3: use separate ifs for UNS_ADMIN_PERM and ADMIN_PERM, instead of one massive one Reported-by: James Page <james.page@canonical.com> Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> CC: Eric Biederman <ebiederm@xmission.com> CC: Pravin Shelar <pshelar@ovn.org> CC: Justin Pettit <jpettit@nicira.com> CC: "David S. Miller" <davem@davemloft.net> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11rds: duplicate include net/tcp.hstephen hemminger
Duplicate include detected. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11net: Allow tunnels to use inner checksum offloads with outer checksums neededAlexander Duyck
This patch enables us to use inner checksum offloads if provided by hardware with outer checksums computed by software. It basically reduces encap_hdr_csum to an advisory flag for now, but based on the fact that SCTP may be getting segmentation support before long I thought we may want to keep it as it is possible we may need to support CRC32c and 1's compliment checksum in the same packet at some point in the future. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11udp: Use uh->len instead of skb->len to compute checksum in segmentationAlexander Duyck
The segmentation code was having to do a bunch of work to pull the skb->len and strip the udp header offset before the value could be used to adjust the checksum. Instead of doing all this work we can just use the value that goes into uh->len since that is the correct value with the correct byte order that we need anyway. By using this value we can save ourselves a bunch of pain as there is no need to do multiple byte swaps. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11udp: Clean up the use of flags in UDP segmentation offloadAlexander Duyck
This patch goes though and cleans up the logic related to several of the control flags used in UDP segmentation. Specifically the use of dont_encap isn't really needed as we can just check the skb for CHECKSUM_PARTIAL and if it isn't set then we don't need to update the internal headers. As such we can just drop that value. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11gre: Use inner_proto to obtain inner header protocolAlexander Duyck
Instead of parsing headers to determine the inner protocol we can just pull the value from inner_proto. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11gre: Use GSO flags to determine csum need instead of GRE flagsAlexander Duyck
This patch updates the gre checksum path to follow something much closer to the UDP checksum path. By doing this we can avoid needing to do as much header inspection and can just make use of the fields we were already reading in the sk_buff structure. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11net: Move skb_has_shared_frag check out of GRE code and into segmentationAlexander Duyck
The call skb_has_shared_frag is used in the GRE path and skb_checksum_help to verify that no frags can be modified by an external entity. This check really doesn't belong in the GRE path but in the skb_segment function itself. This way any protocol that might be segmented will be performing this check before attempting to offload a checksum to software. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11net: Store checksum result for offloaded GSO checksumsAlexander Duyck
This patch makes it so that we can offload the checksums for a packet up to a certain point and then begin computing the checksums via software. Setting this up is fairly straight forward as all we need to do is reset the values stored in csum and csum_start for the GSO context block. One complication for this is remote checksum offload. In order to allow the inner checksums to be offloaded while computing the outer checksum manually we needed to have some way of indicating that the offload wasn't real. In order to do that I replaced CHECKSUM_PARTIAL with CHECKSUM_UNNECESSARY in the case of us computing checksums for the outer header while skipping computing checksums for the inner headers. We clean up the ip_summed flag and set it to either CHECKSUM_PARTIAL or CHECKSUM_NONE once we hand the packet off to the next lower level. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11net: Update remote checksum segmentation to support use of GSO checksumAlexander Duyck
This patch addresses two main issues. First in the case of remote checksum offload we were avoiding dealing with scatter-gather issues. As a result it would be possible to assemble a series of frames that used frags instead of being linearized as they should have if remote checksum offload was enabled. Second I have updated the code so that we now let GSO take care of doing the checksum on the data itself and drop the special case that was added for remote checksum offload. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11net: Move GSO csum into SKB_GSO_CBAlexander Duyck
This patch moves the checksum maintained by GSO out of skb->csum and into the GSO context block in order to allow for us to work on outer checksums while maintaining the inner checksum offsets in the case of the inner checksum being offloaded, while the outer checksums will be computed. While updating the code I also did a minor cleanu-up on gso_make_checksum. The change is mostly to make it so that we store the values and compute the checksum instead of computing the checksum and then storing the values we needed to update. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11net: Drop unecessary enc_features variable from tunnel segmentation functionsAlexander Duyck
The enc_features variable isn't necessary since features isn't used anywhere after we create enc_features so instead just use a destructive AND on features itself and save ourselves the variable declaration. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>