summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2016-08-16tcp: consider recv buf for the initial window scaleSoheil Hassas Yeganeh
[ Upstream commit f626300a3e776ccc9671b0dd94698fb3aa315966 ] tcp_select_initial_window() intends to advertise a window scaling for the maximum possible window size. To do so, it considers the maximum of net.ipv4.tcp_rmem[2] and net.core.rmem_max as the only possible upper-bounds. However, users with CAP_NET_ADMIN can use SO_RCVBUFFORCE to set the socket's receive buffer size to values larger than net.ipv4.tcp_rmem[2] and net.core.rmem_max. Thus, SO_RCVBUFFORCE is effectively ignored by tcp_select_initial_window(). To fix this, consider the maximum of net.ipv4.tcp_rmem[2], net.core.rmem_max and socket's initial buffer space. Fixes: b0573dea1fb3 ("[NET]: Introduce SO_{SND,RCV}BUFFORCE socket options") Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Suggested-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16tcp: enable per-socket rate limiting of all 'challenge acks'Jason Baron
[ Upstream commit 083ae308280d13d187512b9babe3454342a7987e ] The per-socket rate limit for 'challenge acks' was introduced in the context of limiting ack loops: commit f2b2c582e824 ("tcp: mitigate ACK loops for connections as tcp_sock") And I think it can be extended to rate limit all 'challenge acks' on a per-socket basis. Since we have the global tcp_challenge_ack_limit, this patch allows for tcp_challenge_ack_limit to be set to a large value and effectively rely on the per-socket limit, or set tcp_challenge_ack_limit to a lower value and still prevents a single connections from consuming the entire challenge ack quota. It further moves in the direction of eliminating the global limit at some point, as Eric Dumazet has suggested. This a follow-up to: Subject: tcp: make challenge acks less predictable Cc: Eric Dumazet <edumazet@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Yue Cao <ycao009@ucr.edu> Signed-off-by: Jason Baron <jbaron@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-16tcp: make challenge acks less predictableEric Dumazet
[ Upstream commit 75ff39ccc1bd5d3c455b6822ab09e533c551f758 ] Yue Cao claims that current host rate limiting of challenge ACKS (RFC 5961) could leak enough information to allow a patient attacker to hijack TCP sessions. He will soon provide details in an academic paper. This patch increases the default limit from 100 to 1000, and adds some randomization so that the attacker can no longer hijack sessions without spending a considerable amount of probes. Based on initial analysis and patch from Linus. Note that we also have per socket rate limiting, so it is tempting to remove the host limit in the future. v2: randomize the count of challenge acks per second, not the period. Fixes: 282f23c6ee34 ("tcp: implement RFC 5961 3.2") Reported-by: Yue Cao <ycao009@ucr.edu> Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-11ipmr/ip6mr: Initialize the last assert time of mfc entries.Tom Goff
[ Upstream commit 70a0dec45174c976c64b4c8c1d0898581f759948 ] This fixes wrong-interface signaling on 32-bit platforms for entries created when jiffies > 2^31 + MFC_ASSERT_THRESH. Signed-off-by: Tom Goff <thomas.goff@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-11esp: Fix ESN generation under UDP encapsulationHerbert Xu
[ Upstream commit 962fcef33b03395051367181a0549d29d109d9a4 ] Blair Steven noticed that ESN in conjunction with UDP encapsulation is broken because we set the temporary ESP header to the wrong spot. This patch fixes this by first of all using the right spot, i.e., 4 bytes off the real ESP header, and then saving this information so that after encryption we can restore it properly. Fixes: 7021b2e1cddd ("esp4: Switch to new AEAD interface") Reported-by: Blair Steven <Blair.Steven@alliedtelesis.co.nz> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24netfilter: x_tables: introduce and use xt_copy_counters_from_userFlorian Westphal
commit d7591f0c41ce3e67600a982bab6989ef0f07b3ce upstream. The three variants use same copy&pasted code, condense this into a helper and use that. Make sure info.name is 0-terminated. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24netfilter: x_tables: do compat validation via translate_tableFlorian Westphal
commit 09d9686047dbbe1cf4faa558d3ecc4aae2046054 upstream. This looks like refactoring, but its also a bug fix. Problem is that the compat path (32bit iptables, 64bit kernel) lacks a few sanity tests that are done in the normal path. For example, we do not check for underflows and the base chain policies. While its possible to also add such checks to the compat path, its more copy&pastry, for instance we cannot reuse check_underflow() helper as e->target_offset differs in the compat case. Other problem is that it makes auditing for validation errors harder; two places need to be checked and kept in sync. At a high level 32 bit compat works like this: 1- initial pass over blob: validate match/entry offsets, bounds checking lookup all matches and targets do bookkeeping wrt. size delta of 32/64bit structures assign match/target.u.kernel pointer (points at kernel implementation, needed to access ->compatsize etc.) 2- allocate memory according to the total bookkeeping size to contain the translated ruleset 3- second pass over original blob: for each entry, copy the 32bit representation to the newly allocated memory. This also does any special match translations (e.g. adjust 32bit to 64bit longs, etc). 4- check if ruleset is free of loops (chase all jumps) 5-first pass over translated blob: call the checkentry function of all matches and targets. The alternative implemented by this patch is to drop steps 3&4 from the compat process, the translation is changed into an intermediate step rather than a full 1:1 translate_table replacement. In the 2nd pass (step #3), change the 64bit ruleset back to a kernel representation, i.e. put() the kernel pointer and restore ->u.user.name . This gets us a 64bit ruleset that is in the format generated by a 64bit iptables userspace -- we can then use translate_table() to get the 'native' sanity checks. This has two drawbacks: 1. we re-validate all the match and target entry structure sizes even though compat translation is supposed to never generate bogus offsets. 2. we put and then re-lookup each match and target. THe upside is that we get all sanity tests and ruleset validations provided by the normal path and can remove some duplicated compat code. iptables-restore time of autogenerated ruleset with 300k chains of form -A CHAIN0001 -m limit --limit 1/s -j CHAIN0002 -A CHAIN0002 -m limit --limit 1/s -j CHAIN0003 shows no noticeable differences in restore times: old: 0m30.796s new: 0m31.521s 64bit: 0m25.674s Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24netfilter: x_tables: xt_compat_match_from_user doesn't need a retvalFlorian Westphal
commit 0188346f21e6546498c2a0f84888797ad4063fc5 upstream. Always returned 0. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24netfilter: ip_tables: simplify translate_compat_table argsFlorian Westphal
commit 7d3f843eed29222254c9feab481f55175a1afcc9 upstream. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24netfilter: arp_tables: simplify translate_compat_table argsFlorian Westphal
commit 8dddd32756f6fe8e4e82a63361119b7e2384e02f upstream. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24netfilter: x_tables: check for bogus target offsetFlorian Westphal
commit ce683e5f9d045e5d67d1312a42b359cb2ab2a13c upstream. We're currently asserting that targetoff + targetsize <= nextoff. Extend it to also check that targetoff is >= sizeof(xt_entry). Since this is generic code, add an argument pointing to the start of the match/target, we can then derive the base structure size from the delta. We also need the e->elems pointer in a followup change to validate matches. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24netfilter: x_tables: add compat version of xt_check_entry_offsetsFlorian Westphal
commit fc1221b3a163d1386d1052184202d5dc50d302d1 upstream. 32bit rulesets have different layout and alignment requirements, so once more integrity checks get added to xt_check_entry_offsets it will reject well-formed 32bit rulesets. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24netfilter: x_tables: kill check_entry helperFlorian Westphal
commit aa412ba225dd3bc36d404c28cdc3d674850d80d0 upstream. Once we add more sanity testing to xt_check_entry_offsets it becomes relvant if we're expecting a 32bit 'config_compat' blob or a normal one. Since we already have a lot of similar-named functions (check_entry, compat_check_entry, find_and_check_entry, etc.) and the current incarnation is short just fold its contents into the callers. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24netfilter: x_tables: add and use xt_check_entry_offsetsFlorian Westphal
commit 7d35812c3214afa5b37a675113555259cfd67b98 upstream. Currently arp/ip and ip6tables each implement a short helper to check that the target offset is large enough to hold one xt_entry_target struct and that t->u.target_size fits within the current rule. Unfortunately these checks are not sufficient. To avoid adding new tests to all of ip/ip6/arptables move the current checks into a helper, then extend this helper in followup patches. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24netfilter: x_tables: validate targets of jumpsFlorian Westphal
commit 36472341017529e2b12573093cc0f68719300997 upstream. When we see a jump also check that the offset gets us to beginning of a rule (an ipt_entry). The extra overhead is negible, even with absurd cases. 300k custom rules, 300k jumps to 'next' user chain: [ plus one jump from INPUT to first userchain ]: Before: real 0m24.874s user 0m7.532s sys 0m16.076s After: real 0m27.464s user 0m7.436s sys 0m18.840s Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24netfilter: x_tables: don't move to non-existent next ruleFlorian Westphal
commit f24e230d257af1ad7476c6e81a8dc3127a74204e upstream. Ben Hawkes says: In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it is possible for a user-supplied ipt_entry structure to have a large next_offset field. This field is not bounds checked prior to writing a counter value at the supplied offset. Base chains enforce absolute verdict. User defined chains are supposed to end with an unconditional return, xtables userspace adds them automatically. But if such return is missing we will move to non-existent next rule. Reported-by: Ben Hawkes <hawkes@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24ipv4: Fix non-initialized TTL when CONFIG_SYSCTL=nEzequiel Garcia
[ Upstream commit 049bbf589ec651685205bd8ce73221fdd62345cf ] Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob") moves the default TTL assignment, and as side-effect IPv4 TTL now has a default value only if sysctl support is enabled (CONFIG_SYSCTL=y). The sysctl_ip_default_ttl is fundamental for IP to work properly, as it provides the TTL to be used as default. The defautl TTL may be used in ip_selected_ttl, through the following flow: ip_select_ttl ip4_dst_hoplimit net->ipv4.sysctl_ip_default_ttl This commit fixes the issue by assigning net->ipv4.sysctl_ip_default_ttl in net_init_net, called during ipv4's initialization. Without this commit, a kernel built without sysctl support will send all IP packets with zero TTL (unless a TTL is explicitly set, e.g. with setsockopt). Given a similar issue might appear on the other knobs that were namespaceify, this commit also moves them. Fixes: fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob") Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24udp: prevent skbs lingering in tunnel socket queuesHannes Frederic Sowa
[ Upstream commit e5aed006be918af163eb397e45aa5ea6cefd5e01 ] In case we find a socket with encapsulation enabled we should call the encap_recv function even if just a udp header without payload is available. The callbacks are responsible for correctly verifying and dropping the packets. Also, in case the header validation fails for geneve and vxlan we shouldn't put the skb back into the socket queue, no one will pick them up there. Instead we can simply discard them in the respective encap_recv functions. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-14net/route: enforce hoplimit max valuePaolo Abeni
Currently, when creating or updating a route, no check is performed in both ipv4 and ipv6 code to the hoplimit value. The caller can i.e. set hoplimit to 256, and when such route will be used, packets will be sent with hoplimit/ttl equal to 0. This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4 ipv6 route code, substituting any value greater than 255 with 255. This is consistent with what is currently done for ADVMSS and MTU in the ipv4 code. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11gre: do not keep the GRE header around in collect medata modeJiri Benc
For ipgre interface in collect metadata mode, it doesn't make sense for the interface to be of ARPHRD_IPGRE type. The outer header of received packets is not needed, as all the information from it is present in metadata_dst. We already don't set ipgre_header_ops for collect metadata interfaces, which is the only consumer of mac_header pointing to the outer IP header. Just set the interface type to ARPHRD_NONE in collect metadata mode for ipgre (not gretap, that still correctly stays ARPHRD_ETHER) and reset mac_header. Fixes: a64b04d86d14 ("gre: do not assign header_ops in collect metadata mode") Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10tcp: refresh skb timestamp at retransmit timeEric Dumazet
In the very unlikely case __tcp_retransmit_skb() can not use the cloning done in tcp_transmit_skb(), we need to refresh skb_mstamp before doing the copy and transmit, otherwise TCP TS val will be an exact copy of original transmit. Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06udp_offload: Set encapsulation before inner completes.Jarno Rajahalme
UDP tunnel segmentation code relies on the inner offsets being set for an UDP tunnel GSO packet, but the inner *_complete() functions will set the inner offsets only if 'encapsulation' is set before calling them. Currently, udp_gro_complete() sets 'encapsulation' only after the inner *_complete() functions are done. This causes the inner offsets having invalid values after udp_gro_complete() returns, which in turn will make it impossible to properly segment the packet in case it needs to be forwarded, which would be visible to the user either as invalid packets being sent or as packet loss. This patch fixes this by setting skb's 'encapsulation' in udp_gro_complete() before calling into the inner complete functions, and by making each possible UDP tunnel gro_complete() callback set the inner_mac_header to the beginning of the tunnel payload. Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Reviewed-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06udp_tunnel: Remove redundant udp_tunnel_gro_complete().Jarno Rajahalme
The setting of the UDP tunnel GSO type is already performed by udp[46]_gro_complete(). Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2016-05-04 1) The flowcache can hit an OOM condition if too many entries are in the gc_list. Fix this by counting the entries in the gc_list and refuse new allocations if the value is too high. 2) The inner headers are invalid after a xfrm transformation, so reset the skb encapsulation field to ensure nobody tries access the inner headers. Otherwise tunnel devices stacked on top of xfrm may build the outer headers based on wrong informations. 3) Add pmtu handling to vti, we need it to report pmtu informations for local generated packets. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02gre: do not pull header in ICMP error processingJiri Benc
iptunnel_pull_header expects that IP header was already pulled; with this expectation, it pulls the tunnel header. This is not true in gre_err. Furthermore, ipv4_update_pmtu and ipv4_redirect expect that skb->data points to the IP header. We cannot pull the tunnel header in this path. It's just a matter of not calling iptunnel_pull_header - we don't need any of its effects. Fixes: bda7bb463436 ("gre: Allow multiple protocol listener for gre protocol.") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01soreuseport: Fix TCP listener hash collisionCraig Gallek
I forgot to include a check for listener port equality when deciding if two sockets should belong to the same reuseport group. This was not caught previously because it's only necessary when two listening sockets for the same user happen to hash to the same listener bucket. The same error does not exist in the UDP path. Fixes: c125e80b8868("soreuseport: fast reuseport TCP socket selection") Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-29ip_tunnel: fix preempt warning in ip tunnel creation/updatingPaolo Abeni
After the commit e09acddf873b ("ip_tunnel: replace dst_cache with generic implementation"), a preemption debug warning is triggered on ip4 tunnels updating; the dst cache helper needs to be invoked in unpreemptible context. We don't need to load the cache on tunnel update, so this commit fixes the warning replacing the load with a dst cache reset, which is preempt safe. Fixes: e09acddf873b ("ip_tunnel: replace dst_cache with generic implementation") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28gre: reject GUE and FOU in collect metadata modeJiri Benc
The collect metadata mode does not support GUE nor FOU. This might be implemented later; until then, we should reject such config. I think this is okay to be changed. It's unlikely anyone has such configuration (as it doesn't work anyway) and we may need a way to distinguish whether it's supported or not by the kernel later. For backwards compatibility with iproute2, it's not possible to just check the attribute presence (iproute2 always includes the attribute), the actual value has to be checked, too. Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28gre: build header correctly for collect metadata tunnelsJiri Benc
In ipgre (i.e. not gretap) + collect metadata mode, the skb was assumed to contain Ethernet header and was encapsulated as ETH_P_TEB. This is not the case, the interface is ARPHRD_IPGRE and the protocol to be used for encapsulation is skb->protocol. Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.") Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28gre: do not assign header_ops in collect metadata modeJiri Benc
In ipgre mode (i.e. not gretap) with collect metadata flag set, the tunnel is incorrectly assumed to be mGRE in NBMA mode (see commit 6a5f44d7a048c). This is not the case, we're controlling the encapsulation addresses by lwtunnel metadata. And anyway, assigning dev->header_ops in collect metadata mode does not make sense. Although it would be more user firendly to reject requests that specify both the collect metadata flag and a remote/local IP address, this would break current users of gretap or introduce ugly code and differences in handling ipgre and gretap configuration. Keep the current behavior of remote/local IP address being ignored in such case. v3: Back to v1, added explanation paragraph. v2: Reject configuration specifying both remote/local address and collect metadata flag. Fixes: 2e15ea390e6f4 ("ip_gre: Add support to collect tunnel metadata.") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24ipv4/fib: don't warn when primary address is missing if in_dev is deadPaolo Abeni
After commit fbd40ea0180a ("ipv4: Don't do expensive useless work during inetdev destroy.") when deleting an interface, fib_del_ifaddr() can be executed without any primary address present on the dead interface. The above is safe, but triggers some "bug: prim == NULL" warnings. This commit avoids warning if the in_dev is dead Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21tcp: Merge tx_flags and tskey in tcp_shifted_skbMartin KaFai Lau
After receiving sacks, tcp_shifted_skb() will collapse skbs if possible. tx_flags and tskey also have to be merged. This patch reuses the tcp_skb_collapse_tstamp() to handle them. BPF Output Before: ~~~~~ <no-output-due-to-missing-tstamp-event> BPF Output After: ~~~~~ <...>-2024 [007] d.s. 88.644374: : ee_data:14599 Packetdrill Script: ~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 0.200 write(4, ..., 1460) = 1460 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 write(4, ..., 13140) = 13140 0.200 > P. 1:1461(1460) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:14601(5840) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop> 0.300 > P. 1:1461(1460) ack 1 0.400 < . 1:1(0) ack 14601 win 257 0.400 close(4) = 0 0.400 > F. 14601:14601(0) ack 1 0.500 < F. 1:1(0) ack 14602 win 257 0.500 > . 14602:14602(0) ack 2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Tested-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21tcp: Merge tx_flags and tskey in tcp_collapse_retransMartin KaFai Lau
If two skbs are merged/collapsed during retransmission, the current logic does not merge the tx_flags and tskey. The end result is the SCM_TSTAMP_ACK timestamp could be missing for a packet. The patch: 1. Merge the tx_flags 2. Overwrite the prev_skb's tskey with the next_skb's tskey BPF Output Before: ~~~~~~ <no-output-due-to-missing-tstamp-event> BPF Output After: ~~~~~~ packetdrill-2092 [001] d.s. 453.998486: : ee_data:1459 Packetdrill Script: ~~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 0.200 write(4, ..., 730) = 730 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 write(4, ..., 730) = 730 +0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0 0.200 write(4, ..., 11680) = 11680 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 > P. 1:731(730) ack 1 0.200 > P. 731:1461(730) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:13141(4380) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop> 0.300 > P. 1:1461(1460) ack 1 0.400 < . 1:1(0) ack 13141 win 257 0.400 close(4) = 0 0.400 > F. 13141:13141(0) ack 1 0.500 < F. 1:1(0) ack 13142 win 257 0.500 > . 13142:13142(0) ack 2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Tested-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acksMartin KaFai Lau
Assuming SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received, it could incorrectly think that a skb has already been acked and queue a SCM_TSTAMP_ACK cmsg to the sk->sk_error_queue. In tcp_ack_tstamp(), it checks 'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'. If prior_snd_una == tcp_sk(sk)->snd_una like the following packetdrill script, between() returns true but the tskey is actually not acked. e.g. try between(3, 2, 1). The fix is to replace between() with one before() and one !before(). By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be removed. A packetdrill script is used to reproduce the dup ack scenario. Due to the lacking cmsg support in packetdrill (may be I cannot find it), a BPF prog is used to kprobe to sock_queue_err_skb() and print out the value of serr->ee.ee_data. Both the packetdrill and the bcc BPF script is attached at the end of this commit message. BPF Output Before Fix: ~~~~~~ <...>-2056 [001] d.s. 433.927987: : ee_data:1459 #incorrect packetdrill-2056 [001] d.s. 433.929563: : ee_data:1459 #incorrect packetdrill-2056 [001] d.s. 433.930765: : ee_data:1459 #incorrect packetdrill-2056 [001] d.s. 434.028177: : ee_data:1459 packetdrill-2056 [001] d.s. 434.029686: : ee_data:14599 BPF Output After Fix: ~~~~~~ <...>-2049 [000] d.s. 113.517039: : ee_data:1459 <...>-2049 [000] d.s. 113.517253: : ee_data:14599 BCC BPF Script: ~~~~~~ #!/usr/bin/env python from __future__ import print_function from bcc import BPF bpf_text = """ #include <uapi/linux/ptrace.h> #include <net/sock.h> #include <bcc/proto.h> #include <linux/errqueue.h> #ifdef memset #undef memset #endif int trace_err_skb(struct pt_regs *ctx) { struct sk_buff *skb = (struct sk_buff *)ctx->si; struct sock *sk = (struct sock *)ctx->di; struct sock_exterr_skb *serr; u32 ee_data = 0; if (!sk || !skb) return 0; serr = SKB_EXT_ERR(skb); bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data); bpf_trace_printk("ee_data:%u\\n", ee_data); return 0; }; """ b = BPF(text=bpf_text) b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb") print("Attached to kprobe") b.trace_print() Packetdrill Script: ~~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 write(4, ..., 1460) = 1460 0.200 write(4, ..., 13140) = 13140 0.200 > P. 1:1461(1460) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:14601(5840) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop> 0.300 > P. 1:1461(1460) ack 1 0.400 < . 1:1(0) ack 14601 win 257 0.400 close(4) = 0 0.400 > F. 14601:14601(0) ack 1 0.500 < F. 1:1(0) ack 14602 win 257 0.500 > . 14602:14602(0) ack 2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Tested-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14soreuseport: fix ordering for mixed v4/v6 socketsCraig Gallek
With the SO_REUSEPORT socket option, it is possible to create sockets in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address. This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on the AF_INET6 sockets. Prior to the commits referenced below, an incoming IPv4 packet would always be routed to a socket of type AF_INET when this mixed-mode was used. After those changes, the same packet would be routed to the most recently bound socket (if this happened to be an AF_INET6 socket, it would have an IPv4 mapped IPv6 address). The change in behavior occurred because the recent SO_REUSEPORT optimizations short-circuit the socket scoring logic as soon as they find a match. They did not take into account the scoring logic that favors AF_INET sockets over AF_INET6 sockets in the event of a tie. To fix this problem, this patch changes the insertion order of AF_INET and AF_INET6 addresses in the TCP and UDP socket lists when the sockets have SO_REUSEPORT set. AF_INET sockets will be inserted at the head of the list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at the tail of the list. This will force AF_INET sockets to always be considered first. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection") Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13route: do not cache fib route info on local routes with oifChris Friesen
For local routes that require a particular output interface we do not want to cache the result. Caching the result causes incorrect behaviour when there are multiple source addresses on the interface. The end result being that if the intended recipient is waiting on that interface for the packet he won't receive it because it will be delivered on the loopback interface and the IP_PKTINFO ipi_ifindex will be set to the loopback interface as well. This can be tested by running a program such as "dhcp_release" which attempts to inject a packet on a particular interface so that it is received by another program on the same board. The receiving process should see an IP_PKTINFO ipi_ifndex value of the source interface (e.g., eth1) instead of the loopback interface (e.g., lo). The packet will still appear on the loopback interface in tcpdump but the important aspect is that the CMSG info is correct. Sample dhcp_release command line: dhcp_release eth1 192.168.204.222 02:11:33:22:44:66 Signed-off-by: Allain Legacy <allain.legacy@windriver.com> Signed off-by: Chris Friesen <chris.friesen@windriver.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree. More specifically, they are: 1) Fix missing filter table per-netns registration in arptables, from Florian Westphal. 2) Resolve out of bound access when parsing TCP options in nf_conntrack_tcp, patch from Jozsef Kadlecsik. 3) Prefer NFPROTO_BRIDGE extensions over NFPROTO_UNSPEC in ebtables, this resolves conflict between xt_limit and ebt_limit, from Phil Sutter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11net: vrf: Fix dst reference countingDavid Ahern
Vivek reported a kernel exception deleting a VRF with an active connection through it. The root cause is that the socket has a cached reference to a dst that is destroyed. Converting the dst_destroy to dst_release and letting proper reference counting kick in does not work as the dst has a reference to the device which needs to be released as well. I talked to Hannes about this at netdev and he pointed out the ipv4 and ipv6 dst handling has dst_ifdown for just this scenario. Rather than continuing with the reinvented dst wheel in VRF just remove it and leverage the ipv4 and ipv6 versions. Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver") Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOUAlexander Duyck
This patch fixes an issue I found in which we were dropping frames if we had enabled checksums on GRE headers that were encapsulated by either FOU or GUE. Without this patch I was barely able to get 1 Gb/s of throughput. With this patch applied I am now at least getting around 6 Gb/s. The issue is due to the fact that with FOU or GUE applied we do not provide a transport offset pointing to the GRE header, nor do we offload it in software as the GRE header is completely skipped by GSO and treated like a VXLAN or GENEVE type header. As such we need to prevent the stack from generating it and also prevent GRE from generating it via any interface we create. Fixes: c3483384ee511 ("gro: Allow tunnel stacking in the case of FOU/GUE") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07netfilter: arp_tables: register table in initnsFlorian Westphal
arptables is broken since we didn't register the table anymore -- even 'arptables -L' fails. Fixes: b9e69e127397187b ("netfilter: xtables: don't hook tables by default") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-31vti: Add pmtu handling to vti_xmit.Steffen Klassert
We currently rely on the PMTU discovery of xfrm. However if a packet is locally sent, the PMTU mechanism of xfrm tries to do local socket notification what might not work for applications like ping that don't check for this. So add pmtu handling to vti_xmit to report MTU changes immediately. Reported-by: Mark McKinstry <Mark.McKinstry@alliedtelesis.co.nz> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2016-03-30gro: Allow tunnel stacking in the case of FOU/GUEAlexander Duyck
This patch should fix the issues seen with a recent fix to prevent tunnel-in-tunnel frames from being generated with GRO. The fix itself is correct for now as long as we do not add any devices that support NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the potential to mess things up due to the fact that the outer transport header points to the outer UDP header and not the GRE header as would be expected. Fixes: fac8e0f579695 ("tunnels: Don't apply GRO to multiple layers of encapsulation.") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-28netfilter: ipv4: fix NULL dereferenceLiping Zhang
Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob") use sock_net(skb->sk) to get the net namespace, but we can't assume that sk_buff->sk is always exist, so when it is NULL, oops will happen. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Reviewed-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: enforce nul-terminated table name from getsockopt ↵Pablo Neira Ayuso
GET_ENTRIES Make sure the table names via getsockopt GET_ENTRIES is nul-terminated in ebtables and all the x_tables variants and their respective compat code. Uncovered by KASAN. Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: fix unconditional helperFlorian Westphal
Ben Hawkes says: In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it is possible for a user-supplied ipt_entry structure to have a large next_offset field. This field is not bounds checked prior to writing a counter value at the supplied offset. Problem is that mark_source_chains should not have been called -- the rule doesn't have a next entry, so its supposed to return an absolute verdict of either ACCEPT or DROP. However, the function conditional() doesn't work as the name implies. It only checks that the rule is using wildcard address matching. However, an unconditional rule must also not be using any matches (no -m args). The underflow validator only checked the addresses, therefore passing the 'unconditional absolute verdict' test, while mark_source_chains also tested for presence of matches, and thus proceeeded to the next (not-existent) rule. Unify this so that all the callers have same idea of 'unconditional rule'. Reported-by: Ben Hawkes <hawkes@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: make sure e->next_offset covers remaining blob sizeFlorian Westphal
Otherwise this function may read data beyond the ruleset blob. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: validate e->target_offset earlyFlorian Westphal
We should check that e->target_offset is sane before mark_source_chains gets called since it will fetch the target entry for loop detection. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-27Fix returned tc and hoplimit values for route with IPv6 encapsulationQuentin Armitage
For a route with IPv6 encapsulation, the traffic class and hop limit values are interchanged when returned to userspace by the kernel. For example, see below. ># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000 ># ip route show table 1000 192.168.0.1 encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2 scope link Signed-off-by: Quentin Armitage <quentin@armitage.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-23net: Reset encap_level to avoid resetting features on inner IP headersAlexander Duyck
This patch corrects an oversight in which we were allowing the encap_level value to pass from the outer headers to the inner headers. As a result we were incorrectly identifying UDP or GRE tunnels as also making use of ipip or sit when the second header actually represented a tunnel encapsulated in either a UDP or GRE tunnel which already had the features masked. Fixes: 76443456227097179c1482 ("net: Move GSO csum into SKB_GSO_CB") Reported-by: Tom Herbert <tom@herbertland.com> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-22ipv4: initialize flowi4_flags before calling fib_lookup()Lance Richardson
Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst() before calling fib_lookup(), which means fib_table_lookup() is using non-deterministic data at this line: if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) { Fix by initializing the entire fl4 structure, which will prevent similar issues as fields are added in the future by ensuring that all fields are initialized to zero unless explicitly initialized to another value. Fixes: 58189ca7b2741 ("net: Fix vti use case with oif in dst lookups") Suggested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>