summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2016-09-25netfilter: xt_hashlimit: Create revision 2 to support higher pps ratesVishwanath Pai
Create a new revision for the hashlimit iptables extension module. Rev 2 will support higher pps of upto 1 million, Version 1 supports only 10k. To support this we have to increase the size of the variables avg and burst in hashlimit_cfg to 64-bit. Create two new structs hashlimit_cfg2 and xt_hashlimit_mtinfo2 and also create newer versions of all the functions for match, checkentry and destroy. Some of the functions like hashlimit_mt, hashlimit_mt_check etc are very similar in both rev1 and rev2 with only minor changes, so I have split those functions and moved all the common code to a *_common function. Signed-off-by: Vishwanath Pai <vpai@akamai.com> Signed-off-by: Joshua Hunt <johunt@akamai.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25netfilter: xt_hashlimit: Prepare for revision 2Vishwanath Pai
I am planning to add a revision 2 for the hashlimit xtables module to support higher packets per second rates. This patch renames all the functions and variables related to revision 1 by adding _v1 at the end of the names. Signed-off-by: Vishwanath Pai <vpai@akamai.com> Signed-off-by: Joshua Hunt <johunt@akamai.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25netfilter: nft_ct: report error if mark and dir specified simultaneouslyLiping Zhang
NFT_CT_MARK is unrelated to direction, so if NFTA_CT_DIRECTION attr is specified, report EINVAL to the userspace. This validation check was already done at nft_ct_get_init, but we missed it in nft_ct_set_init. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25netfilter: nft_ct: unnecessary to require dir when use ct l3proto/protocolLiping Zhang
Currently, if the user want to match ct l3proto, we must specify the direction, for example: # nft add rule filter input ct original l3proto ipv4 ^^^^^^^^ Otherwise, error message will be reported: # nft add rule filter input ct l3proto ipv4 nft add rule filter input ct l3proto ipv4 <cmdline>:1:1-38: Error: Could not process rule: Invalid argument add rule filter input ct l3proto ipv4 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Actually, there's no need to require NFTA_CT_DIRECTION attr, because ct l3proto and protocol are unrelated to direction. And for compatibility, even if the user specify the NFTA_CT_DIRECTION attr, do not report error, just skip it. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25netfilter: seqadj: Fix the wrong ack adjust for the RST packet without ackGao Feng
It is valid that the TCP RST packet which does not set ack flag, and bytes of ack number are zero. But current seqadj codes would adjust the "0" ack to invalid ack number. Actually seqadj need to check the ack flag before adjust it for these RST packets. The following is my test case client is 10.26.98.245, and add one iptable rule: iptables -I INPUT -p tcp --sport 12345 -m connbytes --connbytes 2: --connbytes-dir reply --connbytes-mode packets -j REJECT --reject-with tcp-reset This iptables rule could generate on TCP RST without ack flag. server:10.172.135.55 Enable the synproxy with seqadjust by the following iptables rules iptables -t raw -A PREROUTING -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m tcp --syn -j CT --notrack iptables -A INPUT -i eth0 -p tcp -d 10.172.135.55 --dport 12345 -m conntrack --ctstate INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7 --mss 1460 iptables -A OUTPUT -o eth0 -p tcp -s 10.172.135.55 --sport 12345 -m conntrack --ctstate INVALID,UNTRACKED -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j ACCEPT The following is my test result. 1. packet trace on client root@routers:/tmp# tcpdump -i eth0 tcp port 12345 -n tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [S], seq 3695959829, win 29200, options [mss 1460,sackOK,TS val 452367884 ecr 0,nop,wscale 7], length 0 IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [S.], seq 546723266, ack 3695959830, win 0, options [mss 1460,sackOK,TS val 15643479 ecr 452367884, nop,wscale 7], length 0 IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [.], ack 1, win 229, options [nop,nop,TS val 452367885 ecr 15643479], length 0 IP 10.172.135.55.12345 > 10.26.98.245.45154: Flags [.], ack 1, win 226, options [nop,nop,TS val 15643479 ecr 452367885], length 0 IP 10.26.98.245.45154 > 10.172.135.55.12345: Flags [R], seq 3695959830, win 0, length 0 2. seqadj log on server [62873.867319] Adjusting sequence number from 602341895->546723267, ack from 3695959830->3695959830 [62873.867644] Adjusting sequence number from 602341895->546723267, ack from 3695959830->3695959830 [62873.869040] Adjusting sequence number from 3695959830->3695959830, ack from 0->55618628 To summarize, it is clear that the seqadj codes adjust the 0 ack when receive one TCP RST packet without ack. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25netfilter: replace list_head with single linked listAaron Conole
The netfilter hook list never uses the prev pointer, and so can be trimmed to be a simple singly-linked list. In addition to having a more light weight structure for hook traversal, struct net becomes 5568 bytes (down from 6400) and struct net_device becomes 2176 bytes (down from 2240). Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25Merge tag 'rxrpc-rewrite-20160924' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Implement slow-start and other bits This set of patches implements the RxRPC slow-start feature for AF_RXRPC to improve performance and handling of occasional packet loss. This is more or less the same as TCP slow start [RFC 5681]. Firstly, there are some ACK generation improvements: (1) Send ACKs regularly to apprise the peer of our state so that they can do congestion management of their own. (2) Send an ACK when we fill in a hole in the buffer so that the peer can find out that we did this thus forestalling retransmission. (3) Note the final DATA packet's serial number in the final ACK for correlation purposes. and a couple of bug fixes: (4) Reinitialise the ACK state and clear the ACK and resend timers upon entering the client reply reception phase to kill off any pending probe ACKs. (5) Delay the resend timer to allow for nsec->jiffies conversion errors. and then there's the slow-start pieces: (6) Summarise an ACK. (7) Schedule a PING or IDLE ACK if the reply to a client call is overdue to try and find out what happened to it. (8) Implement the slow start feature. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24gre: use nla_get_be32() to extract flowinfoLance Richardson
Eliminate a sparse endianness mismatch warning, use nla_get_be32() to extract a __be32 value instead of nla_get_u32(). Signed-off-by: Lance Richardson <lrichard@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24rxrpc: Implement slow-startDavid Howells
Implement RxRPC slow-start, which is similar to RFC 5681 for TCP. A tracepoint is added to log the state of the congestion management algorithm and the decisions it makes. Notes: (1) Since we send fixed-size DATA packets (apart from the final packet in each phase), counters and calculations are in terms of packets rather than bytes. (2) The ACK packet carries the equivalent of TCP SACK. (3) The FLIGHT_SIZE calculation in RFC 5681 doesn't seem particularly suited to SACK of a small number of packets. It seems that, almost inevitably, by the time three 'duplicate' ACKs have been seen, we have narrowed the loss down to one or two missing packets, and the FLIGHT_SIZE calculation ends up as 2. (4) In rxrpc_resend(), if there was no data that apparently needed retransmission, we transmit a PING ACK to ask the peer to tell us what its Rx window state is. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24rxrpc: Schedule an ACK if the reply to a client call appears overdueDavid Howells
If we've sent all the request data in a client call but haven't seen any sign of the reply data yet, schedule an ACK to be sent to the server to find out if the reply data got lost. If the server hasn't yet hard-ACK'd the request data, we send a PING ACK to demand a response to find out whether we need to retransmit. If the server says it has received all of the data, we send an IDLE ACK to tell the server that we haven't received anything in the receive phase as yet. To make this work, a non-immediate PING ACK must carry a delay. I've chosen the same as the IDLE ACK for the moment. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24rxrpc: Generate a summary of the ACK state for later useDavid Howells
Generate a summary of the Tx buffer packet state when an ACK is received for use in a later patch that does congestion management. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24rxrpc: Delay the resend timer to allow for nsec->jiffies conv errorDavid Howells
When determining the resend timer value, we have a value in nsec but the timer is in jiffies which may be a million or more times more coarse. nsecs_to_jiffies() rounds down - which means that the resend timeout expressed as jiffies is very likely earlier than the one expressed as nanoseconds from which it was derived. The problem is that rxrpc_resend() gets triggered by the timer, but can't then find anything to resend yet. It sets the timer again - but gets kicked off immediately again and again until the nanosecond-based expiry time is reached and we actually retransmit. Fix this by adding 1 to the jiffies-based resend_at value to counteract the rounding and make sure that the timer happens after the nanosecond-based expiry is passed. Alternatives would be to adjust the timestamp on the packets to align with the jiffie scale or to switch back to using jiffie-timestamps. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24rxrpc: Reinitialise the call ACK and timer state for client reply phaseDavid Howells
Clear the ACK reason, ACK timer and resend timer when entering the client reply phase when the first DATA packet is received. New ACKs will be proposed once the data is queued. The resend timer is no longer relevant and we need to cancel ACKs scheduled to probe for a lost reply. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24rxrpc: Include the last reply DATA serial number in the final ACKDavid Howells
In a client call, include the serial number of the last DATA packet of the reply in the final ACK. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24rxrpc: Send an immediate ACK if we fill in a holeDavid Howells
Send an immediate ACK if we fill in a hole in the buffer left by an out-of-sequence packet. This may allow the congestion management in the peer to avoid a retransmission if packets got reordered on the wire. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24netfilter: Only allow sane values in nf_register_net_hookAaron Conole
This commit adds an upfront check for sane values to be passed when registering a netfilter hook. This will be used in a future patch for a simplified hook list traversal. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24netfilter: Remove explicit rcu_read_lock in nf_hook_slowAaron Conole
All of the callers of nf_hook_slow already hold the rcu_read_lock, so this cleanup removes the recursive call. This is just a cleanup, as the locking code gracefully handles this situation. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24netfilter: call nf_hook_ingress with rcu_read_lockAaron Conole
This commit ensures that the rcu read-side lock is held while the ingress hook is called. This ensures that a call to nf_hook_slow (and ultimately nf_ingress) will be read protected. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24netfilter: bridge: add and use br_nf_hook_threshFlorian Westphal
This replaces the last uses of NF_HOOK_THRESH(). Followup patch will remove it and rename nf_hook_thresh. The reason is that inet (non-bridge) netfilter no longer invokes the hooks from hooks, so we do no longer need the thresh value to skip hooks with a lower priority. The bridge netfilter however may need to do this. br_nf_hook_thresh is a wrapper that is supposed to do this, i.e. only call hooks with a priority that exceeds NF_BR_PRI_BRNF. It's used only in the recursion cases of br_netfilter. It invokes nf_hook_slow while holding an rcu read-side critical section to make a future cleanup simpler. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24netfilter: xt_TCPMSS: Refactor the codes to decrease one condition check and ↵Gao Feng
more readable The origin codes perform two condition checks with dst_mtu(skb_dst(skb)) and in_mtu. And the last statement is "min(dst_mtu(skb_dst(skb)), in_mtu) - minlen". It may let reader think about how about the result. Would it be negative. Now assign the result of min(dst_mtu(skb_dst(skb)), in_mtu) to a new variable, then only perform one condition check, and it is more readable. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24rxrpc: Send an ACK after every few DATA packets we receiveDavid Howells
Send an ACK if we haven't sent one for the last two packets we've received. This keeps the other end apprised of where we've got to - which is important if they're doing slow-start. We do this in recvmsg so that we can dispatch a packet directly without the need to wake up the background thread. This should possibly be made configurable in future. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()Lance Richardson
Similar to commit 3be07244b733 ("ip6_gre: fix flowi6_proto value in xmit path"), set flowi6_proto to IPPROTO_GRE for output route lookup. Up until now, ip6gre_xmit_other() has set flowi6_proto to a bogus value. This affected output route lookup for packets sent on an ip6gretap device in cases where routing was dependent on the value of flowi6_proto. Since the correct proto is already set in the tunnel flowi6 template via commit 252f3f5a1189 ("ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit path."), simply delete the line setting the incorrect flowi6_proto value. Suggested-by: Jiri Benc <jbenc@redhat.com> Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Lance Richardson <lrichard@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24Merge tag 'rxrpc-rewrite-20160923' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Bug fixes and tracepoints Here are a bunch of bug fixes: (1) Need to set the timestamp on a Tx packet before queueing it to avoid trouble with the retransmission function. (2) Don't send an ACK at the end of the service reply transmission; it's the responsibility of the client to send an ACK to close the call. The service can resend the last DATA packet or send a PING ACK. (3) Wake sendmsg() on abnormal call termination. (4) Use ktime_add_ms() not ktime_add_ns() to add millisecond offsets. (5) Use before_eq() & co. to compare serial numbers (which may wrap). (6) Start the resend timer on DATA packet transmission. (7) Don't accidentally cancel a retransmission upon receiving a NACK. (8) Fix the call timer setting function to deal with timeouts that are now or past. (9) Don't use a flag to communicate the presence of the last packet in the Tx buffer from sendmsg to the input routines where ACK and DATA reception is handled. The problem is that there's a window between queueing the last packet for transmission and setting the flag in which ACKs or reply DATA packets can arrive, causing apparent state machine violation issues. Instead use the annotation buffer to mark the last packet and pick up and set the flag in the input routines. (10) Don't call the tx_ack tracepoint and don't allocate a serial number if someone else nicked the ACK we were about to transmit. There are also new tracepoints and one altered tracepoint used to track down the above bugs: (11) Call timer tracepoint. (12) Data Tx tracepoint (and adjustments to ACK tracepoint). (13) Injected Rx packet loss tracepoint. (14) Ack proposal tracepoint. (15) Retransmission selection tracepoint. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2016-09-23 Only two patches this time: 1) Fix a comment reference to struct xfrm_replay_state_esn. From Richard Guy Briggs. 2) Convert xfrm_state_lookup to rcu, we don't need the xfrm_state_lock anymore in the input path. From Florian Westphal. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24net: Update API for VF vlan protocol 802.1ad supportMoshe Shemesh
Introduce new rtnl UAPI that exposes a list of vlans per VF, giving the ability for user-space application to specify it for the VF, as an option to support 802.1ad. We adjusted IP Link tool to support this option. For future use cases, the new UAPI supports multiple vlans. For now we limit the list size to a single vlan in kernel. Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward compatibility with older versions of IP Link tool. Add a vlan protocol parameter to the ndo_set_vf_vlan callback. We kept 802.1Q as the drivers' default vlan protocol. Suitable ip link tool command examples: Set vf vlan protocol 802.1ad: ip link set eth0 vf 1 vlan 100 proto 802.1ad Set vf to VST (802.1Q) mode: ip link set eth0 vf 1 vlan 100 proto 802.1Q Or by omitting the new parameter ip link set eth0 vf 1 vlan 100 Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23netns: move {inc,dec}_net_namespaces into #ifdefArnd Bergmann
With the newly enforced limit on the number of namespaces, we get a build warning if CONFIG_NETNS is disabled: net/core/net_namespace.c:273:13: error: 'dec_net_namespaces' defined but not used [-Werror=unused-function] net/core/net_namespace.c:268:24: error: 'inc_net_namespaces' defined but not used [-Werror=unused-function] This moves the two added functions inside the #ifdef that guards their callers. Fixes: 703286608a22 ("netns: Add a limit on the number of net namespaces") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-23IB/core: add support to create a unsafe global rkey to ib_create_pdChristoph Hellwig
Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or less unchecked, this moves the capability of creating a global rkey into the RDMA core, where it can be easily audited. It also prints a warning everytime this feature is used as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-09-23rxrpc: Add a tracepoint to log which packets will be retransmittedDavid Howells
Add a tracepoint to log in rxrpc_resend() which packets will be retransmitted. Note that if a positive ACK comes in whilst we have dropped the lock to retransmit another packet, the actual retransmission may not happen, though some of the effects will (such as altering the congestion management). Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23rxrpc: Add tracepoint for ACK proposalDavid Howells
Add a tracepoint to log proposed ACKs, including whether the proposal is used to update a pending ACK or is discarded in favour of an easlier, higher priority ACK. Whilst we're at it, get rid of the rxrpc_acks() function and access the name array directly. We do, however, need to validate the ACK reason number given to trace_rxrpc_rx_ack() to make sure we don't overrun the array. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23rxrpc: Add a tracepoint to log injected Rx packet lossDavid Howells
Add a tracepoint to log received packets that get discarded due to Rx packet loss. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23rxrpc: Add data Tx tracepoint and adjust Tx ACK tracepointDavid Howells
Add a tracepoint to log transmission of DATA packets (including loss injection). Adjust the ACK transmission tracepoint to include the packet serial number and to line this up with the DATA transmission display. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23rxrpc: Add a tracepoint for the call timerDavid Howells
Add a tracepoint to log call timer initiation, setting and expiry. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23rxrpc: Don't call the tx_ack tracepoint if don't generate an ACKDavid Howells
rxrpc_send_call_packet() is invoking the tx_ack tracepoint before it checks whether there's an ACK to transmit (another thread may jump in and transmit it). Fix this by only invoking the tracepoint if we get a valid ACK to transmit. Further, only allocate a serial number if we're going to actually transmit something. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23rxrpc: Pass the last Tx packet marker in the annotation bufferDavid Howells
When the last packet of data to be transmitted on a call is queued, tx_top is set and then the RXRPC_CALL_TX_LAST flag is set. Unfortunately, this leaves a race in the ACK processing side of things because the flag affects the interpretation of tx_top and also allows us to start receiving reply data before we've finished transmitting. To fix this, make the following changes: (1) rxrpc_queue_packet() now sets a marker in the annotation buffer instead of setting the RXRPC_CALL_TX_LAST flag. (2) rxrpc_rotate_tx_window() detects the marker and sets the flag in the same context as the routines that use it. (3) rxrpc_end_tx_phase() is simplified to just shift the call state. The Tx window must have been rotated before calling to discard the last packet. (4) rxrpc_receiving_reply() is added to handle the arrival of the first DATA packet of a reply to a client call (which is an implicit ACK of the Tx phase). (5) The last part of rxrpc_input_ack() is reordered to perform Tx rotation, then soft-ACK application and then to end the phase if we've rotated the last packet. In the event of a terminal ACK, the soft-ACK application will be skipped as nAcks should be 0. (6) rxrpc_input_ackall() now has to rotate as well as ending the phase. In addition: (7) Alter the transmit tracepoint to log the rotation of the last packet. (8) Remove the no-longer relevant queue_reqack tracepoint note. The ACK-REQUESTED packet header flag is now set as needed when we actually transmit the packet and may vary by retransmission. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23rxrpc: Fix call timerDavid Howells
Fix the call timer in the following ways: (1) If call->resend_at or call->ack_at are before or equal to the current time, then ignore that timeout. (2) If call->expire_at is before or equal to the current time, then don't set the timer at all (possibly we should queue the call). (3) Don't skip modifying the timer if timer_pending() is true. This indicates that the timer is working, not that it has expired and is running/waiting to run its expiry handler. Also call rxrpc_set_timer() to start the call timer going rather than calling add_timer(). Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23rxrpc: Fix accidental cancellation of scheduled resend by ACK parserDavid Howells
When rxrpc_input_soft_acks() is parsing the soft-ACKs from an ACK packet, it updates the Tx packet annotations in the annotation buffer. If a soft-ACK is an ACK, then we overwrite unack'd, nak'd or to-be-retransmitted states and that is fine; but if the soft-ACK is an NACK, we overwrite the to-be-retransmitted with a nak - which isn't. Instead, we need to let any scheduled retransmission stand if the packet was NAK'd. Note that we don't reissue a resend if the annotation is in the to-be-retransmitted state because someone else must've scheduled the resend already. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23svcrdma: support Remote InvalidationChuck Lever
Support Remote Invalidation. A private message is exchanged with the client upon RDMA transport connect that indicates whether Send With Invalidation may be used by the server to send RPC replies. The invalidate_rkey is arbitrarily chosen from among rkeys present in the RPC-over-RDMA header's chunk lists. Send With Invalidate improves performance only when clients can recognize, while processing an RPC reply, that an rkey has already been invalidated. That has been submitted as a separate change. In the future, the RPC-over-RDMA protocol might support Remote Invalidation properly. The protocol needs to enable signaling between peers to indicate when Remote Invalidation can be used for each individual RPC. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23svcrdma: Server-side support for rpcrdma_connect_privateChuck Lever
Prepare to receive an RDMA-CM private message when handling a new connection attempt, and send a similar message as part of connection acceptance. Both sides can communicate their various implementation limits. Implementations that don't support this sideband protocol ignore it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23svcrdma: Skip put_page() when send_reply() failsChuck Lever
Message from syslogd@klimt at Aug 18 17:00:37 ... kernel:page:ffffea0020639b00 count:0 mapcount:0 mapping: (null) index:0x0 Aug 18 17:00:37 klimt kernel: flags: 0x2fffff80000000() Aug 18 17:00:37 klimt kernel: page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) Aug 18 17:00:37 klimt kernel: kernel BUG at /home/cel/src/linux/linux-2.6/include/linux/mm.h:445! Aug 18 17:00:37 klimt kernel: RIP: 0010:[<ffffffffa05c21c1>] svc_rdma_sendto+0x641/0x820 [rpcrdma] send_reply() assigns its page argument as the first page of ctxt. On error, send_reply() already invokes svc_rdma_put_context(ctxt, 1); which does a put_page() on that very page. No need to do that again as svc_rdma_sendto exits. Fixes: 3e1eeb980822 ("svcrdma: Close connection when a send error occurs") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23svcrdma: Tail iovec leaves an orphaned DMA mappingChuck Lever
The ctxt's count field is overloaded to mean the number of pages in the ctxt->page array and the number of SGEs in the ctxt->sge array. Typically these two numbers are the same. However, when an inline RPC reply is constructed from an xdr_buf with a tail iovec, the head and tail often occupy the same page, but each are DMA mapped independently. In that case, ->count equals the number of pages, but it does not equal the number of SGEs. There's one more SGE, for the tail iovec. Hence there is one more DMA mapping than there are pages in the ctxt->page array. This isn't a real problem until the server's iommu is enabled. Then each RPC reply that has content in that iovec orphans a DMA mapping that consists of real resources. krb5i and krb5p always populate that tail iovec. After a couple million sent krb5i/p RPC replies, the NFS server starts behaving erratically. Reboot is needed to clear the problem. Fixes: 9d11b51ce7c1 ("svcrdma: Fix send_reply() scatter/gather set-up") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-09-23xprtrdma: use complete() instead complete_all()Daniel Wagner
There is only one waiter for the completion, therefore there is no need to use complete_all(). Let's make that clear by using complete() instead of complete_all(). The usage pattern of the completion is: waiter context waker context frwr_op_unmap_sync() reinit_completion() ib_post_send() wait_for_completion() frwr_wc_localinv_wake() complete() Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: linux-nfs@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-23rxrpc: Need to start the resend timer on initial transmissionDavid Howells
When a DATA packet has its initial transmission, we may need to start or adjust the resend timer. Without this we end up relying on being sent a NACK to initiate the resend. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23rxrpc: Use before_eq() and friends to compare serial numbersDavid Howells
before_eq() and friends should be used to compare serial numbers (when not checking for (non)equality) rather than casting to int, subtracting and checking the result. Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23bpf: add helper to invalidate hashDaniel Borkmann
Add a small helper that complements 36bbef52c7eb ("bpf: direct packet write and access for helpers for clsact progs") for invalidating the current skb->hash after mangling on headers via direct packet write. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23bpf: use bpf_get_smp_processor_id_proto instead of raw oneDaniel Borkmann
Same motivation as in commit 80b48c445797 ("bpf: don't use raw processor id in generic helper"), but this time for XDP typed programs. Thus, allow for preemption checks when we have DEBUG_PREEMPT enabled, and otherwise use the raw variant. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23bpf: use skb_to_full_sk helper in bpf_skb_under_cgroupDaniel Borkmann
We need to use skb_to_full_sk() helper introduced in commit bd5eb35f16a9 ("xfrm: take care of request sockets") as otherwise we miss tcp synack messages, since ownership is on request socket and therefore it would miss the sk_fullsock() check. Use skb_to_full_sk() as also done similarly in the bpf_get_cgroup_classid() helper via 2309236c13fe ("cls_cgroup: get sk_classid only from full sockets") fix to not let this fall through. Fixes: 4a482f34afcc ("cgroup: bpf: Add bpf_skb_in_cgroup_proto") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23net: dsa: add port fast ageingVivien Didelot
Today the DSA drivers are in charge of flushing the MAC addresses associated to a port when its STP state changes from Learning or Forwarding, to Disabled or Blocking or Listening. This makes the drivers more complex and hides the generic switch logic. Introduce a new optional port_fast_age operation to dsa_switch_ops, to move this logic to the DSA layer and keep drivers simple. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23net: dsa: add port STP state helperVivien Didelot
Add a void helper to set the STP state of a port, checking first if the required routine is provided by the driver. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23tcp: fix a compile error in DBGUNDO()Eric Dumazet
If DBGUNDO() is enabled (FASTRETRANS_DEBUG > 1), a compile error will happen, since inet6_sk(sk)->daddr became sk->sk_v6_daddr Fixes: efe4208f47f9 ("ipv6: make lookups simpler and faster") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23rxrpc: Should be using ktime_add_ms() not ktime_add_ns()David Howells
ktime_add_ms() should be used to add the resend time (in ms) rather than ktime_add_ns(). Signed-off-by: David Howells <dhowells@redhat.com>