summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2017-02-21SUNRPC: Add a helper function xdr_stream_decode_string_dup()Trond Myklebust
Create a helper function that decodes a xdr string object, allocates a memory buffer and then store it as a NUL terminated string. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-21sunrpc: silence uninitialized variable warningDan Carpenter
kstrtouint() can return a couple different error codes so the check for "ret == -EINVAL" is wrong and static analysis tools correctly complain that we can use "num" without initializing it. It's not super harmful because we check the bounds. But it's also easy enough to fix. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10sunrpc: Allow xprt->ops->timer method to sleepChuck Lever
The transport lock is needed to protect the xprt_adjust_cwnd() call in xs_udp_timer, but it is not necessary for accessing the rq_reply_bytes_recvd or tk_status fields. It is correct to sublimate the lock into UDP's xs_udp_timer method, where it is required. The ->timer method has to take the transport lock if needed, but it can now sleep safely, or even call back into the RPC scheduler. This is more a clean-up than a fix, but the "issue" was introduced by my transport switch patches back in 2005. Fixes: 46c0ee8bc4ad ("RPC: separate xprt_timer implementations") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10xprtrdma: Refactor management of mw_list fieldChuck Lever
Clean up some duplicate code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10xprtrdma: Handle stale connection rejectionChuck Lever
A server rejects a connection attempt with STALE_CONNECTION when a client attempts to connect to a working remote service, but uses a QPN and GUID that corresponds to an old connection that was abandoned. This might occur after a client crashes and restarts. Fix rpcrdma_conn_upcall() to distinguish between a normal rejection and rejection of stale connection parameters. As an additional clean-up, remove the code that retries the connection attempt with different ORD/IRD values. Code audit of other ULP initiators shows no similar special case handling of initiator_depth or responder_resources. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10xprtrdma: Properly recover FRWRs with in-flight FASTREG WRsChuck Lever
Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional double DMA unmap of an FRWR MR when a connection is lost. I see one way this can happen. When a request requires more than one segment or chunk, rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment (MR) in each chunk. Each call posts a FASTREG Work Request to register one MR. Now suppose that the transport connection is lost part-way through marshaling this request. As part of recovering and resetting that req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands all the req's registered FRWRs to the MR recovery thread. But note: FRWR registration is asynchronous. So it's possible that some of these "already registered" FRWRs are fully registered, and some are still waiting for their FASTREG WR to complete. When the connection is lost, the "already registered" frmrs are marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR. But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running. - If the recovery thread runs last, then the frmr is marked FRMR_IS_INVALID, and life continues. - If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR, but the recovery thread has already DMA unmapped that MR. When ->frwr_op_map later re-uses this frmr, it sees it is not marked FRMR_IS_INVALID, and tries to recover it before using it, resulting in a second DMA unmap of the same MR. The fix is to guarantee in-flight FASTREG WRs have flushed before MR recovery runs on those FRWRs. Thus we depend on ro_unmap_safe (called from xprt_rdma_send_request on retransmit, or from xprt_rdma_free) to clean up old registrations as needed. Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10xprtrdma: Shrink send SGEs arrayChuck Lever
We no longer need to accommodate an xdr_buf whose pages start at an offset and cross extra page boundaries. If there are more partial or whole pages to send than there are available SGEs, the marshaling logic is now smart enough to use a Read chunk instead of failing. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10xprtrdma: Reduce required number of send SGEsChuck Lever
The MAX_SEND_SGES check introduced in commit 655fec6987be ("xprtrdma: Use gathered Send for large inline messages") fails for devices that have a small max_sge. Instead of checking for a large fixed maximum number of SGEs, check for a minimum small number. RPC-over-RDMA will switch to using a Read chunk if an xdr_buf has more pages than can fit in the device's max_sge limit. This is considerably better than failing all together to mount the server. This fix supports devices that have as few as three send SGEs available. Reported-by: Selvin Xavier <selvin.xavier@broadcom.com> Reported-by: Devesh Sharma <devesh.sharma@broadcom.com> Reported-by: Honggang Li <honli@redhat.com> Reported-by: Ram Amrani <Ram.Amrani@cavium.com> Fixes: 655fec6987be ("xprtrdma: Use gathered Send for large ...") Cc: stable@vger.kernel.org # v4.9+ Tested-by: Honggang Li <honli@redhat.com> Tested-by: Ram Amrani <Ram.Amrani@cavium.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10xprtrdma: Disable pad optimization by defaultChuck Lever
Commit d5440e27d3e5 ("xprtrdma: Enable pad optimization") made the Linux client omit XDR round-up padding in normal Read and Write chunks so that the client doesn't have to register and invalidate 3-byte memory regions that contain no real data. Unfortunately, my cheery 2014 assessment that this optimization "is supported now by both Linux and Solaris servers" was premature. We've found bugs in Solaris in this area since commit d5440e27d3e5 ("xprtrdma: Enable pad optimization") was merged (SYMLINK is the main offender). So for maximum interoperability, I'm disabling this optimization again. If a CM private message is exchanged when connecting, the client recognizes that the server is Linux, and enables the optimization for that connection. Until now the Solaris server bugs did not impact common operations, and were thus largely benign. Soon, less capable devices on Linux NFS/RDMA clients will make use of Read chunks more often, and these Solaris bugs will prevent interoperation in more cases. Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10xprtrdma: Per-connection pad optimizationChuck Lever
Pad optimization is changed by echoing into /proc/sys/sunrpc/rdma_pad_optimize. This is a global setting, affecting all RPC-over-RDMA connections to all servers. The marshaling code picks up that value and uses it for decisions about how to construct each RPC-over-RDMA frame. Having it change suddenly in mid-operation can result in unexpected failures. And some servers a client mounts might need chunk round-up, while others don't. So instead, copy the pad_optimize setting into each connection's rpcrdma_ia when the transport is created, and use the copy, which can't change during the life of the connection, instead. This also removes a hack: rpcrdma_convert_iovs was using the remote-invalidation-expected flag to predict when it could leave out Write chunk padding. This is because the Linux server handles implicit XDR padding on Write chunks correctly, and only Linux servers can set the connection's remote-invalidation-expected flag. It's more sensible to use the pad optimization setting instead. Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10xprtrdma: Fix Read chunk paddingChuck Lever
When pad optimization is disabled, rpcrdma_convert_iovs still does not add explicit XDR round-up padding to a Read chunk. Commit 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling") incorrectly short-circuited the test for whether round-up padding is needed that appears later in rpcrdma_convert_iovs. However, if this is indeed a regular Read chunk (and not a Position-Zero Read chunk), the tail iovec _always_ contains the chunk's padding, and never anything else. So, it's easy to just skip the tail when padding optimization is enabled, and add the tail in a subsequent Read chunk segment, if disabled. Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09NFSv4: Set the connection timeout to match the lease periodTrond Myklebust
Set the timeout for TCP connections to be 1 lease period to ensure that we don't lose our lease due to a faulty TCP connection. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09SUNRPC: Allow changing of the TCP timeout parameters on the flyTrond Myklebust
When the NFSv4 server tells us the lease period, we usually want to adjust down the timeout parameters on the TCP connection to ensure that we don't miss lease renewals due to a faulty connection. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09SUNRPC: Refactor TCP socket timeout code into a helper functionTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09SUNRPC: Remove unused function rpc_get_timeout()Trond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08sunrpc: use simple_read_from_buffer for reading cache flushKinglong Mee
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08sunrpc: record rpc client pointer in seq->private directlyKinglong Mee
pos in rpc_clnt_iter is useless, drop it and record clnt in seq_private. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08sunrpc: update the comments of sunrpc proc pathKinglong Mee
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08sunrpc: remove dead codes of cr_magic in rpc_credKinglong Mee
Don't found any place using the cr_magic. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08sunrpc: rename NFS_NGROUPS to UNX_NGROUPS for auth unixKinglong Mee
NFS_NGROUPS has been move to sunrpc, rename to UNX_NGROUPS. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08sunrpc/nfs: cleanup procfs/pipefs entry in cache_detailKinglong Mee
Record flush/channel/content entries is useless, remove them. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08sunrpc: error out if register_shrinker failKinglong Mee
register_shrinker may return error when register fail, error out. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-30SUNRPC: two small improvements to rpcauth shrinker.NeilBrown
1/ If we find an entry that is too young to be pruned, return SHRINK_STOP to ensure we don't get called again. This is more correct, and avoids wasting a little CPU time. Prior to 3.12, it can prevent drop_slab() from spinning indefinitely. 2/ Return a precise number from rpcauth_cache_shrink_count(), rather than rounding down to a multiple of 100 (of whatever sysctl_vfs_cache_pressure is). This ensures that when we "echo 3 > /proc/sys/vm/drop_caches", this cache is still purged, even if it has fewer than 100 entires. Neither of these are really important, they just make behaviour more predicatable, which can be helpful when debugging related issues. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-01-28Merge tag 'nfs-for-4.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Stable patches: - NFSv4.1: Fix a deadlock in layoutget - NFSv4 must not bump sequence ids on NFS4ERR_MOVED errors - NFSv4 Fix a regression with OPEN EXCLUSIVE4 mode - Fix a memory leak when removing the SUNRPC module Bugfixes: - Fix a reference leak in _pnfs_return_layout" * tag 'nfs-for-4.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: pNFS: Fix a reference leak in _pnfs_return_layout nfs: Fix "Don't increment lock sequence ID after NFS4ERR_MOVED" SUNRPC: cleanup ida information when removing sunrpc module NFSv4.0: always send mode in SETATTR after EXCLUSIVE4 nfs: Don't increment lock sequence ID after NFS4ERR_MOVED NFSv4.1: Fix a deadlock in layoutget
2017-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) GTP fixes from Andreas Schultz (missing genl module alias, clear IP DF on transmit). 2) Netfilter needs to reflect the fwmark when sending resets, from Pau Espin Pedrol. 3) nftable dump OOPS fix from Liping Zhang. 4) Fix erroneous setting of VIRTIO_NET_HDR_F_DATA_VALID on transmit, from Rolf Neugebauer. 5) Fix build error of ipt_CLUSTERIP when procfs is disabled, from Arnd Bergmann. 6) Fix regression in handling of NETIF_F_SG in harmonize_features(), from Eric Dumazet. 7) Fix RTNL deadlock wrt. lwtunnel module loading, from David Ahern. 8) tcp_fastopen_create_child() needs to setup tp->max_window, from Alexey Kodanev. 9) Missing kmemdup() failure check in ipv6 segment routing code, from Eric Dumazet. 10) Don't execute unix_bind() under the bindlock, otherwise we deadlock with splice. From WANG Cong. 11) ip6_tnl_parse_tlv_enc_lim() potentially reallocates the skb buffer, therefore callers must reload cached header pointers into that skb. Fix from Eric Dumazet. 12) Fix various bugs in legacy IRQ fallback handling in alx driver, from Tobias Regnery. 13) Do not allow lwtunnel drivers to be unloaded while they are referenced by active instances, from Robert Shearman. 14) Fix truncated PHY LED trigger names, from Geert Uytterhoeven. 15) Fix a few regressions from virtio_net XDP support, from John Fastabend and Jakub Kicinski. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (102 commits) ISDN: eicon: silence misleading array-bounds warning net: phy: micrel: add support for KSZ8795 gtp: fix cross netns recv on gtp socket gtp: clear DF bit on GTP packet tx gtp: add genl family modules alias tcp: don't annotate mark on control socket from tcp_v6_send_response() ravb: unmap descriptors when freeing rings virtio_net: reject XDP programs using header adjustment virtio_net: use dev_kfree_skb for small buffer XDP receive r8152: check rx after napi is enabled r8152: re-schedule napi for tx r8152: avoid start_xmit to schedule napi when napi is disabled r8152: avoid start_xmit to call napi_schedule during autosuspend net: dsa: Bring back device detaching in dsa_slave_suspend() net: phy: leds: Fix truncated LED trigger names net: phy: leds: Break dependency of phy.h on phy_led_triggers.h net: phy: leds: Clear phy_num_led_triggers on failure to avoid crash net-next: ethernet: mediatek: change the compatible string Documentation: devicetree: change the mediatek ethernet compatible string bnxt_en: Fix RTNL lock usage on bnxt_get_port_module_status(). ...
2017-01-27tcp: don't annotate mark on control socket from tcp_v6_send_response()Pablo Neira
Unlike ipv4, this control socket is shared by all cpus so we cannot use it as scratchpad area to annotate the mark that we pass to ip6_xmit(). Add a new parameter to ip6_xmit() to indicate the mark. The SCTP socket family caches the flowi6 structure in the sctp_transport structure, so we cannot use to carry the mark unless we later on reset it back, which I discarded since it looks ugly to me. Fixes: bf99b4ded5f8 ("tcp: fix mark propagation with fwmark_reflect enabled") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains a large batch with Netfilter fixes for your net tree, they are: 1) Two patches to solve conntrack garbage collector cpu hogging, one to remove GC_MAX_EVICTS and another to look at the ratio (scanned entries vs. evicted entries) to make a decision on whether to reduce or not the scanning interval. From Florian Westphal. 2) Two patches to fix incorrect set element counting if NLM_F_EXCL is is not set. Moreover, don't decrenent set->nelems from abort patch if -ENFILE which leaks a spare slot in the set. This includes a patch to deconstify the set walk callback to update set->ndeact. 3) Two fixes for the fwmark_reflect sysctl feature: Propagate mark to reply packets both from nf_reject and local stack, from Pau Espin Pedrol. 4) Fix incorrect handling of loopback traffic in rpfilter and nf_tables fib expression, from Liping Zhang. 5) Fix oops on stateful objects netlink dump, when no filter is specified. Also from Liping Zhang. 6) Fix a build error if proc is not available in ipt_CLUSTERIP, related to fix that was applied in the previous batch for net. From Arnd Bergmann. 7) Fix lack of string validation in table, chain, set and stateful object names in nf_tables, from Liping Zhang. Moreover, restrict maximum log prefix length to 127 bytes, otherwise explicitly bail out. 8) Two patches to fix spelling and typos in nf_tables uapi header file and Kconfig, patches from Alexander Alemayhu and William Breathitt Gray. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25Merge tag 'batadv-net-for-davem-20170125' of git://git.open-mesh.org/linux-mergeDavid S. Miller
Simon Wunderlich says: ==================== Here is a batman-adv bugfix: - fix reference count handling on fragmentation error, by Sven Eckelmann ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25net: dsa: Bring back device detaching in dsa_slave_suspend()Florian Fainelli
Commit 448b4482c671 ("net: dsa: Add lockdep class to tx queues to avoid lockdep splat") removed the netif_device_detach() call done in dsa_slave_suspend() which is necessary, and paired with a corresponding netif_device_attach(), bring it back. Fixes: 448b4482c671 ("net: dsa: Add lockdep class to tx queues to avoid lockdep splat") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25tcp: correct memory barrier usage in tcp_check_space()Jason Baron
sock_reset_flag() maps to __clear_bit() not the atomic version clear_bit(). Thus, we need smp_mb(), smp_mb__after_atomic() is not sufficient. Fixes: 3c7151275c0c ("tcp: add memory barriers to write space paths") Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25sctp: sctp gso should set feature with NETIF_F_SG when calling skb_segmentXin Long
Now sctp gso puts segments into skb's frag_list, then processes these segments in skb_segment. But skb_segment handles them only when gs is enabled, as it's in the same branch with skb's frags. Although almost all the NICs support sg other than some old ones, but since commit 1e16aa3ddf86 ("net: gso: use feature flag argument in all protocol gso handlers"), features &= skb->dev->hw_enc_features, and xfrm_output_gso call skb_segment with features = 0, which means sctp gso would call skb_segment with sg = 0, and skb_segment would not work as expected. This patch is to fix it by setting features param with NETIF_F_SG when calling skb_segment so that it can go the right branch to process the skb's frag_list. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25sctp: sctp_addr_id2transport should verify the addr before looking up assocXin Long
sctp_addr_id2transport is a function for sockopt to look up assoc by address. As the address is from userspace, it can be a v4-mapped v6 address. But in sctp protocol stack, it always handles a v4-mapped v6 address as a v4 address. So it's necessary to convert it to a v4 address before looking up assoc by address. This patch is to fix it by calling sctp_verify_addr in which it can do this conversion before calling sctp_endpoint_lookup_assoc, just like what sctp_sendmsg and __sctp_connect do for the address from users. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24lwtunnel: Fix oops on state free after encap module unloadRobert Shearman
When attempting to free lwtunnel state after the module for the encap has been unloaded an oops occurs: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: lwtstate_free+0x18/0x40 [..] task: ffff88003e372380 task.stack: ffffc900001fc000 RIP: 0010:lwtstate_free+0x18/0x40 RSP: 0018:ffff88003fd83e88 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88002bbb3380 RCX: ffff88000c91a300 [..] Call Trace: <IRQ> free_fib_info_rcu+0x195/0x1a0 ? rt_fibinfo_free+0x50/0x50 rcu_process_callbacks+0x2d3/0x850 ? rcu_process_callbacks+0x296/0x850 __do_softirq+0xe4/0x4cb irq_exit+0xb0/0xc0 smp_apic_timer_interrupt+0x3d/0x50 apic_timer_interrupt+0x93/0xa0 [..] Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8 The problem is after the module for the encap can be unloaded the corresponding ops is removed and is thus NULL here. Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so grab the module reference for the ops on creating lwtunnel state and of course release the reference when freeing the state. Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation") Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: Specify the owning module for lwtunnel opsRobert Shearman
Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so specify the owning module for all lwtunnel ops. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: fix cleanup at module unloadParthasarathy Bhuvaragan
In tipc_server_stop(), we iterate over the connections with limiting factor as server's idr_in_use. We ignore the fact that this variable is decremented in tipc_close_conn(), leading to premature exit. In this commit, we iterate until the we have no connections left. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Tested-by: John Thompson <thompa.atl@gmail.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: ignore requests when the connection state is not CONNECTEDParthasarathy Bhuvaragan
In tipc_conn_sendmsg(), we first queue the request to the outqueue followed by the connection state check. If the connection is not connected, we should not queue this message. In this commit, we reject the messages if the connection state is not CF_CONNECTED. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Tested-by: John Thompson <thompa.atl@gmail.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: fix nametbl_lock soft lockup at module exitParthasarathy Bhuvaragan
Commit 333f796235a527 ("tipc: fix a race condition leading to subscriber refcnt bug") reveals a soft lockup while acquiring nametbl_lock. Before commit 333f796235a527, we call tipc_conn_shutdown() from tipc_close_conn() in the context of tipc_topsrv_stop(). In that context, we are allowed to grab the nametbl_lock. Commit 333f796235a527, moved tipc_conn_release (renamed from tipc_conn_shutdown) to the connection refcount cleanup. This allows either tipc_nametbl_withdraw() or tipc_topsrv_stop() to the cleanup. Since tipc_exit_net() first calls tipc_topsrv_stop() and then tipc_nametble_withdraw() increases the chances for the later to perform the connection cleanup. The soft lockup occurs in the call chain of tipc_nametbl_withdraw(), when it performs the tipc_conn_kref_release() as it tries to grab nametbl_lock again while holding it already. tipc_nametbl_withdraw() grabs nametbl_lock tipc_nametbl_remove_publ() tipc_subscrp_report_overlap() tipc_subscrp_send_event() tipc_conn_sendmsg() << if (con->flags != CF_CONNECTED) we do conn_put(), triggering the cleanup as refcount=0. >> tipc_conn_kref_release tipc_sock_release tipc_conn_release tipc_subscrb_delete tipc_subscrp_delete tipc_nametbl_unsubscribe << Soft Lockup >> The previous changes in this series fixes the race conditions fixed by commit 333f796235a527. Hence we can now revert the commit. Fixes: 333f796235a52727 ("tipc: fix a race condition leading to subscriber refcnt bug") Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: fix connection refcount errorParthasarathy Bhuvaragan
Until now, the generic server framework maintains the connection id's per subscriber in server's conn_idr. At tipc_close_conn, we remove the connection id from the server list, but the connection is valid until we call the refcount cleanup. Hence we have a window where the server allocates the same connection to an new subscriber leading to inconsistent reference count. We have another refcount warning we grab the refcount in tipc_conn_lookup() for connections with flag with CF_CONNECTED not set. This usually occurs at shutdown when the we stop the topology server and withdraw TIPC_CFG_SRV publication thereby triggering a withdraw message to subscribers. In this commit, we: 1. remove the connection from the server list at recount cleanup. 2. grab the refcount for a connection only if CF_CONNECTED is set. Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: add subscription refcount to avoid invalid deleteParthasarathy Bhuvaragan
Until now, the subscribers keep track of the subscriptions using reference count at subscriber level. At subscription cancel or subscriber delete, we delete the subscription only if the timer was pending for the subscription. This approach is incorrect as: 1. del_timer() is not SMP safe, if on CPU0 the check for pending timer returns true but CPU1 might schedule the timer callback thereby deleting the subscription. Thus when CPU0 is scheduled, it deletes an invalid subscription. 2. We export tipc_subscrp_report_overlap(), which accesses the subscription pointer multiple times. Meanwhile the subscription timer can expire thereby freeing the subscription and we might continue to access the subscription pointer leading to memory violations. In this commit, we introduce subscription refcount to avoid deleting an invalid subscription. Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24tipc: fix nametbl_lock soft lockup at node/link eventsParthasarathy Bhuvaragan
We trigger a soft lockup as we grab nametbl_lock twice if the node has a pending node up/down or link up/down event while: - we process an incoming named message in tipc_named_rcv() and perform an tipc_update_nametbl(). - we have pending backlog items in the name distributor queue during a nametable update using tipc_nametbl_publish() or tipc_nametbl_withdraw(). The following are the call chain associated: tipc_named_rcv() Grabs nametbl_lock tipc_update_nametbl() (publish/withdraw) tipc_node_subscribe()/unsubscribe() tipc_node_write_unlock() << lockup occurs if an outstanding node/link event exits, as we grabs nametbl_lock again >> tipc_nametbl_withdraw() Grab nametbl_lock tipc_named_process_backlog() tipc_update_nametbl() << rest as above >> The function tipc_node_write_unlock(), in addition to releasing the lock processes the outstanding node/link up/down events. To do this, we need to grab the nametbl_lock again leading to the lockup. In this commit we fix the soft lockup by introducing a fast variant of node_unlock(), where we just release the lock. We adapt the node_subscribe()/node_unsubscribe() to use the fast variants. Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24netfilter: nf_tables: bump set->ndeact on set flushPablo Neira Ayuso
Add missing set->ndeact update on each deactivated element from the set flush path. Otherwise, sets with fixed size break after flush since accounting breaks. # nft add set x y { type ipv4_addr\; size 2\; } # nft add element x y { 1.1.1.1 } # nft add element x y { 1.1.1.2 } # nft flush set x y # nft add element x y { 1.1.1.1 } <cmdline>:1:1-28: Error: Could not process rule: Too many open files in system Fixes: 8411b6442e59 ("netfilter: nf_tables: support for set flushing") Reported-by: Elise Lennion <elise.lennion@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24netfilter: nf_tables: deconstify walk callback functionPablo Neira Ayuso
The flush operation needs to modify set and element objects, so let's deconstify this. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24netfilter: nf_tables: fix set->nelems counting with no NLM_F_EXCLPablo Neira Ayuso
If the element exists and no NLM_F_EXCL is specified, do not bump set->nelems, otherwise we leak one set element slot. This problem amplifies if the set is full since the abort path always decrements the counter for the -ENFILE case too, giving one spare extra slot. Fix this by moving set->nelems update to nft_add_set_elem() after successful element insertion. Moreover, remove the element if the set is full so there is no need to rely on the abort path to undo things anymore. Fixes: c016c7e45ddf ("netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24netfilter: nft_log: restrict the log prefix length to 127Liping Zhang
First, log prefix will be truncated to NF_LOG_PREFIXLEN-1, i.e. 127, at nf_log_packet(), so the extra part is useless. Second, after adding a log rule with a very very long prefix, we will fail to dump the nft rules after this _special_ one, but acctually, they do exist. For example: # name_65000=$(printf "%0.sQ" {1..65000}) # nft add rule filter output log prefix "$name_65000" # nft add rule filter output counter # nft add rule filter output counter # nft list chain filter output table ip filter { chain output { type filter hook output priority 0; policy accept; } } So now, restrict the log prefix length to NF_LOG_PREFIXLEN-1. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-01-24SUNRPC: cleanup ida information when removing sunrpc moduleKinglong Mee
After removing sunrpc module, I get many kmemleak information as, unreferenced object 0xffff88003316b1e0 (size 544): comm "gssproxy", pid 2148, jiffies 4294794465 (age 4200.081s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffffb0cfb58a>] kmemleak_alloc+0x4a/0xa0 [<ffffffffb03507fe>] kmem_cache_alloc+0x15e/0x1f0 [<ffffffffb0639baa>] ida_pre_get+0xaa/0x150 [<ffffffffb0639cfd>] ida_simple_get+0xad/0x180 [<ffffffffc06054fb>] nlmsvc_lookup_host+0x4ab/0x7f0 [lockd] [<ffffffffc0605e1d>] lockd+0x4d/0x270 [lockd] [<ffffffffc06061e5>] param_set_timeout+0x55/0x100 [lockd] [<ffffffffc06cba24>] svc_defer+0x114/0x3f0 [sunrpc] [<ffffffffc06cbbe7>] svc_defer+0x2d7/0x3f0 [sunrpc] [<ffffffffc06c71da>] rpc_show_info+0x8a/0x110 [sunrpc] [<ffffffffb044a33f>] proc_reg_write+0x7f/0xc0 [<ffffffffb038e41f>] __vfs_write+0xdf/0x3c0 [<ffffffffb0390f1f>] vfs_write+0xef/0x240 [<ffffffffb0392fbd>] SyS_write+0xad/0x130 [<ffffffffb0d06c37>] entry_SYSCALL_64_fastpath+0x1a/0xa9 [<ffffffffffffffff>] 0xffffffffffffffff I found, the ida information (dynamic memory) isn't cleanup. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Fixes: 2f048db4680a ("SUNRPC: Add an identifier for struct rpc_clnt") Cc: stable@vger.kernel.org # v3.12+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-24ipv6: fix ip6_tnl_parse_tlv_enc_lim()Eric Dumazet
This function suffers from multiple issues. First one is that pskb_may_pull() may reallocate skb->head, so the 'raw' pointer needs either to be reloaded or not used at all. Second issue is that NEXTHDR_DEST handling does not validate that the options are present in skb->data, so we might read garbage or access non existent memory. With help from Willem de Bruijn. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24ip6_tunnel: must reload ipv6h in ip6ip6_tnl_xmit()Eric Dumazet
Since ip6_tnl_parse_tlv_enc_lim() can call pskb_may_pull(), we must reload any pointer that was related to skb->head (or skb->data), or risk use after free. Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dmitry Kozlov <xeb@mail.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24af_unix: move unix_mknod() out of bindlockWANG Cong
Dmitry reported a deadlock scenario: unix_bind() path: u->bindlock ==> sb_writer do_splice() path: sb_writer ==> pipe->mutex ==> u->bindlock In the unix_bind() code path, unix_mknod() does not have to be done with u->bindlock held, since it is a pure fs operation, so we can just move unix_mknod() out. Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24mac80211: don't try to sleep in rate_control_rate_init()Johannes Berg
In my previous patch, I missed that rate_control_rate_init() is called from some places that cannot sleep, so it cannot call ieee80211_recalc_min_chandef(). Remove that call for now to fix the context bug, we'll have to find a different way to fix the minimum channel width issue. Fixes: 96aa2e7cf126 ("mac80211: calculate min channel width correctly") Reported-by: Xiaolong Ye (via lkp-robot) <xiaolong.ye@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-01-23netfilter: nf_tables: validate the name size when possibleLiping Zhang
Currently, if the user add a stateful object with the name size exceed NFT_OBJ_MAXNAMELEN - 1 (i.e. 31), we truncate it down to 31 silently. This is not friendly, furthermore, this will cause duplicated stateful objects when the first 31 characters of the name is same. So limit the stateful object's name size to NFT_OBJ_MAXNAMELEN - 1. After apply this patch, error message will be printed out like this: # name_32=$(printf "%0.sQ" {1..32}) # nft add counter filter $name_32 <cmdline>:1:1-52: Error: Could not process rule: Numerical result out of range add counter filter QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Also this patch cleans up the codes which missing the name size limit validation in nftables. Fixes: e50092404c1b ("netfilter: nf_tables: add stateful objects") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>