summaryrefslogtreecommitdiff
path: root/net/unix/af_unix.c
AgeCommit message (Collapse)Author
2017-05-20af_unix: Guard against other == sk in unix_dgram_sendmsgRainer Weikusat
commit a5527dda344fff0514b7989ef7a755729769daa1 upstream. The unix_dgram_sendmsg routine use the following test if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) { to determine if sk and other are in an n:1 association (either established via connect or by using sendto to send messages to an unrelated socket identified by address). This isn't correct as the specified address could have been bound to the sending socket itself or because this socket could have been connected to itself by the time of the unix_peer_get but disconnected before the unix_state_lock(other). In both cases, the if-block would be entered despite other == sk which might either block the sender unintentionally or lead to trying to unlock the same spin lock twice for a non-blocking send. Add a other != sk check to guard against this. Fixes: 7d267278a9ec ("unix: avoid use-after-free in ep_remove_wait_queue") Reported-By: Philipp Hahn <pmhahn@pmhahn.de> Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Tested-by: Philipp Hahn <pmhahn@pmhahn.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-18Revert "af_unix: Fix splice-bind deadlock"Linus Torvalds
commit 38f7bd94a97b542de86a2be9229289717e33a7a4 upstream. This reverts commit c845acb324aa85a39650a14e7696982ceea75dc1. It turns out that it just replaces one deadlock with another one: we can still get the wrong lock ordering with the readlock due to overlayfs calling back into the filesystem layer and still taking the vfs locks after the readlock. The proper solution ends up being to just split the readlock into two pieces: the bind lock (taken *outside* the vfs locks) and the IO lock (taken *inside* the filesystem locks). The two locks are independent anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-07-11af_unix: Fix splice-bind deadlockRainer Weikusat
[ Upstream commit c845acb324aa85a39650a14e7696982ceea75dc1 ] On 2015/11/06, Dmitry Vyukov reported a deadlock involving the splice system call and AF_UNIX sockets, http://lists.openwall.net/netdev/2015/11/06/24 The situation was analyzed as (a while ago) A: socketpair() B: splice() from a pipe to /mnt/regular_file does sb_start_write() on /mnt C: try to freeze /mnt wait for B to finish with /mnt A: bind() try to bind our socket to /mnt/new_socket_name lock our socket, see it not bound yet decide that it needs to create something in /mnt try to do sb_start_write() on /mnt, block (it's waiting for C). D: splice() from the same pipe to our socket lock the pipe, see that socket is connected try to lock the socket, block waiting for A B: get around to actually feeding a chunk from pipe to file, try to lock the pipe. Deadlock. on 2015/11/10 by Al Viro, http://lists.openwall.net/netdev/2015/11/10/4 The patch fixes this by removing the kern_path_create related code from unix_mknod and executing it as part of unix_bind prior acquiring the readlock of the socket in question. This means that A (as used above) will sb_start_write on /mnt before it acquires the readlock, hence, it won't indirectly block B which first did a sb_start_write and then waited for a thread trying to acquire the readlock. Consequently, A being blocked by C waiting for B won't cause a deadlock anymore (effectively, both A and B acquire two locks in opposite order in the situation described above). Dmitry Vyukov(<dvyukov@google.com>) tested the original patch. Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-07-11VFS: AF_UNIX sockets should call mknod on the top layer onlyDavid Howells
[ Upstream commit ee8ac4d61c2cf43bdd427e70db97ac330e61570d ] AF_UNIX sockets should call mknod on the top layer only and should not attempt to modify the lower layer in a layered filesystem such as overlayfs. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-07-11VFS: net/unix: d_backing_inode() annotationsDavid Howells
[ Upstream commit a25b376bded1ba7fd1d455e140d723b7de2e343c ] places where we are dealing with S_ISSOCK file creation/lookups. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-02-15unix: properly account for FDs passed over unix socketswilly tarreau
[ Upstream commit 712f4aad406bb1ed67f3f98d04c044191f0ff593 ] It is possible for a process to allocate and accumulate far more FDs than the process' limit by sending them over a unix socket then closing them to keep the process' fd count low. This change addresses this problem by keeping track of the number of FDs in flight per user and preventing non-privileged processes from having more FDs in flight than their configured FD limit. Reported-by: socketpair@gmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Mitigates: CVE-2013-4312 (Linux 2.0+) Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-01-15unix: avoid use-after-free in ep_remove_wait_queueRainer Weikusat
[ Upstream commit 7d267278a9ece963d77eefec61630223fce08c6c ] Rainer Weikusat <rweikusat@mobileactivedefense.com> writes: An AF_UNIX datagram socket being the client in an n:1 association with some server socket is only allowed to send messages to the server if the receive queue of this socket contains at most sk_max_ack_backlog datagrams. This implies that prospective writers might be forced to go to sleep despite none of the message presently enqueued on the server receive queue were sent by them. In order to ensure that these will be woken up once space becomes again available, the present unix_dgram_poll routine does a second sock_poll_wait call with the peer_wait wait queue of the server socket as queue argument (unix_dgram_recvmsg does a wake up on this queue after a datagram was received). This is inherently problematic because the server socket is only guaranteed to remain alive for as long as the client still holds a reference to it. In case the connection is dissolved via connect or by the dead peer detection logic in unix_dgram_sendmsg, the server socket may be freed despite "the polling mechanism" (in particular, epoll) still has a pointer to the corresponding peer_wait queue. There's no way to forcibly deregister a wait queue with epoll. Based on an idea by Jason Baron, the patch below changes the code such that a wait_queue_t belonging to the client socket is enqueued on the peer_wait queue of the server whenever the peer receive queue full condition is detected by either a sendmsg or a poll. A wake up on the peer queue is then relayed to the ordinary wait queue of the client socket via wake function. The connection to the peer wait queue is again dissolved if either a wake up is about to be relayed or the client socket reconnects or a dead peer is detected or the client socket is itself closed. This enables removing the second sock_poll_wait from unix_dgram_poll, thus avoiding the use-after-free, while still ensuring that no blocked writer sleeps forever. Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets") Reviewed-by: Jason Baron <jbaron@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2016-01-15af_unix: Revert 'lock_interruptible' in stream receive codeRainer Weikusat
[ Upstream commit 3822b5c2fc62e3de8a0f33806ff279fb7df92432 ] With b3ca9b02b00704053a38bfe4c31dbbb9c13595d0, the AF_UNIX SOCK_STREAM receive code was changed from using mutex_lock(&u->readlock) to mutex_lock_interruptible(&u->readlock) to prevent signals from being delayed for an indefinite time if a thread sleeping on the mutex happened to be selected for handling the signal. But this was never a problem with the stream receive code (as opposed to its datagram counterpart) as that never went to sleep waiting for new messages with the mutex held and thus, wouldn't cause secondary readers to block on the mutex waiting for the sleeping primary reader. As the interruptible locking makes the code more complicated in exchange for no benefit, change it back to using mutex_lock. Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-11-13net/unix: fix logic about sk_peek_offsetAndrey Vagin
[ Upstream commit e9193d60d363e4dff75ff6d43a48f22be26d59c7 ] Now send with MSG_PEEK can return data from multiple SKBs. Unfortunately we take into account the peek offset for each skb, that is wrong. We need to apply the peek offset only once. In addition, the peek offset should be used only if MSG_PEEK is set. Cc: "David S. Miller" <davem@davemloft.net> (maintainer:NETWORKING Cc: Eric Dumazet <edumazet@google.com> (commit_signer:1/14=7%) Cc: Aaron Conole <aconole@bytheb.org> Fixes: 9f389e35674f ("af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag") Signed-off-by: Andrey Vagin <avagin@openvz.org> Tested-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-11-13af_unix: return data from multiple SKBs on recv() with MSG_PEEK flagAaron Conole
[ Upstream commit 9f389e35674f5b086edd70ed524ca0f287259725 ] AF_UNIX sockets now return multiple skbs from recv() when MSG_PEEK flag is set. This is referenced in kernel bugzilla #12323 @ https://bugzilla.kernel.org/show_bug.cgi?id=12323 As described both in the BZ and lkml thread @ http://lkml.org/lkml/2008/1/8/444 calling recv() with MSG_PEEK on an AF_UNIX socket only reads a single skb, where the desired effect is to return as much skb data has been queued, until hitting the recv buffer size (whichever comes first). The modified MSG_PEEK path will now move to the next skb in the tree and jump to the again: label, rather than following the natural loop structure. This requires duplicating some of the loop head actions. This was tested using the python socketpair python code attached to the bugzilla issue. Signed-off-by: Aaron Conole <aconole@bytheb.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-06-15unix/caif: sk_socket can disappear when state is unlockedMark Salyzyn
[ Upstream commit b48732e4a48d80ed4a14812f0bab09560846514e ] got a rare NULL pointer dereference in clear_bit Signed-off-by: Mark Salyzyn <salyzyn@android.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> ---- v2: switch to sock_flag(sk, SOCK_DEAD) and added net/caif/caif_socket.c v3: return -ECONNRESET in upstream caller of wait function for SOCK_DEAD Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2014-06-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov. 2) Multiqueue support in xen-netback and xen-netfront, from Andrew J Benniston. 3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn Mork. 4) BPF now has a "random" opcode, from Chema Gonzalez. 5) Add more BPF documentation and improve test framework, from Daniel Borkmann. 6) Support TCP fastopen over ipv6, from Daniel Lee. 7) Add software TSO helper functions and use them to support software TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia. 8) Support software TSO in fec driver too, from Nimrod Andy. 9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli. 10) Handle broadcasts more gracefully over macvlan when there are large numbers of interfaces configured, from Herbert Xu. 11) Allow more control over fwmark used for non-socket based responses, from Lorenzo Colitti. 12) Do TCP congestion window limiting based upon measurements, from Neal Cardwell. 13) Support busy polling in SCTP, from Neal Horman. 14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru. 15) Bridge promisc mode handling improvements from Vlad Yasevich. 16) Don't use inetpeer entries to implement ID generation any more, it performs poorly, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits) rtnetlink: fix userspace API breakage for iproute2 < v3.9.0 tcp: fixing TLP's FIN recovery net: fec: Add software TSO support net: fec: Add Scatter/gather support net: fec: Increase buffer descriptor entry number net: fec: Factorize feature setting net: fec: Enable IP header hardware checksum net: fec: Factorize the .xmit transmit function bridge: fix compile error when compiling without IPv6 support bridge: fix smatch warning / potential null pointer dereference via-rhine: fix full-duplex with autoneg disable bnx2x: Enlarge the dorq threshold for VFs bnx2x: Check for UNDI in uncommon branch bnx2x: Fix 1G-baseT link bnx2x: Fix link for KR with swapped polarity lane sctp: Fix sk_ack_backlog wrap-around problem net/core: Add VF link state control policy net/fsl: xgmac_mdio is dependent on OF_MDIO net/fsl: Make xgmac_mdio read error message useful net_sched: drr: warn when qdisc is not work conserving ...
2014-05-16net: unix: Align send data_len up to PAGE_SIZEKirill Tkhai
Using whole of allocated pages reduces requested skb->data size. This is just a little more thriftily allocation. netperf does not show difference with the current performance. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-18arch: Mass conversion of smp_mb__*()Peter Zijlstra
Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-11net: Fix use after free by removing length arg from sk_data_ready callbacks.David S. Miller
Several spots in the kernel perform a sequence like: skb_queue_tail(&sk->s_receive_queue, skb); sk->sk_data_ready(sk, skb->len); But at the moment we place the SKB onto the socket receive queue it can be consumed and freed up. So this skb->len access is potentially to freed up memory. Furthermore, the skb->len can be modified by the consumer so it is possible that the value isn't accurate. And finally, no actual implementation of this callback actually uses the length argument. And since nobody actually cared about it's value, lots of call sites pass arbitrary values in such as '0' and even '1'. So just remove the length argument from the callback, that way there is no confusion whatsoever and all of these use-after-free cases get fixed as a side effect. Based upon a patch by Eric Dumazet and his suggestion to audit this issue tree-wide. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-26net: unix: non blocking recvmsg() should not return -EINTREric Dumazet
Some applications didn't expect recvmsg() on a non blocking socket could return -EINTR. This possibility was added as a side effect of commit b3ca9b02b00704 ("net: fix multithreaded signal handling in unix recv routines"). To hit this bug, you need to be a bit unlucky, as the u->readlock mutex is usually held for very small periods. Fixes: b3ca9b02b00704 ("net: fix multithreaded signal handling in unix recv routines") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06net: unix socket code abuses csum_partialAnton Blanchard
The unix socket code is using the result of csum_partial to hash into a lookup table: unix_hash_fold(csum_partial(sunaddr, len, 0)); csum_partial is only guaranteed to produce something that can be folded into a checksum, as its prototype explains: * returns a 32-bit number suitable for feeding into itself * or csum_tcpudp_magic The 32bit value should not be used directly. Depending on the alignment, the ppc64 csum_partial will return different 32bit partial checksums that will fold into the same 16bit checksum. This difference causes the following testcase (courtesy of Gustavo) to sometimes fail: #include <sys/socket.h> #include <stdio.h> int main() { int fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0); int i = 1; setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &i, 4); struct sockaddr addr; addr.sa_family = AF_LOCAL; bind(fd, &addr, 2); listen(fd, 128); struct sockaddr_storage ss; socklen_t sslen = (socklen_t)sizeof(ss); getsockname(fd, (struct sockaddr*)&ss, &sslen); fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0); if (connect(fd, (struct sockaddr*)&ss, sslen) == -1){ perror(NULL); return 1; } printf("OK\n"); return 0; } As suggested by davem, fix this by using csum_fold to fold the partial 32bit checksum into a 16bit checksum before using it. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-18net: add build-time checks for msg->msg_name sizeSteffen Hurrle
This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg handler msg_name and msg_namelen logic"). DECLARE_SOCKADDR validates that the structure we use for writing the name information to is not larger than the buffer which is reserved for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR consistently in sendmsg code paths. Signed-off-by: Steffen Hurrle <steffen@hurrle.net> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/intel/i40e/i40e_main.c drivers/net/macvtap.c Both minor merge hassles, simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17net: unix: allow bind to fail on mutex lockSasha Levin
This is similar to the set_peek_off patch where calling bind while the socket is stuck in unix_dgram_recvmsg() will block and cause a hung task spew after a while. This is also the last place that did a straightforward mutex_lock(), so there shouldn't be any more of these patches. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10net: unix: allow set_peek_off to failSasha Levin
unix_dgram_recvmsg() will hold the readlock of the socket until recv is complete. In the same time, we may try to setsockopt(SO_PEEK_OFF) which will hang until unix_dgram_recvmsg() will complete (which can take a while) without allowing us to break out of it, triggering a hung task spew. Instead, allow set_peek_off to fail, this way userspace will not hang. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06unix: convert printks to pr_<level>wangweidong
use pr_<level> instead of printk(LEVEL) Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-20net: rework recvmsg handler msg_name and msg_namelen logicHannes Frederic Sowa
This patch now always passes msg->msg_namelen as 0. recvmsg handlers must set msg_namelen to the proper size <= sizeof(struct sockaddr_storage) to return msg_name to the user. This prevents numerous uninitialized memory leaks we had in the recvmsg handlers and makes it harder for new code to accidentally leak uninitialized memory. Optimize for the case recvfrom is called with NULL as address. We don't need to copy the address at all, so set it to NULL before invoking the recvmsg handler. We can do so, because all the recvmsg handlers must cope with the case a plain read() is called on them. read() also sets msg_name to NULL. Also document these changes in include/linux/net.h as suggested by David Miller. Changes since RFC: Set msg->msg_name = NULL if user specified a NULL in msg_name but had a non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't affect sendto as it would bail out earlier while trying to copy-in the address. It also more naturally reflects the logic by the callers of verify_iovec. With this change in place I could remove " if (!uaddr || msg_sys->msg_namelen == 0) msg->msg_name = NULL ". This change does not alter the user visible error logic as we ignore msg_namelen as long as msg_name is NULL. Also remove two unnecessary curly brackets in ___sys_recvmsg and change comments to netdev style. Cc: David Miller <davem@davemloft.net> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-19net: unix: inherit SOCK_PASS{CRED, SEC} flags from socket to fix raceDaniel Borkmann
In the case of credentials passing in unix stream sockets (dgram sockets seem not affected), we get a rather sparse race after commit 16e5726 ("af_unix: dont send SCM_CREDENTIALS by default"). We have a stream server on receiver side that requests credential passing from senders (e.g. nc -U). Since we need to set SO_PASSCRED on each spawned/accepted socket on server side to 1 first (as it's not inherited), it can happen that in the time between accept() and setsockopt() we get interrupted, the sender is being scheduled and continues with passing data to our receiver. At that time SO_PASSCRED is neither set on sender nor receiver side, hence in cmsg's SCM_CREDENTIALS we get eventually pid:0, uid:65534, gid:65534 (== overflow{u,g}id) instead of what we actually would like to see. On the sender side, here nc -U, the tests in maybe_add_creds() invoked through unix_stream_sendmsg() would fail, as at that exact time, as mentioned, the sender has neither SO_PASSCRED on his side nor sees it on the server side, and we have a valid 'other' socket in place. Thus, sender believes it would just look like a normal connection, not needing/requesting SO_PASSCRED at that time. As reverting 16e5726 would not be an option due to the significant performance regression reported when having creds always passed, one way/trade-off to prevent that would be to set SO_PASSCRED on the listener socket and allow inheriting these flags to the spawned socket on server side in accept(). It seems also logical to do so if we'd tell the listener socket to pass those flags onwards, and would fix the race. Before, strace: recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"blub\n", 4096}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=0, uid=65534, gid=65534}}, msg_flags=0}, 0) = 5 After, strace: recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"blub\n", 4096}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=11580, uid=1000, gid=1000}}, msg_flags=0}, 0) = 5 Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-11af_unix: fix bug on large send()Eric Dumazet
commit e370a723632 ("af_unix: improve STREAM behavior with fragmented memory") added a bug on large send() because the skb_copy_datagram_from_iovec() call always start from the beginning of iovec. We must instead use the @sent variable to properly skip the already processed part. Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-10net: attempt high order allocations in sock_alloc_send_pskb()Eric Dumazet
Adding paged frags skbs to af_unix sockets introduced a performance regression on large sends because of additional page allocations, even if each skb could carry at least 100% more payload than before. We can instruct sock_alloc_send_pskb() to attempt high order allocations. Most of the time, it does a single page allocation instead of 8. I added an additional parameter to sock_alloc_send_pskb() to let other users to opt-in for this new feature on followup patches. Tested: Before patch : $ netperf -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 2304 212992 212992 10.00 46861.15 After patch : $ netperf -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 2304 212992 212992 10.00 57981.11 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-10af_unix: improve STREAM behavior with fragmented memoryEric Dumazet
unix_stream_sendmsg() currently uses order-2 allocations, and we had numerous reports this can fail. The __GFP_REPEAT flag present in sock_alloc_send_pskb() is not helping. This patch extends the work done in commit eb6a24816b247c ("af_unix: reduce high order page allocations) for datagram sockets. This opens the possibility of zero copy IO (splice() and friends) The trick is to not use skb_pull() anymore in recvmsg() path, and instead add a @consumed field in UNIXCB() to track amount of already read payload in the skb. There is a performance regression for large sends because of extra page allocations that will be addressed in a follow-up patch, allowing sock_alloc_send_pskb() to attempt high order page allocations. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-12af_unix: use freezable blocking calls in readColin Cross
Avoid waking up every thread sleeping in read call on an AF_UNIX socket during suspend and resume by calling a freezable blocking call. Previous patches modified the freezer to avoid sending wakeups to threads that are blocked in freezable blocking calls. This call was selected to be converted to a freezable call because it doesn't hold any locks or release any resources when interrupted that might be needed by another freezing task or a kernel driver during suspend, and is a common site where idle userspace tasks are blocked. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-04-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c drivers/net/ethernet/emulex/benet/be.h include/net/tcp.h net/mac802154/mac802154.h Most conflicts were minor overlapping stuff. The be2net driver brought in some fixes that added __vlan_put_tag calls, which in net-next take an additional argument. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-30unix/stream: fix peeking with an offset larger than data in queueBenjamin Poirier
Currently, peeking on a unix stream socket with an offset larger than len of the data in the sk receive queue returns immediately with bogus data. This patch fixes this so that the behavior is the same as peeking with no offset on an empty queue: the caller blocks. Signed-off-by: Benjamin Poirier <bpoirier@suse.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/emulex/benet/be_main.c drivers/net/ethernet/intel/igb/igb_main.c drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c include/net/scm.h net/batman-adv/routing.c net/ipv4/tcp_input.c The e{uid,gid} --> {uid,gid} credentials fix conflicted with the cleanup in net-next to now pass cred structs around. The be2net driver had a bug fix in 'net' that overlapped with the VLAN interface changes by Patrick McHardy in net-next. An IGB conflict existed because in 'net' the build_skb() support was reverted, and in 'net-next' there was a comment style fix within that code. Several batman-adv conflicts were resolved by making sure that all calls to batadv_is_my_mac() are changed to have a new bat_priv first argument. Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO rewrite in 'net-next', mostly overlapping changes. Thanks to Stephen Rothwell and Antonio Quartulli for help with several of these merge resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-07scm: Stop passing struct credEric W. Biederman
Now that uids and gids are completely encapsulated in kuid_t and kgid_t we no longer need to pass struct cred which allowed us to test both the uid and the user namespace for equality. Passing struct cred potentially allows us to pass the entire group list as BSD does but I don't believe the cost of cache line misses justifies retaining code for a future potential application. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/nfc/microread/mei.c net/netfilter/nfnetlink_queue_core.c Pull in 'net' to get Eric Biederman's AF_UNIX fix, upon which some cleanups are going to go on-top. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-05af_unix: If we don't care about credentials coallesce all messagesEric W. Biederman
It was reported that the following LSB test case failed https://lsbbugs.linuxfoundation.org/attachment.cgi?id=2144 because we were not coallescing unix stream messages when the application was expecting us to. The problem was that the first send was before the socket was accepted and thus sock->sk_socket was NULL in maybe_add_creds, and the second send after the socket was accepted had a non-NULL value for sk->socket and thus we could tell the credentials were not needed so we did not bother. The unnecessary credentials on the first message cause unix_stream_recvmsg to start verifying that all messages had the same credentials before coallescing and then the coallescing failed because the second message had no credentials. Ignoring credentials when we don't care in unix_stream_recvmsg fixes a long standing pessimization which would fail to coallesce messages when reading from a unix stream socket if the senders were different even if we did not care about their credentials. I have tested this and verified that the in the LSB test case mentioned above that the messages do coallesce now, while the were failing to coallesce without this change. Reported-by: Karel Srot <ksrot@redhat.com> Reported-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-05Revert "af_unix: dont send SCM_CREDENTIAL when dest socket is NULL"Eric W. Biederman
This reverts commit 14134f6584212d585b310ce95428014b653dfaf6. The problem that the above patch was meant to address is that af_unix messages are not being coallesced because we are sending unnecesarry credentials. Not sending credentials in maybe_add_creds totally breaks unconnected unix domain sockets that wish to send credentails to other sockets. In practice this break some versions of udev because they receive a message and the sending uid is bogus so they drop the message. Reported-by: Sven Joachim <svenjoac@gmx.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-02net: fix smatch warnings inside datagram_pollJacob Keller
Commit 7d4c04fc170087119727119074e72445f2bb192b ("net: add option to enable error queue packets waking select") has an issue due to operator precedence causing the bit-wise OR to bind to the sock_flags call instead of the result of the terniary conditional. This fixes the *_poll functions to work properly. The old code results in "mask |= POLLPRI" instead of what was intended, which is to only include POLLPRI when the socket option is enabled. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-31net: add option to enable error queue packets waking selectKeller, Jacob E
Currently, when a socket receives something on the error queue it only wakes up the socket on select if it is in the "read" list, that is the socket has something to read. It is useful also to wake the socket if it is in the error list, which would enable software to wait on error queue packets without waking up for regular data on the socket. The main use case is for receiving timestamped transmit packets which return the timestamp to the socket via the error queue. This enables an application to select on the socket for the error queue only instead of for the regular traffic. -v2- * Added the SO_SELECT_ERR_QUEUE socket option to every architechture specific file * Modified every socket poll function that checks error queue Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Cc: Jeffrey Kirsher <jeffrey.t.kirsher@intel.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Matthew Vick <matthew.vick@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26af_unix: dont send SCM_CREDENTIAL when dest socket is NULLdingtianhong
SCM_SCREDENTIALS should apply to write() syscalls only either source or destination socket asserted SOCK_PASSCRED. The original implememtation in maybe_add_creds is wrong, and breaks several LSB testcases ( i.e. /tset/LSB.os/netowkr/recvfrom/T.recvfrom). Origionally-authored-by: Karel Srot <ksrot@redhat.com> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-25unix: fix a race condition in unix_release()Paul Moore
As reported by Jan, and others over the past few years, there is a race condition caused by unix_release setting the sock->sk pointer to NULL before properly marking the socket as dead/orphaned. This can cause a problem with the LSM hook security_unix_may_send() if there is another socket attempting to write to this partially released socket in between when sock->sk is set to NULL and it is marked as dead/orphaned. This patch fixes this by only setting sock->sk to NULL after the socket has been marked as dead; I also take the opportunity to make unix_release_sock() a void function as it only ever returned 0/success. Dave, I think this one should go on the -stable pile. Special thanks to Jan for coming up with a reproducer for this problem. Reported-by: Jan Stancek <jan.stancek@gmail.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-27hlist: drop the node parameter from iteratorsSasha Levin
I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-18net: proc: change proc_net_remove to remove_proc_entryGao feng
proc_net_remove is only used to remove proc entries that under /proc/net,it's not a general function for removing proc entries of netns. if we want to remove some proc entries which under /proc/net/stat/, we still need to call remove_proc_entry. this patch use remove_proc_entry to replace proc_net_remove. we can remove proc_net_remove after this patch. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18net: proc: change proc_net_fops_create to proc_createGao feng
Right now, some modules such as bonding use proc_create to create proc entries under /proc/net/, and other modules such as ipv4 use proc_net_fops_create. It looks a little chaos.this patch changes all of proc_net_fops_create to proc_create. we can remove proc_net_fops_create after this patch. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-09unix: Use FIELD_SIZEOF() in af_unix_init().YOSHIFUJI Hideaki / 吉藤英明
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-17af_unix: old_cred is surplusAlan Cox
Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31af_unix: fix shutdown parameter checkingXi Wang
Return -EINVAL rather than 0 given an invalid "mode" parameter. Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21af_netlink: force credentials passing [CVE-2012-3520]Eric Dumazet
Pablo Neira Ayuso discovered that avahi and potentially NetworkManager accept spoofed Netlink messages because of a kernel bug. The kernel passes all-zero SCM_CREDENTIALS ancillary data to the receiver if the sender did not provide such data, instead of not including any such data at all or including the correct data from the peer (as it is the case with AF_UNIX). This bug was introduced in commit 16e572626961 (af_unix: dont send SCM_CREDENTIALS by default) This patch forces passing credentials for netlink, as before the regression. Another fix would be to not add SCM_CREDENTIALS in netlink messages if not provided by the sender, but it might break some programs. With help from Florian Weimer & Petr Matousek This issue is designated as CVE-2012-3520 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second vfs pile from Al Viro: "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the deadlock reproduced by xfstests 068), symlink and hardlink restriction patches, plus assorted cleanups and fixes. Note that another fsfreeze deadlock (emergency thaw one) is *not* dealt with - the series by Fernando conflicts a lot with Jan's, breaks userland ABI (FIFREEZE semantics gets changed) and trades the deadlock for massive vfsmount leak; this is going to be handled next cycle. There probably will be another pull request, but that stuff won't be in it." Fix up trivial conflicts due to unrelated changes next to each other in drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) delousing target_core_file a bit Documentation: Correct s_umount state for freeze_fs/unfreeze_fs fs: Remove old freezing mechanism ext2: Implement freezing btrfs: Convert to new freezing mechanism nilfs2: Convert to new freezing mechanism ntfs: Convert to new freezing mechanism fuse: Convert to new freezing mechanism gfs2: Convert to new freezing mechanism ocfs2: Convert to new freezing mechanism xfs: Convert to new freezing code ext4: Convert to new freezing mechanism fs: Protect write paths by sb_start_write - sb_end_write fs: Skip atime update on frozen filesystem fs: Add freezing handling to mnt_want_write() / mnt_drop_write() fs: Improve filesystem freezing handling switch the protection of percpu_counter list to spinlock nfsd: Push mnt_want_write() outside of i_mutex btrfs: Push mnt_want_write() outside of i_mutex fat: Push mnt_want_write() outside of i_mutex ...
2012-07-29clean unix_bind() up a bitAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29pull mnt_want_write()/mnt_drop_write() into ↵Al Viro
kern_path_create()/done_path_create() resp. One side effect - attempt to create a cross-device link on a read-only fs fails with EROFS instead of EXDEV now. Makes more sense, POSIX allows, etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29new helper: done_path_create()Al Viro
releases what needs to be released after {kern,user}_path_create() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>