linux/linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2015-10-13	net/fsl_pq_mdio: check TBI address for consistency with mapped range	Gerlando Falauto
	When configuring the MDIO subsystem it is also necessary to configure the TBI register. Make sure the TBI is contained within the mapped register range in order to: a) make sure the address is computed correctly b) make users aware that we're actually accessing that register In case of error, print a message but continue anyway. Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com> Cc: Timur Tabi <timur@tabi.org> Cc: David S. Miller <davem@davemloft.net> Cc: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13	Merge branch 'dsa-mv88e6xxx-fix-hardware-bridging'	David S. Miller
	Vivien Didelot says: ==================== net: dsa: mv88e6xxx: fix hardware bridging DSA and its drivers currently hook the NETDEV_CHANGEUPPER net_device event in order to configure the VLAN map of every port. This VLAN map is a feature of these switch chips to hardcode and restrict which output ports a given input port can egress frames to. A Linux bridge is a simple untagged VLAN propagated by the bridge code itself. With a proper 802.1Q support, a driver does not need this hook anymore, and will simply program the related VLAN object. This patchset improves the hardware bridging code in the mv88e6xxx driver with a strict 802.1Q mode. Ideally, the equivalent must be done for Broadcom Starfighter 2 and Rocker, before completely getting rid of this hook. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13	net: dsa: mv88e6xxx: fix hardware bridging	Vivien Didelot
	Playing with the VLAN map of every port to implement "hardware bridging" in the 88E6352 driver was a hack until full 802.1Q was supported. Indeed with 802.1Q port mode "Disabled" or "Fallback", this feature is used to restrict which output ports an input port can egress frames to. A Linux bridge is an untagged VLAN. With full 802.1Q support, we don't need this hack anymore and can use the "Secure" strict 802.1Q port mode. With this mode, the port-based VLAN map still needs to be configured, but all the logic is VTU-centric. This means that the switch only cares about rules described in its hardware VLAN table, which is exactly what Linux bridge expects and what we want. Note also that the hardware bridging was broken with the previous flexible "Fallback" 802.1Q port mode. Here's an example: Port0 and Port1 belong to the same bridge. If Port0 sends crafted tagged frames with VID 200 to Port1, Port1 receives it. Even if Port1 is in hardware VLAN 200, but not Port0, Port1 will still receive it, because Fallback mode doesn't care about invalid VID or non-member source port. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13	net: dsa: do not warn unsupported bridge ops	Vivien Didelot
	A DSA driver may not provide the port_join_bridge and port_leave_bridge functions, so don't warn in such case. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13	net: dsa: mv88e6xxx: do not support per-port FID	Vivien Didelot
	Since we configure a switch chip through a Linux bridge, and a bridge is implemented as a VLAN, there is no need for per-port FID anymore. This patch gets rid of this and simplifies the driver code since we can now directly map all 4095 FIDs available to all VLANs. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13	net: dsa: mv88e6xxx: bridges do not need an FID	Vivien Didelot
	With 88E6352 and similar switch chips, each port has a map to restrict which output port this input port can egress frames to. The current driver code implements hardware bridging using this feature, and assigns to a bridge group the FID of its first member. Now that 802.1Q is fully implemented in this driver, a Linux bridge which is a simple untagged VLAN, already gets its own FID. This patch gets rid of the per-bridge FID and explicits the usage of the port based VLAN map feature. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13	RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in ↵	Sowmini Varadhan
	rds_tcp_accept_one() Consider the following "duelling syn" sequence between two peers A and B: A B SYN1 --> <-- SYN2 SYN2ACK --> Note that the SYN/ACK has already been sent out by TCP before rds_tcp_accept_one() gets invoked as part of callbacks. If the inet_addr(A) is numerically less than inet_addr(B), the arbitration scheme in rds_tcp_accept_one() will prefer the TCP connection triggered by SYN1, and will send a CLOSE for the SYN2 (just after the SYN2ACK was sent). Since B also follows the same arbitration scheme, it will send the SYN-ACK for SYN1 that will set up a healthy ESTABLISHED connection on both sides. B will also get a CLOSE for SYN2, which should result in the cleanup of the TCP state machine for SYN2, but it should not trigger any stale RDS-TCP callbacks (such as ->writespace, ->state_change etc), that would disrupt the progress of the SYN2 based RDS-TCP connection. Thus the arbitration scheme in rds_tcp_accept_one() should restore rds_tcp callbacks for the winner before setting them up for the new accept socket, and also make sure that conn->c_outgoing is set to 0 so that we do not trigger any reconnect attempts on the passive side of the tcp socket in the future, in conformance with commit c82ac7e69efe ("net/rds: RDS-TCP: only initiate reconnect attempt on outgoing TCP socket.") Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13	RDS: Invoke ->laddr_check() in rds_bind() for explicitly bound transports.	Sowmini Varadhan
	The IP address passed to rds_bind() should be vetted by the transport's ->laddr_check() for a previously bound transport. This needs to be done to avoid cases where, for example, the application has asked for an IB transport, but the IP address passed to bind is only usable on ethernet interfaces. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13	qlcnic: constify qlcnic_mbx_ops structure	Julia Lawall
	The only instance of a qlcnic_mbx_ops structure is never modified. Thus the declaration of the structure and all references to the structure type can be made const. In the definition of the qlcnic_mailbox structure, the ops field is no longer lined up with the other fields. This was left as is, to avoid a lot of trivial changes on the other lines. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Sony Chacko <sony.chacko@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-13	netfilter: nfqueue: don't use prev pointer	Florian Westphal
	Usage of -prev seems buggy. While packet was out our hook cannot be removed but we have no way to know if the previous one is still valid. So better not use ->prev at all. Since NF_REPEAT just asks to invoke same hook function again, just do so, and continue with nf_interate if we get an ACCEPT verdict. A side effect of this change is that if nf_reinject(NF_REPEAT) causes another REPEAT we will now drop the skb instead of a kernel loop. However, NF_REPEAT loops would be a bug so this should not happen anyway. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-13	mac80211: Fix hwflags debugfs file format	Mohammed Shafi Shajakhan
	Commit 30686bf7f5b3 ("mac80211: convert HW flags to unsigned long bitmap") accidentally removed the newline delimiter from the hwflags debugfs file. Fix this by adding back the newline between the HW flags. Cc: stable@vger.kernel.org [4.2] Signed-off-by: Mohammed Shafi Shajakhan <mohammed@qti.qualcomm.com> [fix commit log] Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-10-13	drm/vmwgfx: Fix kernel NULL pointer dereference on older hardware	Thomas Hellstrom
	The commit "drm/vmwgfx: Fix up user_dmabuf refcounting", while fixing a kernel crash introduced a NULL pointer dereference on older hardware. Fix this. Cc: <stable@vger.kernel.org> Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Sinclair Yeh <syeh@vmware.com> Reviewed-by: Brian Paul <brianp@vmware.com>
2015-10-13	selftests/powerpc: Fix build failure of load_unaligned_zeropad test	Michael Ellerman
	Commit 7a5692e6e533 ("arch/powerpc: provide zero_bytemask() for big-endian") added a call to __fls() in our word-at-a-time.h. That was fine for the kernel build but missed the fact that we also use word-at-a-time.h in a userspace test. Pulling in the kernel version of __fls() gets messy, so just define our own, it's unlikely to change often. Fixes: 7a5692e6e533 ("arch/powerpc: provide zero_bytemask() for big-endian") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-10-12	bridge: vlan: enforce no pvid flag in vlan ranges	Nikolay Aleksandrov
	Currently it's possible for someone to send a vlan range to the kernel with the pvid flag set which will result in the pvid bouncing from a vlan to vlan and isn't correct, it also introduces problems for hardware where it doesn't make sense having more than 1 pvid. iproute2 already enforces this, so let's enforce it on kernel-side as well. Reported-by: Elad Raz <eladr@mellanox.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	atm: iphase: fix misleading indention	Tillmann Heidsieck
	Fix a smatch warning: drivers/atm/iphase.c:1178 rx_pkt() warn: curly braces intended? The code is correct, the indention is misleading. In case the allocation of skb fails, we want to skip to the end. Signed-off-by: Tillmann Heidsieck <theidsieck@leenox.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	atm: iphase: return -ENOMEM instead of -1 in case of failed kmalloc()	Tillmann Heidsieck
	Smatch complains about returning hard coded error codes, silence this warning. drivers/atm/iphase.c:115 ia_enque_rtn_q() warn: returning -1 instead of -ENOMEM is sloppy Signed-off-by: Tillmann Heidsieck <theidsieck@leenox.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	ipv6 route: use err pointers instead of returning pointer by reference	Roopa Prabhu
	This patch makes ip6_route_info_create return err pointer instead of returning the rt pointer by reference as suggested by Dave Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	net: hns: fix the unknown phy_nterface_t type error	huangdaode
	This patch fix the building error reported by Jiri Pirko <jiri@resnulli.us> drivers/net/ethernet/hisilicon/hns/hnae.h:465:2: error: unknown type name 'phy_interface_t' phy_interface_t phy_if; ^ the full build log is on https://lists.01.org/pipermail/kbuild-all. Signed-off-by: huangdaode <huangdaode@hisilicon.com> Signed-off-by: yankejian <yankejian@huawei.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	tun: use sk_fullsock() before reading sk->sk_tsflags	Eric Dumazet
	timewait or request sockets are small and do not contain sk->sk_tsflags Without this fix, we might read garbage, and crash later in __skb_complete_tx_timestamp() -> sock_queue_err_skb() (These pseudo sockets do not have an error queue either) Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	Merge branch 'netns-defrag'	David S. Miller
	Eric W. Biederman says: ==================== net: Pass net into defragmentation This is the next installment of my work to pass struct net through the output path so the code does not need to guess how to figure out which network namespace it is in, and ultimately routes can have output devices in another network namespace. In netfilter and af_packet we defragment packets in the output path, and there is the usual amount of confusion about how to compute which net we are processing the packets in. This patchset clears that confusion up by explicitly passing in struct net in ip_defrag, ip_check_defrag, and nf_ct_frag6_gather. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	ipv6: Pass struct net into nf_ct_frag6_gather	Eric W. Biederman
	The function nf_ct_frag6_gather is called on both the input and the output paths of the networking stack. In particular ipv6_defrag which calls nf_ct_frag6_gather is called from both the the PRE_ROUTING chain on input and the LOCAL_OUT chain on output. The addition of a net parameter makes it explicit which network namespace the packets are being reassembled in, and removes the need for nf_ct_frag6_gather to guess. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	ipv4: Pass struct net into ip_defrag and ip_check_defrag	Eric W. Biederman
	The function ip_defrag is called on both the input and the output paths of the networking stack. In particular conntrack when it is tracking outbound packets from the local machine calls ip_defrag. So add a struct net parameter and stop making ip_defrag guess which network namespace it needs to defragment packets in. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	ipv4: Only compute net once in ip_call_ra_chain	Eric W. Biederman
	ip_call_ra_chain is called early in the forwarding chain from ip_forward and ip_mr_input, which makes skb->dev the correct expression to get the input network device and dev_net(skb->dev) a correct expression for the network namespace the packet is being processed in. Compute the network namespace and store it in a variable to make the code clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	packet: fix match_fanout_group()	Eric Dumazet
	Recent TCP listener patches exposed a prior af_packet bug : match_fanout_group() blindly assumes it is always safe to cast sk to a packet socket to compare fanout with af_packet_priv But SYNACK packets can be sent while attached to request_sock, which are smaller than a "struct sock". We can read non existent memory and crash. Fixes: c0de08d04215 ("af_packet: don't emit packet on orig fanout group") Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Eric Leblond <eric@regit.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	rtnetlink: fix gcc -Wconversion warning	Arad, Ronen
	RTA_ALIGNTO is currently define as 4. It has to be 4U to prevent warning for RTA_ALIGN and RTA_DATA expansions when -Wconversion gcc option is enabled. This follows NLMSG_ALIGNTO definition in <include/uapi/linux/netlink.h>. Signed-off-by: Ronen Arad <ronen.arad@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	Merge tag 'wireless-drivers-next-for-davem-2015-10-09' of ↵	David S. Miller
	git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next Kalle Valo says: ==================== Major changes: iwlwifi * some debugfs improvements * fix signedness in beacon statistics * deinline some functions to reduce size when device tracing is enabled * filter beacons out in AP mode when no stations are associated * deprecate firmwares version -12 * fix a runtime PM vs. legacy suspend race * one-liner fix for a ToF bug * clean-ups in the rx code * small debugging improvement * fix WoWLAN with new firmware versions * more clean-ups towards multiple RX queues; * some rate scaling fixes and improvements; * some time-of-flight fixes; * other generic improvements and clean-ups; brcmfmac * rework code dealing with multiple interfaces * allow logging firmware console using debug level * support for BCM4350, BCM4365, and BCM4366 PCIE devices * fixed for legacy P2P and P2P device handling * correct set and get tx-power ath9k * add support for Outside Context of a BSS (OCB) mode mwifiex * add USB multichannel feature ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	ipv4/icmp: redirect messages can use the ingress daddr as source	Paolo Abeni
	This patch allows configuring how the source address of ICMP redirect messages is selected; by default the old behaviour is retained, while setting icmp_redirects_use_orig_daddr force the usage of the destination address of the packet that caused the redirect. The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the following scenario: Two machines are set up with VRRP to act as routers out of a subnet, they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to x.x.x.254/24. If a host in said subnet needs to get an ICMP redirect from the VRRP router, i.e. to reach a destination behind a different gateway, the source IP in the ICMP redirect is chosen as the primary IP on the interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2. The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2, and will continue to use the wrong next-op. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	bridge: try switchdev op first in __vlan_vid_add/del	Jiri Pirko
	Some drivers need to implement both switchdev vlan ops and vid_add/kill ndos. For that to work in bridge code, we need to try switchdev op first when adding/deleting vlan id. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	BNX2: free temp_stats_blk on error path	wangweidong
	In bnx2_init_board, missing free temp_stats_blk on error path when some operations do failed. Just add the 'kfree' operation. Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	Merge branch 'setsockopt_incoming_cpu'	David S. Miller
	Eric Dumazet says: ==================== tcp: better smp listener behavior As promised in last patch series, we implement a better SO_REUSEPORT strategy, based on cpu hints if given by the application. We also moved sk_refcnt out of the cache line containing the lookup keys, as it was considerably slowing down smp operations because of false sharing. This was simpler than converting listen sockets to conventional RCU (to avoid sk_refcnt dirtying) Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	tcp: shrink tcp_timewait_sock by 8 bytes	Eric Dumazet
	Reducing tcp_timewait_sock from 280 bytes to 272 bytes allows SLAB to pack 15 objects per page instead of 14 (on x86) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	net: shrink struct sock and request_sock by 8 bytes	Eric Dumazet
	One 32bit hole is following skc_refcnt, use it. skc_incoming_cpu can also be an union for request_sock rcv_wnd. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	net: align sk_refcnt on 128 bytes boundary	Eric Dumazet
	sk->sk_refcnt is dirtied for every TCP/UDP incoming packet. This is a performance issue if multiple cpus hit a common socket, or multiple sockets are chained due to SO_REUSEPORT. By moving sk_refcnt 8 bytes further, first 128 bytes of sockets are mostly read. As they contain the lookup keys, this has a considerable performance impact, as cpus can cache them. These 8 bytes are not wasted, we use them as a place holder for various fields, depending on the socket type. Tested: SYN flood hitting a 16 RX queues NIC. TCP listener using 16 sockets and SO_REUSEPORT and SO_INCOMING_CPU for proper siloing. Could process 6.0 Mpps SYN instead of 4.2 Mpps Kernel profile looked like : 11.68% [kernel] [k] sha_transform 6.51% [kernel] [k] __inet_lookup_listener 5.07% [kernel] [k] __inet_lookup_established 4.15% [kernel] [k] memcpy_erms 3.46% [kernel] [k] ipt_do_table 2.74% [kernel] [k] fib_table_lookup 2.54% [kernel] [k] tcp_make_synack 2.34% [kernel] [k] tcp_conn_request 2.05% [kernel] [k] __netif_receive_skb_core 2.03% [kernel] [k] kmem_cache_alloc Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	net: SO_INCOMING_CPU setsockopt() support	Eric Dumazet
	SO_INCOMING_CPU as added in commit 2c8c56e15df3 was a getsockopt() command to fetch incoming cpu handling a particular TCP flow after accept() This commits adds setsockopt() support and extends SO_REUSEPORT selection logic : If a TCP listener or UDP socket has this option set, a packet is delivered to this socket only if CPU handling the packet matches the specified one. This allows to build very efficient TCP servers, using one listener per RX queue, as the associated TCP listener should only accept flows handled in softirq by the same cpu. This provides optimal NUMA behavior and keep cpu caches hot. Note that __inet_lookup_listener() still has to iterate over the list of all listeners. Following patch puts sk_refcnt in a different cache line to let this iteration hit only shared and read mostly cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	packet: support per-packet fwmark for af_packet sendmsg	Edward Jee
	Signed-off-by: Edward Hyunkoo Jee <edjee@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	sock: support per-packet fwmark	Edward Jee
	It's useful to allow users to set fwmark for an individual packet, without changing the socket state. The function this patch adds in sock layer can be used by the protocols that need such a feature. Signed-off-by: Edward Hyunkoo Jee <edjee@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	Merge branch 'bpf-unprivileged'	David S. Miller
	Alexei Starovoitov says: ==================== bpf: unprivileged v1-v2: - this set logically depends on cb patch "bpf: fix cb access in socket filter programs": http://patchwork.ozlabs.org/patch/527391/ which is must have to allow unprivileged programs. Thanks Daniel for finding that issue. - refactored sysctl to be similar to 'modules_disabled' - dropped bpf_trace_printk - split tests into separate patch and added more tests based on discussion v1 cover letter: I think it is time to liberate eBPF from CAP_SYS_ADMIN. As was discussed when eBPF was first introduced two years ago the only piece missing in eBPF verifier is 'pointer leak detection' to make it available to non-root users. Patch 1 adds this pointer analysis. The eBPF programs, obviously, need to see and operate on kernel addresses, but with these extra checks they won't be able to pass these addresses to user space. Patch 2 adds accounting of kernel memory used by programs and maps. It changes behavoir for existing root users, but I think it needs to be done consistently for both root and non-root, since today programs and maps are only limited by number of open FDs (RLIMIT_NOFILE). Patch 2 accounts program's and map's kernel memory as RLIMIT_MEMLOCK. Unprivileged eBPF is only meaningful for 'socket filter'-like programs. eBPF programs for tracing and TC classifiers/actions will stay root only. In parallel the bpf fuzzing effort is ongoing and so far we've found only one verifier bug and that was already fixed. The 'constant blinding' pass also being worked on. It will obfuscate constant-like values that are part of eBPF ISA to make jit spraying attacks even harder. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	bpf: add unprivileged bpf tests	Alexei Starovoitov
	Add new tests samples/bpf/test_verifier: unpriv: return pointer checks that pointer cannot be returned from the eBPF program unpriv: add const to pointer unpriv: add pointer to pointer unpriv: neg pointer checks that pointer arithmetic is disallowed unpriv: cmp pointer with const unpriv: cmp pointer with pointer checks that comparison of pointers is disallowed Only one case allowed 'void value = bpf_map_lookup_elem(..); if (value == 0) ...' unpriv: check that printk is disallowed since bpf_trace_printk is not available to unprivileged unpriv: pass pointer to helper function checks that pointers cannot be passed to functions that expect integers If function expects a pointer the verifier allows only that type of pointer. Like 1st argument of bpf_map_lookup_elem() must be pointer to map. (applies to non-root as well) unpriv: indirectly pass pointer on stack to helper function checks that pointer stored into stack cannot be used as part of key passed into bpf_map_lookup_elem() unpriv: mangle pointer on stack 1 unpriv: mangle pointer on stack 2 checks that writing into stack slot that already contains a pointer is disallowed unpriv: read pointer from stack in small chunks checks that < 8 byte read from stack slot that contains a pointer is disallowed unpriv: write pointer into ctx checks that storing pointers into skb->fields is disallowed unpriv: write pointer into map elem value checks that storing pointers into element values is disallowed For example: int bpf_prog(struct __sk_buff skb) { u32 key = 0; u64 value = bpf_map_lookup_elem(&map, &key); if (value) value = (u64) skb; } will be rejected. unpriv: partial copy of pointer checks that doing 32-bit register mov from register containing a pointer is disallowed unpriv: pass pointer to tail_call checks that passing pointer as an index into bpf_tail_call is disallowed unpriv: cmp map pointer with zero checks that comparing map pointer with constant is disallowed unpriv: write into frame pointer checks that frame pointer is read-only (applies to root too) unpriv: cmp of frame pointer checks that R10 cannot be using in comparison unpriv: cmp of stack pointer checks that Rx = R10 - imm is ok, but comparing Rx is not unpriv: obfuscate stack pointer checks that Rx = R10 - imm is ok, but Rx -= imm is not Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	bpf: charge user for creation of BPF maps and programs	Alexei Starovoitov
	since eBPF programs and maps use kernel memory consider it 'locked' memory from user accounting point of view and charge it against RLIMIT_MEMLOCK limit. This limit is typically set to 64Kbytes by distros, so almost all bpf+tracing programs would need to increase it, since they use maps, but kernel charges maximum map size upfront. For example the hash map of 1024 elements will be charged as 64Kbyte. It's inconvenient for current users and changes current behavior for root, but probably worth doing to be consistent root vs non-root. Similar accounting logic is done by mmap of perf_event. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	bpf: enable non-root eBPF programs	Alexei Starovoitov
	In order to let unprivileged users load and execute eBPF programs teach verifier to prevent pointer leaks. Verifier will prevent - any arithmetic on pointers (except R10+Imm which is used to compute stack addresses) - comparison of pointers (except if (map_value_ptr == 0) ... ) - passing pointers to helper functions - indirectly passing pointers in stack to helper functions - returning pointer from bpf program - storing pointers into ctx or maps Spill/fill of pointers into stack is allowed, but mangling of pointers stored in the stack or reading them byte by byte is not. Within bpf programs the pointers do exist, since programs need to be able to access maps, pass skb pointer to LD_ABS insns, etc but programs cannot pass such pointer values to the outside or obfuscate them. Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs, so that socket filters (tcpdump), af_packet (quic acceleration) and future kcm can use it. tracing and tc cls/act program types still require root permissions, since tracing actually needs to be able to see all kernel pointers and tc is for root only. For example, the following unprivileged socket filter program is allowed: int bpf_prog1(struct __sk_buff skb) { u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)); u64 value = bpf_map_lookup_elem(&my_map, &index); if (value) value += skb->len; return 0; } but the following program is not: int bpf_prog1(struct __sk_buff skb) { u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)); u64 value = bpf_map_lookup_elem(&my_map, &index); if (value) value += (u64) skb; return 0; } since it would leak the kernel address into the map. Unprivileged socket filter bpf programs have access to the following helper functions: - map lookup/update/delete (but they cannot store kernel pointers into them) - get_random (it's already exposed to unprivileged user space) - get_smp_processor_id - tail_call into another socket filter program - ktime_get_ns The feature is controlled by sysctl kernel.unprivileged_bpf_disabled. This toggle defaults to off (0), but can be set true (1). Once true, bpf programs and maps cannot be accessed from unprivileged process, and the toggle cannot be set back to false. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12	netfilter: nfnetlink_log: autoload nf_conntrack_netlink module ↵	Ken-ichirou MATSUZAWA
	NFQA_CFG_F_CONNTRACK config flag This patch enables to load nf_conntrack_netlink module if NFULNL_CFG_F_CONNTRACK config flag is specified. Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12	arm64: Fix MINSIGSTKSZ and SIGSTKSZ	Manjeet Pawar
	MINSIGSTKSZ and SIGSTKSZ for ARM64 are not correctly set in latest kernel. This patch fixes this issue. This issue is reported in LTP (testcase: sigaltstack02.c). Testcase failed when sigaltstack() called with stack size "MINSIGSTKSZ - 1" Since in Glibc-2.22, MINSIGSTKSZ is set to 5120 but in kernel it is set to 2048 so testcase gets failed. Testcase Output: sigaltstack02 1 TPASS : stgaltstack() fails, Invalid Flag value,errno:22 sigaltstack02 2 TFAIL : sigaltstack() returned 0, expected -1,errno:12 Reported Issue in Glibc Bugzilla: Bugfix in Glibc-2.22: [Bug 16850] https://sourceware.org/bugzilla/show_bug.cgi?id=16850 Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Akhilesh Kumar <akhilesh.k@samsung.com> Signed-off-by: Manjeet Pawar <manjeet.p@samsung.com> Signed-off-by: Rohit Thapliyal <r.thapliyal@samsung.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2015-10-12	arm64: errata: use KBUILD_CFLAGS_MODULE for erratum #843419	Will Deacon
	Commit df057cc7b4fa ("arm64: errata: add module build workaround for erratum #843419") sets CFLAGS_MODULE to ensure that the large memory model is used by the compiler when building kernel modules. However, CFLAGS_MODULE is an environment variable and intended to be overridden on the command line, which appears to be the case with the Ubuntu kernel packaging system, so use KBUILD_CFLAGS_MODULE instead. Cc: <stable@vger.kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Fixes: df057cc7b4fa ("arm64: errata: add module build workaround for erratum #843419") Reported-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2015-10-12	svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE	Chuck Lever
	Now that the NFS server advertises a maximum payload size of 1MB for RPC/RDMA again, it crashes in svc_process_common() when NFS client sends a 1MB NFS WRITE on an NFS/RDMA mount. The server has set up a 259 element array of struct page pointers in rq_pages[] for each incoming request. The last element of the array is NULL. When an incoming request has been completely received, rdma_read_complete() attempts to set the starting page of the incoming page vector: rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count]; and the page to use for the reply: rqstp->rq_respages = &rqstp->rq_arg.pages[page_no]; But the value of page_no has already accounted for head->hdr_count. Thus rq_respages now points past the end of the incoming pages. For NFS WRITE operations smaller than the maximum, this is harmless. But when the NFS WRITE operation is as large as the server's max payload size, rq_respages now points at the last entry in rq_pages, which is NULL. Fixes: cc9a903d915c ('svcrdma: Change maximum server payload . . .') BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Shirley Ma <shirley.ma@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-12	netfilter: bridge: avoid unused label warning	Arnd Bergmann
	With the ARM mini2440_defconfig, the bridge netfilter code gets built with both CONFIG_NF_DEFRAG_IPV4 and CONFIG_NF_DEFRAG_IPV6 disabled, which leads to a harmless gcc warning: net/bridge/br_netfilter_hooks.c: In function 'br_nf_dev_queue_xmit': net/bridge/br_netfilter_hooks.c:792:2: warning: label 'drop' defined but not used [-Wunused-label] This gets rid of the warning by cleaning up the code to avoid the respective #ifdefs causing this problem, and replacing them with if(IS_ENABLED()) checks. I have verified that the resulting object code is unchanged, and an additional advantage is that we now get compile coverage of the unused functions in more configurations. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: dd302b59bde0 ("netfilter: bridge: don't leak skb in error paths") Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12	Merge tag 'ipvs4-for-v4.4' of ↵	Pablo Neira Ayuso
	https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== Fourth Round of IPVS Updates for v4.4 please consider these build warning cleanups from David Ahern and myself. They resolve some minor side effects of Eric Biederman' heroic work to cleanup IPVS which you recently pulled: its queued up for v4.4 so no need to worry about earlier kernel versions. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12	nfnetlink_cttimeout: add rcu_barrier() on module removal	Pablo Neira Ayuso
	Make sure kfree_rcu() released objects before leaving the module removal exit path. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12	netfilter: conntrack: fix crash on timeout object removal	Pablo Neira Ayuso
	The object and module refcounts are updated for each conntrack template, however, if we delete the iptables rules and we flush the timeout database, we may end up with invalid references to timeout object that are just gone. Resolve this problem by setting the timeout reference to NULL when the custom timeout entry is removed from our base. This patch requires some RCU trickery to ensure safe pointer handling. This handling is similar to what we already do with conntrack helpers, the idea is to avoid bumping the timeout object reference counter from the packet path to avoid the cost of atomic ops. Reported-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12	netfilter: xt_CT: don't put back reference to timeout policy object	Pablo Neira Ayuso
	On success, this shouldn't put back the timeout policy object, otherwise we may have module refcount overflow and we allow deletion of timeout that are still in use. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-10-12	net: HNS: fix MDIO dependencies	Arnd Bergmann
	The newly introduced HNS_MDIO Kconfig symbol selects 'MDIO', but that is the wrong symbol as the code used by this driver is provided by PHYLIB rather than the MDIO driver. Also, there is no need to make this driver user selectable, because it is already selected by all drivers that need it. This changes the Kconfig file to select the correct library, and to make the option silent. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 5b904d39406 ("net: add Hisilicon Network Subsystem MDIO support") Signed-off-by: David S. Miller <davem@davemloft.net>