summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2016-07-12net: dsa: Fix non static symbol warningWei Yongjun
Fixes the following sparse warning: net/dsa/dsa2.c:680:6: warning: symbol '_dsa_unregister_switch' was not declared. Should it be static? Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-12rxrpc: Fix error handling in af_rxrpc_init()Wei Yongjun
security initialized after alloc workqueue, so we should exit security before destroy workqueue in the error handing. Fixes: 648af7fca159 ("rxrpc: Absorb the rxkad security module") Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree. they are: 1) Fix leak in the error path of nft_expr_init(), from Liping Zhang. 2) Tracing from nf_tables cannot be disabled, also from Zhang. 3) Fix an integer overflow on 32bit archs when setting the number of hashtable buckets, from Florian Westphal. 4) Fix configuration of ipvs sync in backup mode with IPv6 address, from Quentin Armitage via Simon Horman. 5) Fix incorrect timeout calculation in nft_ct NFT_CT_EXPIRATION, from Florian Westphal. 6) Skip clash resolution in conntrack insertion races if NAT is in place. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-12netfilter: conntrack: skip clash resolution if nat is in placePablo Neira Ayuso
The clash resolution is not easy to apply if the NAT table is registered. Even if no NAT rules are installed, the nul-binding ensures that a unique tuple is used, thus, the packet that loses race gets a different source port number, as described by: http://marc.info/?l=netfilter-devel&m=146818011604484&w=2 Clash resolution with NAT is also problematic if addresses/port range ports are used since the conntrack that wins race may describe a different mangling that we may have earlier applied to the packet via nf_nat_setup_info(). Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
2016-07-12netfilter: conntrack: protect early_drop by rcu read lockLiping Zhang
User can add ct entry via nfnetlink(IPCTNL_MSG_CT_NEW), and if the total number reach the nf_conntrack_max, we will try to drop some ct entries. But in this case(the main function call path is ctnetlink_create_conntrack -> nf_conntrack_alloc -> early_drop), rcu_read_lock is not held, so race with hash resize will happen. Fixes: 242922a02717 ("netfilter: conntrack: simplify early_drop") Cc: Florian Westphal <fw@strlen.de> Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11ipv4: af_inet: make it explicitly non-modularPaul Gortmaker
The Makefile controlling compilation of this file is obj-y, meaning that it currently is never being built as a module. Since MODULE_ALIAS is a no-op for non-modular code, we can simply remove the MODULE_ALIAS_NETPROTO variant used here. We replace module.h with kmod.h since the file does make use of request_module() in order to load other modules from here. We don't have to worry about init.h coming in via the removed module.h since the file explicitly includes init.h already. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11tipc: reset all unicast links when broadcast send link failsJon Paul Maloy
In test situations with many nodes and a heavily stressed system we have observed that the transmission broadcast link may fail due to an excessive number of retransmissions of the same packet. In such situations we need to reset all unicast links to all peers, in order to reset and re-synchronize the broadcast link. In this commit, we add a new function tipc_bearer_reset_all() to be used in such situations. The function scans across all bearers and resets all their pertaining links. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11tipc: ensure correct broadcast send buffer release when peer is lostJon Paul Maloy
After a new receiver peer has been added to the broadcast transmission link, we allow immediate transmission of new broadcast packets, trusting that the new peer will not accept the packets until it has received the previously sent unicast broadcast initialiation message. In the same way, the sender must not accept any acknowledges until it has itself received the broadcast initialization from the peer, as well as confirmation of the reception of its own initialization message. Furthermore, when a receiver peer goes down, the sender has to produce the missing acknowledges from the lost peer locally, in order ensure correct release of the buffers that were expected to be acknowledged by the said peer. In a highly stressed system we have observed that contact with a peer may come up and be lost before the above mentioned broadcast initial- ization and confirmation have been received. This leads to the locally produced acknowledges being rejected, and the non-acknowledged buffers to linger in the broadcast link transmission queue until it fills up and the link goes into permanent congestion. In this commit, we remedy this by temporarily setting the corresponding broadcast receive link state to ESTABLISHED and the 'bc_peer_is_up' state to true before we issue the local acknowledges. This ensures that those acknowledges will always be accepted. The mentioned state values are restored immediately afterwards when the link is reset. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11tipc: extend broadcast link initialization criteriaJon Paul Maloy
At first contact between two nodes, an endpoint might sometimes have time to send out a LINK_PROTOCOL/STATE packet before it has received the broadcast initialization packet from the peer, i.e., before it has received a valid broadcast packet number to add to the 'bc_ack' field of the protocol message. This means that the peer endpoint will receive a protocol packet with an invalid broadcast acknowledge value of 0. Under unlucky circumstances this may lead to the original, already received acknowledge value being overwritten, so that the whole broadcast link goes stale after a while. We fix this by delaying the setting of the link field 'bc_peer_is_up' until we know that the peer really has received our own broadcast initialization message. The latter is always sent out as the first unicast message on a link, and always with seqeunce number 1. Because of this, we only need to look for a non-zero unicast acknowledge value in the arriving STATE messages, and once that is confirmed we know we are safe and can set the mentioned field. Before this moment, we must ignore all broadcast acknowledges from the peer. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sock: ignore SCM_RIGHTS and SCM_CREDENTIALS in __sock_cmsg_sendSoheil Hassas Yeganeh
Sergei Trofimovich reported that pulse audio sends SCM_CREDENTIALS as a control message to TCP. Since __sock_cmsg_send does not support SCM_RIGHTS and SCM_CREDENTIALS, it returns an error and hence breaks pulse audio over TCP. SCM_RIGHTS and SCM_CREDENTIALS are sent on the SOL_SOCKET layer but they semantically belong to SOL_UNIX. Since all cmsg-processing functions including sock_cmsg_send ignore control messages of other layers, it is best to ignore SCM_RIGHTS and SCM_CREDENTIALS for consistency (and also for fixing pulse audio over TCP). Fixes: c14ac9451c34 ("sock: enable timestamping using control messages") Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Reported-by: Sergei Trofimovich <slyfox@gentoo.org> Tested-by: Sergei Trofimovich <slyfox@gentoo.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11ipv4: reject RTNH_F_DEAD and RTNH_F_LINKDOWN from user spaceJulian Anastasov
Vegard Nossum is reporting for a crash in fib_dump_info when nh_dev = NULL and fib_nhs == 1: Pid: 50, comm: netlink.exe Not tainted 4.7.0-rc5+ RIP: 0033:[<00000000602b3d18>] RSP: 0000000062623890 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 000000006261b800 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000024 RDI: 000000006245ba00 RBP: 00000000626238f0 R08: 000000000000029c R09: 0000000000000000 R10: 0000000062468038 R11: 000000006245ba00 R12: 000000006245ba00 R13: 00000000625f96c0 R14: 00000000601e16f0 R15: 0000000000000000 Kernel panic - not syncing: Kernel mode fault at addr 0x2e0, ip 0x602b3d18 CPU: 0 PID: 50 Comm: netlink.exe Not tainted 4.7.0-rc5+ #581 Stack: 626238f0 960226a02 00000400 000000fe 62623910 600afca7 62623970 62623a48 62468038 00000018 00000000 00000000 Call Trace: [<602b3e93>] rtmsg_fib+0xd3/0x190 [<602b6680>] fib_table_insert+0x260/0x500 [<602b0e5d>] inet_rtm_newroute+0x4d/0x60 [<60250def>] rtnetlink_rcv_msg+0x8f/0x270 [<60267079>] netlink_rcv_skb+0xc9/0xe0 [<60250d4b>] rtnetlink_rcv+0x3b/0x50 [<60265400>] netlink_unicast+0x1a0/0x2c0 [<60265e47>] netlink_sendmsg+0x3f7/0x470 [<6021dc9a>] sock_sendmsg+0x3a/0x90 [<6021e0d0>] ___sys_sendmsg+0x300/0x360 [<6021fa64>] __sys_sendmsg+0x54/0xa0 [<6021fac0>] SyS_sendmsg+0x10/0x20 [<6001ea68>] handle_syscall+0x88/0x90 [<600295fd>] userspace+0x3fd/0x500 [<6001ac55>] fork_handler+0x85/0x90 $ addr2line -e vmlinux -i 0x602b3d18 include/linux/inetdevice.h:222 net/ipv4/fib_semantics.c:1264 Problem happens when RTNH_F_LINKDOWN is provided from user space when creating routes that do not use the flag, catched with netlink fuzzer. Currently, the kernel allows user space to set both flags to nh_flags and fib_flags but this is not intentional, the assumption was that they are not set. Fix this by rejecting both flags with EINVAL. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Fixes: 0eeb075fad73 ("net: ipv4 sysctl option to ignore routes when nexthop link is down") Signed-off-by: Julian Anastasov <ja@ssi.bg> Cc: Andy Gospodarek <gospo@cumulusnetworks.com> Cc: Dinesh Dutt <ddutt@cumulusnetworks.com> Cc: Scott Feldman <sfeldma@gmail.com> Reviewed-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11tcp: make challenge acks less predictableEric Dumazet
Yue Cao claims that current host rate limiting of challenge ACKS (RFC 5961) could leak enough information to allow a patient attacker to hijack TCP sessions. He will soon provide details in an academic paper. This patch increases the default limit from 100 to 1000, and adds some randomization so that the attacker can no longer hijack sessions without spending a considerable amount of probes. Based on initial analysis and patch from Linus. Note that we also have per socket rate limiting, so it is tempting to remove the host limit in the future. v2: randomize the count of challenge acks per second, not the period. Fixes: 282f23c6ee34 ("tcp: implement RFC 5961 3.2") Reported-by: Yue Cao <ycao009@ucr.edu> Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11tunnels: correct conditional build of MPLS and IPv6Simon Horman
Using a combination if #if conditionals and goto labels to unwind tunnel4_init seems unwieldy. This patch takes a simpler approach of directly unregistering previously registered protocols when an error occurs. This fixes a number of problems with the current implementation including the potential presence of labels when they are unused and the potential absence of unregister code when it is needed. Fixes: 8afe97e5d416 ("tunnels: support MPLS over IPv4 tunnels") Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: implement prsctp PRIO policyXin Long
prsctp PRIO policy is a policy to abandon lower priority chunks when asoc doesn't have enough snd buffer, so that the current chunk with higher priority can be queued successfully. Similar to TTL/RTX policy, we will set the priority of the chunk to prsctp_param with sinfo->sinfo_timetolive in sctp_set_prsctp_policy(). So if PRIO policy is enabled, msg->expire_at won't work. asoc->sent_cnt_removable will record how many chunks can be checked to remove. If priority policy is enabled, when the chunk is queued into the out_queue, we will increase sent_cnt_removable. When the chunk is moved to abandon_queue or dequeue and free, we will decrease sent_cnt_removable. In sctp_sendmsg, we will check if there is enough snd buffer for current msg and if sent_cnt_removable is not 0. Then try to abandon chunks in sctp_prune_prsctp when sendmsg from the retransmit/transmited queue, and free chunks from out_queue in right order until the abandon+free size > msg_len - sctp_wfree. For the abandon size, we have to wait until it sends FORWARD TSN, receives the sack and the chunks are really freed. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: implement prsctp RTX policyXin Long
prsctp RTX policy is a policy to abandon chunks when they are retransmitted beyond the max count. This patch uses sent_count to count how many times one chunk has been sent, and prsctp_param is the max rtx count, which is from sinfo->sinfo_timetolive in sctp_set_prsctp_policy(). So similar to TTL policy, if RTX policy is enabled, msg->expire_at won't work. Then in sctp_chunk_abandoned, this patch checks if chunk->sent_count is bigger than chunk->prsctp_param to abandon this chunk. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: implement prsctp TTL policyXin Long
prsctp TTL policy is a policy to abandon chunks when they expire at the specific time in local stack. It's similar with expires_at in struct sctp_datamsg. This patch uses sinfo->sinfo_timetolive to set the specific time for TTL policy. sinfo->sinfo_timetolive is also used for msg->expires_at. So if prsctp_enable or TTL policy is not enabled, msg->expires_at still works as before. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: add SCTP_PR_ASSOC_STATUS on sctp sockoptXin Long
This patch adds SCTP_PR_ASSOC_STATUS to sctp sockopt, which is used to dump the prsctp statistics info from the asoc. The prsctp statistics includes abandoned_sent/unsent from the asoc. abandoned_sent is the count of the packets we drop packets from retransmit/transmited queue, and abandoned_unsent is the count of the packets we drop from out_queue according to the policy. Note: another option for prsctp statistics dump described in rfc is SCTP_PR_STREAM_STATUS, which is used to dump the prsctp statistics info from each stream. But by now, linux doesn't yet have per stream statistics info, it needs rfc6525 to be implemented. As the prsctp statistics for each stream has to be based on per stream statistics, we will delay it until rfc6525 is done in linux. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: add SCTP_DEFAULT_PRINFO into sctp sockoptXin Long
This patch adds SCTP_DEFAULT_PRINFO to sctp sockopt. It is used to set/get sctp Partially Reliable Policies' default params, which includes 3 policies (ttl, rtx, prio) and their values. Still, if we set policy params in sndinfo, we will use the params of sndinfo against chunks, instead of the default params. In this patch, we will use 5-8bit of sp/asoc->default_flags to store prsctp policies, and reuse asoc->default_timetolive to store their values. It means if we enable and set prsctp policy, prior ttl timeout in sctp will not work any more. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11sctp: add SCTP_PR_SUPPORTED on sctp sockoptXin Long
According to section 4.5 of rfc7496, prsctp_enable should be per asoc. We will add prsctp_enable to both asoc and ep, and replace the places where it used net.sctp->prsctp_enable with asoc->prsctp_enable. ep->prsctp_enable will be initialized with net.sctp->prsctp_enable, and asoc->prsctp_enable will be initialized with ep->prsctp_enable. We can also modify it's value through sockopt SCTP_PR_SUPPORTED. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11udp: prevent bugcheck if filter truncates packet too muchMichal Kubeček
If socket filter truncates an udp packet below the length of UDP header in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if kernel is configured that way) can be easily enforced by an unprivileged user which was reported as CVE-2016-6162. For a reproducer, see http://seclists.org/oss-sec/2016/q3/8 Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Reported-by: Marco Grassi <marco.gra@gmail.com> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11Merge tag 'batadv-net-for-davem-20160708' of git://git.open-mesh.org/linux-mergeDavid S. Miller
Simon Wunderlich says: ==================== Here are a couple batman-adv bugfix patches, all by Sven Eckelmann: - Fix possible NULL pointer dereference for vlan_insert_tag (two patches) - Fix reference handling in some features, which may lead to reference leaks or invalid memory access (four patches) - Fix speedy join: DHCP packets handled by the gateway feature should be sent with 4-address unicast instead of 3-address unicast to make speedy join work. This fixes/speeds up DHCP assignment for clients which join a mesh for the first time. (one patch) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11netfilter: nf_conntrack_h323: fix off-by-one in DecodeQ931Toby DiPasquale
This patch corrects an off-by-one error in the DecodeQ931 function in the nf_conntrack_h323 module. This error could result in reading off the end of a Q.931 frame. Signed-off-by: Toby DiPasquale <toby@cbcg.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11Merge tag 'ipvs-for-v4.8' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== IPVS Updates for v4.8 please consider these enhancements to the IPVS. This alters the behaviour of the "least connection" schedulers such that pre-established connections are included in the active connection count. This avoids overloading servers when a large number of new connections arrive in a short space of time - e.g. when clients reconnect after a node or network failure. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nf_tables: get rid of possible_net_t from set and basechainPablo Neira Ayuso
We can pass the netns pointer as parameter to the functions that need to gain access to it. From basechains, I didn't find any client for this field anymore so let's remove this too. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nft_ct: make byte/packet expr more friendlyLiping Zhang
If we want to use ct packets expr, and add a rule like follows: # nft add rule filter input ct packets gt 1 counter We will find that no packets will hit it, because nf_conntrack_acct is disabled by default. So It will not work until we enable it manually via "echo 1 > /proc/sys/net/netfilter/nf_conntrack_acct". This is not friendly, so like xt_connbytes do, if the user want to use ct byte/packet expr, enable nf_conntrack_acct automatically. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: physdev: physdev-is-out should not work with OUTPUT chainHangbin Liu
physdev_mt() will check skb->nf_bridge first, which was alloced in br_nf_pre_routing. So if we want to use --physdev-out and physdev-is-out, we need to match it in FORWARD or POSTROUTING chain. physdev_mt_check() only checked physdev-out and missed physdev-is-out. Fix it and update the debug message to make it clearer. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Marcelo R Leitner <marcelo.leitner@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nat: convert nat bysrc hash to rhashtableFlorian Westphal
It did use a fixed-size bucket list plus single lock to protect add/del. Unlike the main conntrack table we only need to add and remove keys. Convert it to rhashtable to get table autosizing and per-bucket locking. The maximum number of entries is -- as before -- tied to the number of conntracks so we do not need another upperlimit. The change does not handle rhashtable_remove_fast error, only possible "error" is -ENOENT, and that is something that can happen legitimetely, e.g. because nat module was inserted at a later time and no src manip took place yet. Tested with http-client-benchmark + httpterm with DNAT and SNAT rules in place. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11Merge tag 'ipvs-fixes2-for-v4.7' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs Simon Horman says: ==================== Second Round of IPVS Fixes for v4.7 The fix from Quentin Armitage allows the backup sync daemon to be bound to a link-local mcast IPv6 address as is already the case for IPv4. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: move nat hlist_head to nf_connFlorian Westphal
The nat extension structure is 32bytes in size on x86_64: struct nf_conn_nat { struct hlist_node bysource; /* 0 16 */ struct nf_conn * ct; /* 16 8 */ union nf_conntrack_nat_help help; /* 24 4 */ int masq_index; /* 28 4 */ /* size: 32, cachelines: 1, members: 4 */ /* last cacheline: 32 bytes */ }; The hlist is needed to quickly check for possible tuple collisions when installing a new nat binding. Storing this in the extension area has two drawbacks: 1. We need ct backpointer to get the conntrack struct from the extension. 2. When reallocation of extension area occurs we need to fixup the bysource hash head via hlist_replace_rcu. We can avoid both by placing the hlist_head in nf_conn and place nf_conn in the bysource hash rather than the extenstion. We can also remove the ->move support; no other extension needs it. Moving the entire nat extension into nf_conn would be possible as well but then we have to add yet another callback for deletion from the bysource hash table rather than just using nat extension ->destroy hook for this. nf_conn size doesn't increase due to aligment, followup patch replaces hlist_node with single pointer. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: conntrack: simplify early_dropFlorian Westphal
We don't need to acquire the bucket lock during early drop, we can use lockless traveral just like ____nf_conntrack_find. The timer deletion serves as synchronization point, if another cpu attempts to evict same entry, only one will succeed with timer deletion. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: nf_ct_helper: unlink helper again when hash resize happenLiping Zhang
From: Liping Zhang <liping.zhang@spreadtrum.com> Similar to ctnl_untimeout, when hash resize happened, we should try to do unhelp from the 0# bucket again. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: cttimeout: unlink timeout obj again when hash resize happenLiping Zhang
Imagine such situation, nf_conntrack_htable_size now is 4096, we are doing ctnl_untimeout, and iterate on 3000# bucket. Meanwhile, another user try to reduce hash size to 2048, then all nf_conn are removed to the new hashtable. When this hash resize operation finished, we still try to itreate ct begin from 3000# bucket, find nothing to do and just return. We may miss unlinking some timeout objects. And later we will end up with invalid references to timeout object that are already gone. So when we find that hash resize happened, try to unlink timeout objects from the 0# bucket again. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11netfilter: conntrack: fix race between nf_conntrack proc read and hash resizeLiping Zhang
When we do "cat /proc/net/nf_conntrack", and meanwhile resize the conntrack hash table via /sys/module/nf_conntrack/parameters/hashsize, race will happen, because reader can observe a newly allocated hash but the old size (or vice versa). So oops will happen like follows: BUG: unable to handle kernel NULL pointer dereference at 0000000000000017 IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack] Call Trace: [<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack] [<ffffffff81261a1c>] seq_read+0x2cc/0x390 [<ffffffff812a8d62>] proc_reg_read+0x42/0x70 [<ffffffff8123bee7>] __vfs_read+0x37/0x130 [<ffffffff81347980>] ? security_file_permission+0xa0/0xc0 [<ffffffff8123cf75>] vfs_read+0x95/0x140 [<ffffffff8123e475>] SyS_read+0x55/0xc0 [<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4 It is very easy to reproduce this kernel crash. 1. open one shell and input the following cmds: while : ; do echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize done 2. open more shells and input the following cmds: while : ; do cat /proc/net/nf_conntrack done 3. just wait a monent, oops will happen soon. The solution in this patch is based on Florian's Commit 5e3c61f98175 ("netfilter: conntrack: fix lookup race during hash resize"). And add a wrapper function nf_conntrack_get_ht to get hash and hsize suggested by Florian Westphal. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11NFC: digital: Fix RTOX supervisor PDU handlingThierry Escande
When the target needs more time to process the received PDU, it sends Response Timeout Extension (RTOX) PDU. When the initiator receives a RTOX PDU, it must reply with a RTOX PDU and extends the current rwt value with the formula: rwt_int = rwt * rtox This patch takes care of the rtox value passed by the target in the RTOX PDU and extends the timeout for the next response accordingly. Signed-off-by: Thierry Escande <thierry.escande@collabora.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11NFC: digital: Add support for NFC DEP Response Waiting TimeThierry Escande
When sending an ATR_REQ, the initiator must wait for the ATR_RES at least 'RWT(nfcdep,activation) + dRWT(nfcdep)' and no more than 'RWT(nfcdep,activation) + dRWT(nfcdep) + dT(nfcdep,initiator)'. This gives a timeout value between 1237 ms and 1337 ms. This patch defines DIGITAL_ATR_RES_RWT to 1337 used for the timeout value of ATR_REQ command. For other DEP PDUs, the initiator must wait between 'RWT + dRWT(nfcdep)' and 'RWT + dRWT(nfcdep) + dT(nfcdep,initiator)' where RWT is given by the following formula: '(256 * 16 / f(c)) * 2^wt' where wt is the value of the TO field in the ATR_RES response and is in the range between 0 and 14. This patch declares a mapping table for wt values and gives RWT max values between 100 ms and 5049 ms. This patch also defines DIGITAL_ATR_RES_TO_WT, the maximum wt value in target mode, to 8. Signed-off-by: Thierry Escande <thierry.escande@collabora.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11NFC: digital: Free supervisor PDUsThierry Escande
This patch frees the RTOX resp sk_buff in initiator mode. It also makes use of the free_resp exit point for ATN supervisor PDUs in both initiator and target mode. Signed-off-by: Thierry Escande <thierry.escande@collabora.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11NFC: digital: Rework ACK PDU handling in initiator modeThierry Escande
With this patch, ACK PDU sk_buffs are now freed and code has been refactored for better errors handling. Signed-off-by: Thierry Escande <thierry.escande@collabora.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11NFC: digital: Fix ACK & NACK PDUs handling in target modeThierry Escande
When the target receives a NACK PDU, it re-sends the last sent PDU. ACK PDUs are received by the target as a reply from the initiator to chained I-PDUs. There are 3 cases to handle: - If the target has previously received 1 or more ATN PDUs and the PNI in the ACK PDU is equal to the target PNI - 1, then it means that the initiator did not received the last issued PDU from the target. In this case it re-sends this PDU. - If the target has received 1 or more ATN PDUs but the ACK PNI is not the target PNI - 1, then this means that this ACK is the reply of the previous chained I-PDU sent by the target. The target did not received it on the first attempt and it is being re-sent by the initiator. The process continues as usual. - No ATN PDU received before this ACK PDU. This is the reply of a chained I-PDU. The target keeps on processing its chained I-PDU. The code has been refactored to avoid too many indentation levels. Also, ACK and NACK PDUs were not freed. This is now fixed. Signed-off-by: Thierry Escande <thierry.escande@collabora.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11NFC: digital: Fix target DEP_REQ I-PDU handling after ATN PDUThierry Escande
When the initiator sends a DEP_REQ I-PDU, the target device may not reply in a timely manner. In this case the initiator device must send an attention PDU (ATN) and if the recipient replies with an ATN PDU in return, then the last I-PDU must be sent again by the initiator. This patch fixes how the target handles I-PDU received after an ATN PDU has been received. There are 2 possible cases: - The target has received the initial DEP_REQ and sends back the DEP_RES but the initiator did not receive it. In this case, after the initiator has sent an ATN PDU and the target replied it (with an ATN as well), the initiator sends the saved skb of the initial DEP_REQ again and the target replies with the saved skb of the initial DEP_RES. - Or the target did not even received the initial DEP_REQ. In this case, after the ATN PDUs exchange, the initiator sends the saved skb and the target simply passes it up, just as usual. This behavior is controlled using the atn_count and the PNI field of the digital device structure. Signed-off-by: Thierry Escande <thierry.escande@collabora.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11NFC: digital: Remove useless call to skb_reserve()Thierry Escande
When allocating chained I-PDUs, there is no need to call skb_reserve() since it's already done by digital_alloc_skb() and contains enough room for the driver head and tail data. Signed-off-by: Thierry Escande <thierry.escande@collabora.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-11NFC: digital: Fix handling of saved PDU sk_buff pointersThierry Escande
This patch fixes the way an I-PDU is saved in case it needs to be sent again. It is now copied using pskb_copy() and not simply referenced using skb_get() since it could be modified by the driver. digital_in_send_saved_skb() and digital_tg_send_saved_skb() still get a reference on the saved skb which is re-sent but release it if the send operation fails. That way the caller doesn't have to take care about skb ref in case of error. RTOX supervisor PDU must not be saved as this can override a previously saved I-PDU that should be re-sent later on. Signed-off-by: Thierry Escande <thierry.escande@collabora.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-07-09dccp: avoid deadlock in dccp_v4_ctl_send_resetEric Dumazet
In the prep work I did before enabling BH while handling socket backlog, I missed two points in DCCP : 1) dccp_v4_ctl_send_reset() uses bh_lock_sock(), assuming BH were blocked. It is not anymore always true. 2) dccp_v4_route_skb() was using __IP_INC_STATS() instead of IP_INC_STATS() A similar fix was done for TCP, in commit 47dcc20a39d0 ("ipv4: tcp: ip_send_unicast_reply() is not BH safe") Fixes: 7309f8821fd6 ("dccp: do not assume DCCP code is non preemptible") Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09ipv6: do not abuse GFP_ATOMIC in inet6_netconf_notify_devconf()Eric Dumazet
All inet6_netconf_notify_devconf() callers are in process context, so we can use GFP_KERNEL allocations if we take care of not holding a rwlock while not needed in ip6mr (we hold RTNL there) Fixes: d67b8c616b48 ("netconf: advertise mc_forwarding status") Fixes: f3a1bfb11ccb ("rtnl/ipv6: use netconf msg to advertise forwarding status") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09ipv4: do not abuse GFP_ATOMIC in inet_netconf_notify_devconf()Eric Dumazet
inet_forward_change() runs with RTNL held. We are allowed to sleep if required. If we use __in_dev_get_rtnl() instead of __in_dev_get_rcu(), we no longer have to use GFP_ATOMIC allocations in inet_netconf_notify_devconf(), meaning we are less likely to miss notifications under memory pressure, and wont touch precious memory reserves either and risk dropping incoming packets. inet_netconf_get_devconf() can also use GFP_KERNEL allocation. Fixes: edc9e748934c ("rtnl/ipv4: use netconf msg to advertise forwarding status") Fixes: 9e5511106f99 ("rtnl/ipv4: add support of RTM_GETNETCONF") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09net: tracepoint napi:napi_poll add work and budgetJesper Dangaard Brouer
An important information for the napi_poll tracepoint is knowing the work done (packets processed) by the napi_poll() call. Add both the work done and budget, as they are related. Handle trace_napi_poll() param change in dropwatch/drop_monitor and in python perf script netdev-times.py in backward compat way, as python fortunately supports optional parameter handling. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09mpls: allow routes on ipip and sit devicesSimon Horman
Allow MPLS routes on IPIP and SIT devices now that they support forwarding MPLS packets. Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09ipip: support MPLS over IPv4Simon Horman
Extend the IPIP driver to support MPLS over IPv4. The implementation is an extension of existing support for IPv4 over IPv4 and is based of multiple inner-protocol support for the SIT driver. Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09sit: support MPLS over IPv4Simon Horman
Extend the SIT driver to support MPLS over IPv4. This implementation extends existing support for IPv6 over IPv4 and IPv4 over IPv4. Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09tunnels: support MPLS over IPv4 tunnelsSimon Horman
Extend tunnel support to MPLS over IPv4. The implementation extends the existing differentiation between IPIP and IPv6 over IPv4 to also cover MPLS over IPv4. Signed-off-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Dinan Gunawardena <dinan.gunawardena@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09net: bridge: extend MLD/IGMP query statsNikolay Aleksandrov
As was suggested this patch adds support for the different versions of MLD and IGMP query types. Since the user visible structure is still in net-next we can augment it instead of adding netlink attributes. The distinction between the different IGMP/MLD query types is done as suggested in Section 7.1, RFC 3376 [1] and Section 8.1, RFC 3810 [2] based on query payload size and code for IGMP. Since all IGMP packets go through multicast_rcv() and it uses ip_mc_check_igmp/ipv6_mc_check_mld we can be sure that at least the ip/ipv6 header can be directly used. [1] https://tools.ietf.org/html/rfc3376#section-7 [2] https://tools.ietf.org/html/rfc3810#section-8.1 Suggested-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>