summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2017-06-16networking: convert many more places to skb_put_zero()Johannes Berg
There were many places that my previous spatch didn't find, as pointed out by yuan linyu in various patches. The following spatch found many more and also removes the now unnecessary casts: @@ identifier p, p2; expression len; expression skb; type t, t2; @@ ( -p = skb_put(skb, len); +p = skb_put_zero(skb, len); | -p = (t)skb_put(skb, len); +p = skb_put_zero(skb, len); ) ... when != p ( p2 = (t2)p; -memset(p2, 0, len); | -memset(p, 0, len); ) @@ type t, t2; identifier p, p2; expression skb; @@ t *p; ... ( -p = skb_put(skb, sizeof(t)); +p = skb_put_zero(skb, sizeof(t)); | -p = (t *)skb_put(skb, sizeof(t)); +p = skb_put_zero(skb, sizeof(t)); ) ... when != p ( p2 = (t2)p; -memset(p2, 0, sizeof(*p)); | -memset(p, 0, sizeof(*p)); ) @@ expression skb, len; @@ -memset(skb_put(skb, len), 0, len); +skb_put_zero(skb, len); Apply it to the tree (with one manual fixup to keep the comment in vxlan.c, which spatch removed.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
The conflicts were two cases of overlapping changes in batman-adv and the qed driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14net: use skb_unref() in napi_consume_skb()Paolo Abeni
The commit 83ada39bb79d ("net: factor out a helper to decrement the skb refcount") provided and used a helper for decrementing skb usage, but I missed at least a spot for it. This change remove some more duplicated code reusing skb_unref() in napi_consume_skb(), too. The helper uses an additional, unneeded unlikely(!skb) test - napi_consume_skb() already check it a few lines above - but the compiler is smart enough to optimize the duplicated test out. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14bpf: permits narrower load from bpf program context fieldsYonghong Song
Currently, verifier will reject a program if it contains an narrower load from the bpf context structure. For example, __u8 h = __sk_buff->hash, or __u16 p = __sk_buff->protocol __u32 sample_period = bpf_perf_event_data->sample_period which are narrower loads of 4-byte or 8-byte field. This patch solves the issue by: . Introduce a new parameter ctx_field_size to carry the field size of narrower load from prog type specific *__is_valid_access validator back to verifier. . The non-zero ctx_field_size for a memory access indicates (1). underlying prog type specific convert_ctx_accesses supporting non-whole-field access (2). the current insn is a narrower or whole field access. . In verifier, for such loads where load memory size is less than ctx_field_size, verifier transforms it to a full field load followed by proper masking. . Currently, __sk_buff and bpf_perf_event_data->sample_period are supporting narrowing loads. . Narrower stores are still not allowed as typical ctx stores are just normal stores. Because of this change, some tests in verifier will fail and these tests are removed. As a bonus, rename some out of bound __sk_buff->cb access to proper field name and remove two redundant "skb cb oob" tests. Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-14dev_ioctl: copy only the smaller struct iwreq for wextJohannes Berg
Unfortunately, struct iwreq isn't a proper subset of struct ifreq, but is still handled by the same code path. Robert reported that then applications may (randomly) fault if the struct iwreq they pass happens to land within 8 bytes of the end of a mapping (the struct is only 32 bytes, vs. struct ifreq's 40 bytes). To fix this, pull out the code handling wireless extension ioctls and copy only the smaller structure in this case. This bug goes back a long time, I tracked that it was introduced into mainline in 2.1.15, over 20 years ago! This fixes https://bugzilla.kernel.org/show_bug.cgi?id=195869 Reported-by: Robert O'Callahan <robert@ocallahan.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2017-06-13net: rps: fix uninitialized symbol warningAshwanth Goli
This patch fixes uninitialized symbol warning that got introduced by the following commit 773fc8f6e8d6 ("net: rps: send out pending IPI's on CPU hotplug") Signed-off-by: Ashwanth Goli <ashwanth@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-12udp: avoid a cache miss on dequeuePaolo Abeni
Since UDP no more uses sk->destructor, we can clear completely the skb head state before enqueuing. Amend and use skb_release_head_state() for that. All head states share a single cacheline, which is not normally used/accesses on dequeue. We can avoid entirely accessing such cacheline implementing and using in the UDP code a specialized skb free helper which ignores the skb head state. This saves a cacheline miss at skb deallocation time. v1 -> v2: replaced secpath_reset() with skb_release_head_state() Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-12net: factor out a helper to decrement the skb refcountPaolo Abeni
The same code is replicated in 3 different places; move it to a common helper. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10bpf: add bpf_set_hash helper for tc progsDaniel Borkmann
Allow for tc BPF programs to set a skb->hash, apart from clearing and triggering a recalc that we have right now. It allows for BPF to implement a custom hashing routine for skb_get_hash(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10bpf: remove cg_skb_func_proto and use sk_filter_func_proto directlyDaniel Borkmann
Since cg_skb_func_proto() doesn't do anything else than just calling into sk_filter_func_proto(), remove it and set sk_filter_func_proto() directly for .get_func_proto callback. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10netns: fix error code when the nsid is already usedNicolas Dichtel
When the user tries to assign a specific nsid, idr_alloc() is called with the range [nsid, nsid+1]. If this nsid is already used, idr_alloc() returns ENOSPC (No space left on device). In our case, it's better to return EEXIST to make it clear that the nsid is not available. CC: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10netns: define extack error msg for nsis cmdsNicolas Dichtel
It helps the user to identify errors. CC: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-09net: rps: send out pending IPI's on CPU hotplugashwanth@codeaurora.org
IPI's from the victim cpu are not handled in dev_cpu_callback. So these pending IPI's would be sent to the remote cpu only when NET_RX is scheduled on the victim cpu and since this trigger is unpredictable it would result in packet latencies on the remote cpu. This patch add support to send the pending ipi's of victim cpu. Signed-off-by: Ashwanth Goli <ashwanth@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-09Fix an intermittent pr_emerg warning about lo becoming free.Krister Johansen
It looks like this: Message from syslogd@flamingo at Apr 26 00:45:00 ... kernel:unregister_netdevice: waiting for lo to become free. Usage count = 4 They seem to coincide with net namespace teardown. The message is emitted by netdev_wait_allrefs(). Forced a kdump in netdev_run_todo, but found that the refcount on the lo device was already 0 at the time we got to the panic. Used bcc to check the blocking in netdev_run_todo. The only places where we're off cpu there are in the rcu_barrier() and msleep() calls. That behavior is expected. The msleep time coincides with the amount of time we spend waiting for the refcount to reach zero; the rcu_barrier() wait times are not excessive. After looking through the list of callbacks that the netdevice notifiers invoke in this path, it appears that the dst_dev_event is the most interesting. The dst_ifdown path places a hold on the loopback_dev as part of releasing the dev associated with the original dst cache entry. Most of our notifier callbacks are straight-forward, but this one a) looks complex, and b) places a hold on the network interface in question. I constructed a new bcc script that watches various events in the liftime of a dst cache entry. Note that dst_ifdown will take a hold on the loopback device until the invalidated dst entry gets freed. [ __dst_free] on DST: ffff883ccabb7900 IF tap1008300eth0 invoked at 1282115677036183 __dst_free rcu_nocb_kthread kthread ret_from_fork Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08skbuff: only inherit relevant tx_flagsWillem de Bruijn
When inheriting tx_flags from one skbuff to another, always apply a mask to avoid overwriting unrelated other bits in the field. The two SKBTX_SHARED_FRAG cases clears all other bits. In practice, tx_flags are zero at this point now. But this is fragile. Timestamp flags are set, for instance, if in tcp_gso_segment, after this clear in skb_segment. The SKBTX_ANY_TSTAMP mask in __skb_tstamp_tx ensures that new skbs do not accidentally inherit flags such as SKBTX_SHARED_FRAG. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08tcp: add TCPMemoryPressuresChrono counterEric Dumazet
DRAM supply shortage and poor memory pressure tracking in TCP stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning limits) and tcp_mem[] quite hazardous. TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl limits being hit, but only tracking number of transitions. If TCP stack behavior under stress was perfect : 1) It would maintain memory usage close to the limit. 2) Memory pressure state would be entered for short times. We certainly prefer 100 events lasting 10ms compared to one event lasting 200 seconds. This patch adds a new SNMP counter tracking cumulative duration of memory pressure events, given in ms units. $ cat /proc/sys/net/ipv4/tcp_mem 3088 4117 6176 $ grep TCP /proc/net/sockstat TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140 $ nstat -n ; sleep 10 ; nstat |grep Pressure TcpExtTCPMemoryPressures 1700 TcpExtTCPMemoryPressuresChrono 5209 v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David instructed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08net: Zero ifla_vf_info in rtnl_fill_vfinfo()Mintz, Yuval
Some of the structure's fields are not initialized by the rtnetlink. If driver doesn't set those in ndo_get_vf_config(), they'd leak memory to user. Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> CC: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08tcp: Namespaceify sysctl_tcp_timestampsEric Dumazet
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07net: Fix inconsistent teardown and release of private netdev state.David S. Miller
Network devices can allocate reasources and private memory using netdev_ops->ndo_init(). However, the release of these resources can occur in one of two different places. Either netdev_ops->ndo_uninit() or netdev->destructor(). The decision of which operation frees the resources depends upon whether it is necessary for all netdev refs to be released before it is safe to perform the freeing. netdev_ops->ndo_uninit() presumably can occur right after the NETDEV_UNREGISTER notifier completes and the unicast and multicast address lists are flushed. netdev->destructor(), on the other hand, does not run until the netdev references all go away. Further complicating the situation is that netdev->destructor() almost universally does also a free_netdev(). This creates a problem for the logic in register_netdevice(). Because all callers of register_netdevice() manage the freeing of the netdev, and invoke free_netdev(dev) if register_netdevice() fails. If netdev_ops->ndo_init() succeeds, but something else fails inside of register_netdevice(), it does call ndo_ops->ndo_uninit(). But it is not able to invoke netdev->destructor(). This is because netdev->destructor() will do a free_netdev() and then the caller of register_netdevice() will do the same. However, this means that the resources that would normally be released by netdev->destructor() will not be. Over the years drivers have added local hacks to deal with this, by invoking their destructor parts by hand when register_netdevice() fails. Many drivers do not try to deal with this, and instead we have leaks. Let's close this hole by formalizing the distinction between what private things need to be freed up by netdev->destructor() and whether the driver needs unregister_netdevice() to perform the free_netdev(). netdev->priv_destructor() performs all actions to free up the private resources that used to be freed by netdev->destructor(), except for free_netdev(). netdev->needs_free_netdev is a boolean that indicates whether free_netdev() should be done at the end of unregister_netdevice(). Now, register_netdevice() can sanely release all resources after ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit() and netdev->priv_destructor(). And at the end of unregister_netdevice(), we invoke netdev->priv_destructor() and optionally call free_netdev(). Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07net: don't call strlen on non-terminated string in dev_set_alias()Alexander Potapenko
KMSAN reported a use of uninitialized memory in dev_set_alias(), which was caused by calling strlcpy() (which in turn called strlen()) on the user-supplied non-terminated string. Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Just some simple overlapping changes in marvell PHY driver and the DSA core code. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06net: sched: introduce a TRAP control actionJiri Pirko
There is need to instruct the HW offloaded path to push certain matched packets to cpu/kernel for further analysis. So this patch introduces a new TRAP control action to TC. For kernel datapath, this action does not make much sense. So with the same logic as in HW, new TRAP behaves similar to STOLEN. The skb is just dropped in the datapath (and virtually ejected to an upper level, which does not exist in case of kernel). Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Yotam Gigi <yotamg@mellanox.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-05devlink: fix potential memort leakHaishuang Yan
We must free allocated skb when genlmsg_put() return fails. Fixes: 1555d204e743 ("devlink: Support for pipeline debug (dpipe)") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflowJason A. Donenfeld
This is a defense-in-depth measure in response to bugs like 4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). There's not only a potential overflow of sglist items, but also a stack overflow potential, so we fix this by limiting the amount of recursion this function is allowed to do. Not actually providing a bounded base case is a future disaster that we can easily avoid here. As a small matter of house keeping, we take this opportunity to move the documentation comment over the actual function the documentation is for. While this could be implemented by using an explicit stack of skbuffs, when implementing this, the function complexity increased considerably, and I don't think such complexity and bloat is actually worth it. So, instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS, and measured the stack usage there. I also reverted the recent MIPS changes that give it a separate IRQ stack, so that I could experience some worst-case situations. I found that limiting it to 24 layers deep yielded a good stack usage with room for safety, as well as being much deeper than any driver actually ever creates. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: David Howells <dhowells@redhat.com> Cc: Sabrina Dubroca <sd@queasysnail.net> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04neigh: Really delete an arp/neigh entry on "ip neigh delete" or "arp -d"Sowmini Varadhan
The command # arp -s 62.2.0.1 a:b:c:d:e:f dev eth2 adds an entry like the following (listed by "arp -an") ? (62.2.0.1) at 0a:0b:0c:0d:0e:0f [ether] PERM on eth2 but the symmetric deletion command # arp -i eth2 -d 62.2.0.1 does not remove the PERM entry from the table, and instead leaves behind ? (62.2.0.1) at <incomplete> on eth2 The reason is that there is a refcnt of 1 for the arp_tbl itself (neigh_alloc starts off the entry with a refcnt of 1), thus the neigh_release() call from arp_invalidate() will (at best) just decrement the ref to 1, but will never actually free it from the table. To fix this, we need to do something like neigh_forced_gc: if the refcnt is 1 (i.e., on the table's ref), remove the entry from the table and free it. This patch refactors and shares common code between neigh_forced_gc and the newly added neigh_remove_one. A similar issue exists for IPv6 Neighbor Cache entries, and is fixed in a similar manner by this patch. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04sock: reset sk_err when the error queue is emptySoheil Hassas Yeganeh
Prior to f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb), sk_err was reset to the error of the skb on the head of the error queue. Applications, most notably ping, are relying on this behavior to reset sk_err for ICMP packets. Set sk_err to the ICMP error when there is an ICMP packet at the head of the error queue. Fixes: f5f99309fa74 (sock: do not set sk_err in sock_dequeue_err_skb) Reported-by: Cyril Hrubis <chrubis@suse.cz> Tested-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04net-procfs: Use vsnprintf extension %phNJoe Perches
Save a bit of code by using the kernel extension. $ size net/core/net-procfs.o* text data bss dec hex filename 3701 120 0 3821 eed net/core/net-procfs.o.new 3764 120 0 3884 f2c net/core/net-procfs.o.old Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04net/flow_dissector: add support for dissection of misc ip header fieldsOr Gerlitz
Add support for dissection of ip tos and ttl and ipv6 traffic-class and hoplimit. Both are dissected into the same struct. Uses similar call to ip dissection function as with tcp, arp and others. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-31bpf: track stack depth of classic bpf programsAlexei Starovoitov
To track stack depth of classic bpf programs we only need to analyze ST|STX instructions, since check_load_and_stores() verifies that programs can load from stack only after write. We also need to change the way cBPF stack slots map to eBPF stack, since typical classic programs are using slots 0 and 1, so they need to map to stack offsets -4 and -8 respectively in order to take advantage of small stack interpreter and JITs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-31rtnetlink: use the new rtnl_get_event() interfaceVlad Yasevich
Small clean-up to rtmsg_ifinfo() to use the rtnl_get_event() interface instead of using 'internal' values directly. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: add extack arg to lwtunnel build stateDavid Ahern
Pass extack arg down to lwtunnel_build_state and the build_state callbacks. Add messages for failures in lwtunnel_build_state, and add the extarg to nla_parse where possible in the build_state callbacks. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-30net: lwtunnel: Add extack to encap attr validationDavid Ahern
Pass extack down to lwtunnel_valid_encap_type and lwtunnel_valid_encap_type_attr. Add messages for unknown or unsupported encap types. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-27rtnl: Add support for netdev event to link messagesVlad Yasevich
When netdev events happen, a rtnetlink_event() handler will send messages for every event in it's white list. These messages contain current information about a particular device, but they do not include the iformation about which event just happened. So, it is impossible to tell what just happend for these events. This patch adds a new extension to RTM_NEWLINK message called IFLA_EVENT that would have an encoding of event that triggered this message. This would allow the the message consumer to easily determine if it needs to perform certain actions. Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Overlapping changes in drivers/net/phy/marvell.c, bug fix in 'net' restricting a HW workaround alongside cleanups in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-26ipv4: add reference counting to metricsEric Dumazet
Andrey Konovalov reported crashes in ipv4_mtu() I could reproduce the issue with KASAN kernels, between 10.246.7.151 and 10.246.7.152 : 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 & 2) At the same time run following loop : while : do ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 done Cong Wang attempted to add back rt->fi in commit 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting") but this proved to add some issues that were complex to solve. Instead, I suggested to add a refcount to the metrics themselves, being a standalone object (in particular, no reference to other objects) I tried to make this patch as small as possible to ease its backport, instead of being super clean. Note that we believe that only ipv4 dst need to take care of the metric refcount. But if this is wrong, this patch adds the basic infrastructure to extend this to other families. Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang for his efforts on this problem. Fixes: 2860583fe840 ("ipv4: Kill rt->fi") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25bpf: add bpf_clone_redirect to bpf_helper_changes_pkt_dataDaniel Borkmann
The bpf_clone_redirect() still needs to be listed in bpf_helper_changes_pkt_data() since we call into bpf_try_make_head_writable() from there, thus we need to invalidate prior pkt regs as well. Fixes: 36bbef52c7eb ("bpf: direct packet write and access for helpers for clsact progs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25net: move somaxconn init from sysctl codeRoman Kapl
The default value for somaxconn is set in sysctl_core_net_init(), but this function is not called when kernel is configured without CONFIG_SYSCTL. This results in the kernel not being able to accept TCP connections, because the backlog has zero size. Usually, the user ends up with: "TCP: request_sock_TCP: Possible SYN flooding on port 7. Dropping request. Check SNMP counters." If SYN cookies are not enabled the connection is rejected. Before ef547f2ac16 (tcp: remove max_qlen_log), the effects were less severe, because the backlog was always at least eight slots long. Signed-off-by: Roman Kapl <roman.kapl@sysgo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-24net: flow_dissector: add support for dissection of tcp flagsJiri Pirko
Add support for dissection of tcp flags. Uses similar function call to tcp dissection function as arp, mpls and others. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-24net: rtnetlink: bail out from rtnl_fdb_dump() on parse errorAlexander Potapenko
rtnl_fdb_dump() failed to check the result of nlmsg_parse(), which led to contents of |ifm| being uninitialized because nlh->nlmsglen was too small to accommodate |ifm|. The uninitialized data may affect some branches and result in unwanted effects, although kernel data doesn't seem to leak to the userspace directly. The bug has been detected with KMSAN and syzkaller. For the record, here is the KMSAN report: ================================================================== BUG: KMSAN: use of unitialized memory in rtnl_fdb_dump+0x5dc/0x1000 CPU: 0 PID: 1039 Comm: probe Not tainted 4.11.0-rc5+ #2727 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 dump_stack+0x143/0x1b0 lib/dump_stack.c:52 kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:1007 __kmsan_warning_32+0x66/0xb0 mm/kmsan/kmsan_instr.c:491 rtnl_fdb_dump+0x5dc/0x1000 net/core/rtnetlink.c:3230 netlink_dump+0x84f/0x1190 net/netlink/af_netlink.c:2168 __netlink_dump_start+0xc97/0xe50 net/netlink/af_netlink.c:2258 netlink_dump_start ./include/linux/netlink.h:165 rtnetlink_rcv_msg+0xae9/0xb40 net/core/rtnetlink.c:4094 netlink_rcv_skb+0x339/0x5a0 net/netlink/af_netlink.c:2339 rtnetlink_rcv+0x83/0xa0 net/core/rtnetlink.c:4110 netlink_unicast_kernel net/netlink/af_netlink.c:1272 netlink_unicast+0x13b7/0x1480 net/netlink/af_netlink.c:1298 netlink_sendmsg+0x10b8/0x10f0 net/netlink/af_netlink.c:1844 sock_sendmsg_nosec net/socket.c:633 sock_sendmsg net/socket.c:643 ___sys_sendmsg+0xd4b/0x10f0 net/socket.c:1997 __sys_sendmsg net/socket.c:2031 SYSC_sendmsg+0x2c6/0x3f0 net/socket.c:2042 SyS_sendmsg+0x87/0xb0 net/socket.c:2038 do_syscall_64+0x102/0x150 arch/x86/entry/common.c:285 entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:246 RIP: 0033:0x401300 RSP: 002b:00007ffc3b0e6d58 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000004002b0 RCX: 0000000000401300 RDX: 0000000000000000 RSI: 00007ffc3b0e6d80 RDI: 0000000000000003 RBP: 00007ffc3b0e6e00 R08: 000000000000000b R09: 0000000000000004 R10: 000000000000000d R11: 0000000000000246 R12: 0000000000000000 R13: 00000000004065a0 R14: 0000000000406630 R15: 0000000000000000 origin: 000000008fe00056 save_stack_trace+0x59/0x60 arch/x86/kernel/stacktrace.c:59 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:352 kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:247 kmsan_poison_shadow+0x6d/0xc0 mm/kmsan/kmsan.c:260 slab_alloc_node mm/slub.c:2743 __kmalloc_node_track_caller+0x1f4/0x390 mm/slub.c:4349 __kmalloc_reserve net/core/skbuff.c:138 __alloc_skb+0x2cd/0x740 net/core/skbuff.c:231 alloc_skb ./include/linux/skbuff.h:933 netlink_alloc_large_skb net/netlink/af_netlink.c:1144 netlink_sendmsg+0x934/0x10f0 net/netlink/af_netlink.c:1819 sock_sendmsg_nosec net/socket.c:633 sock_sendmsg net/socket.c:643 ___sys_sendmsg+0xd4b/0x10f0 net/socket.c:1997 __sys_sendmsg net/socket.c:2031 SYSC_sendmsg+0x2c6/0x3f0 net/socket.c:2042 SyS_sendmsg+0x87/0xb0 net/socket.c:2038 do_syscall_64+0x102/0x150 arch/x86/entry/common.c:285 return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:246 ================================================================== and the reproducer: ================================================================== #include <sys/socket.h> #include <net/if_arp.h> #include <linux/netlink.h> #include <stdint.h> int main() { int sock = socket(PF_NETLINK, SOCK_DGRAM | SOCK_NONBLOCK, 0); struct msghdr msg; memset(&msg, 0, sizeof(msg)); char nlmsg_buf[32]; memset(nlmsg_buf, 0, sizeof(nlmsg_buf)); struct nlmsghdr *nlmsg = nlmsg_buf; nlmsg->nlmsg_len = 0x11; nlmsg->nlmsg_type = 0x1e; // RTM_NEWROUTE = RTM_BASE + 0x0e // type = 0x0e = 1110b // kind = 2 nlmsg->nlmsg_flags = 0x101; // NLM_F_ROOT | NLM_F_REQUEST nlmsg->nlmsg_seq = 0; nlmsg->nlmsg_pid = 0; nlmsg_buf[16] = (char)7; struct iovec iov; iov.iov_base = nlmsg_buf; iov.iov_len = 17; msg.msg_iov = &iov; msg.msg_iovlen = 1; sendmsg(sock, &msg, 0); return 0; } ================================================================== Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21net: allow simultaneous SW and HW transmit timestampingMiroslav Lichvar
Add SOF_TIMESTAMPING_OPT_TX_SWHW option to allow an outgoing packet to be looped to the socket's error queue with a software timestamp even when a hardware transmit timestamp is expected to be provided by the driver. Applications using this option will receive two separate messages from the error queue, one with a software timestamp and the other with a hardware timestamp. As the hardware timestamp is saved to the shared skb info, which may happen before the first message with software timestamp is received by the application, the hardware timestamp is copied to the SCM_TIMESTAMPING control message only when the skb has no software timestamp or it is an incoming packet. While changing sw_tx_timestamp(), inline it in skb_tx_timestamp() as there are no other users. CC: Richard Cochran <richardcochran@gmail.com> CC: Willem de Bruijn <willemb@google.com> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21net: add function to retrieve original skb device using NAPI IDMiroslav Lichvar
Since commit b68581778cd0 ("net: Make skb->skb_iif always track skb->dev") skbs don't have the original index of the interface which received the packet. This information is now needed for a new control message related to hardware timestamping. Instead of adding a new field to skb, we can find the device by the NAPI ID if it is available, i.e. CONFIG_NET_RX_BUSY_POLL is enabled and the driver is using NAPI. Add dev_get_by_napi_id() and also skb_napi_id() to hide the CONFIG_NET_RX_BUSY_POLL ifdef. CC: Richard Cochran <richardcochran@gmail.com> Suggested-by: Willem de Bruijn <willemb@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21net: ethernet: update drivers to handle HWTSTAMP_FILTER_NTP_ALLMiroslav Lichvar
Include HWTSTAMP_FILTER_NTP_ALL in net_hwtstamp_validate() as a valid filter and update drivers which can timestamp all packets, or which explicitly list unsupported filters instead of using a default case, to handle the filter. CC: Richard Cochran <richardcochran@gmail.com> CC: Willem de Bruijn <willemb@google.com> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21net: define receive timestamp filter for NTPMiroslav Lichvar
Add HWTSTAMP_FILTER_NTP_ALL to the hwtstamp_rx_filters enum for timestamping of NTP packets. There is currently only one driver (phyter) that could support it directly. CC: Richard Cochran <richardcochran@gmail.com> CC: Willem de Bruijn <willemb@google.com> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19net: more accurate checksumming in validate_xmit_skb()Davide Caratti
skb_csum_hwoffload_help() uses netdev features and skb->csum_not_inet to determine if skb needs software computation of Internet Checksum or crc32c (or nothing, if this computation can be done by the hardware). Use it in place of skb_checksum_help() in validate_xmit_skb() to avoid corruption of non-GSO SCTP packets having skb->ip_summed equal to CHECKSUM_PARTIAL. While at it, remove references to skb_csum_off_chk* functions, since they are not present anymore in Linux _ see commit cf53b1da73bd ("Revert "net: Add driver helper functions to determine checksum offloadability""). Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19net: use skb->csum_not_inet to identify packets needing crc32cDavide Caratti
skb->csum_not_inet carries the indication on which algorithm is needed to compute checksum on skb in the transmit path, when skb->ip_summed is equal to CHECKSUM_PARTIAL. If skb carries a SCTP packet and crc32c hasn't been yet written in L4 header, skb->csum_not_inet is assigned to 1; otherwise, assume Internet Checksum is needed and thus set skb->csum_not_inet to 0. Suggested-by: Tom Herbert <tom@herbertland.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19sk_buff: remove support for csum_bad in sk_buffDavide Caratti
This bit was introduced with commit 5a21232983aa ("net: Support for csum_bad in skbuff") to reduce the stack workload when processing RX packets carrying a wrong Internet Checksum. Up to now, only one driver and GRO core are setting it. Suggested-by: Tom Herbert <tom@herbertland.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19net: introduce skb_crc32c_csum_helpDavide Caratti
skb_crc32c_csum_help is like skb_checksum_help, but it is designed for checksumming SCTP packets using crc32c (see RFC3309), provided that libcrc32c.ko has been loaded before. In case libcrc32c is not loaded, invoking skb_crc32c_csum_help on a skb results in one the following printouts: warn_crc32c_csum_update: attempt to compute crc32c without libcrc32c.ko warn_crc32c_csum_combine: attempt to compute crc32c without libcrc32c.ko Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-19skbuff: add stub to help computing crc32c on SCTP packetsDavide Caratti
sctp_compute_checksum requires crc32c symbol (provided by libcrc32c), so it can't be used in net core. Like it has been done previously with other symbols (e.g. ipv6_dst_lookup), introduce a stub struct skb_checksum_ops to allow computation of crc32c checksum in net core after sctp.ko (and thus libcrc32c) has been loaded. Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2017-05-18Merge remote-tracking branch 'mauro-exp/docbook3' into death-to-docbookJonathan Corbet
Mauro says: This patch series convert the remaining DocBooks to ReST. The first version was originally send as 3 patch series: [PATCH 00/36] Convert DocBook documents to ReST [PATCH 0/5] Convert more books to ReST [PATCH 00/13] Get rid of DocBook The lsm book was added as if it were a text file under Documentation. The plan is to merge it with another file under Documentation/security, after both this series and a security Documentation patch series gets merged. It also adjusts some Sphinx-pedantic errors/warnings on some kernel-doc markups. I also added some patches here to add PDF output for all existing ReST books.