summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-02-27selftests: mlxsw: Add mlxsw libShalom Toledo
Add mlxsw lib for common defines, helpers etc. Signed-off-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27selftests: devlink_lib: Add devlink port helpersShalom Toledo
Add two devlink port helpers: * devlink port get by netdev * devlink cpu port get Signed-off-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27selftests: devlink_lib: Check devlink info command is supportedShalom Toledo
Sanity check for devlink info command. Signed-off-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27selftests: mlxsw: Add shared buffer configuration testShalom Toledo
Test physical ports' shared buffer configuration options using random values related to a specific configuration option. There are 3 configuration options: pool, TC bind and portpool. Each sub-test, test a different configuration option and random the related values as the follow: * For pools, pool's size will be randomized. * For TC bind, pool number and threshold will be randomized. * For portpools, threshold will be randomized. Signed-off-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27selftests: mlxsw: Use busywait helper in rtnetlink testDanielle Ratson
Rtnetlink test uses offload indication checks. Use a busywait helper and wait until the offload indication is set or fail if it reaches timeout. Signed-off-by: Danielle Ratson <danieller@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27selftests: mlxsw: Use busywait helper in vxlan testDanielle Ratson
Vxlan test uses offload indication checks. Use a busywait helper and wait until the offload indication is set or fail if it reaches timeout. Signed-off-by: Danielle Ratson <danieller@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27selftests: mlxsw: Use busywait helper in blackhole routes testDanielle Ratson
Blackhole routes test uses offload indication checks. Use busywait helper and wait until the routes offload indication is set or fail if it reaches timeout. Signed-off-by: Danielle Ratson <danieller@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27selftests: devlink_trap_l3_drops: Avoid race conditionIdo Schimmel
The test checks that packets are trapped when they should egress a router interface (RIF) that has become disabled. This is a temporary state in a RIF's deletion sequence. Currently, the test deletes the RIF by flushing all the IP addresses configured on the associated netdev (br0). However, this is racy, as this also flushes all the routes pointing to the netdev and if the routes are deleted from the device before the RIF is disabled, then no packets will try to egress the disabled RIF and the trap will not be triggered. Instead, trigger the deletion of the RIF by unlinking the mlxsw port from the bridge that is backing the RIF. Unlike before, this will not cause the kernel to delete the routes pointing to the bridge. Note that due to current mlxsw locking scheme the RIF is always deleted first, but this is going to change. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27selftests: add a mirror test to mlxsw tc flower restrictionsJiri Pirko
Include test of forbidding to have multiple mirror actions. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27selftests: add egress redirect test to mlxsw tc flower restrictionsJiri Pirko
Include test of forbidding to have redirect rule on egress-bound block. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27selftests: mlxsw: Add a RED selftestPetr Machata
This tests that below the queue minimum length, there is no dropping / marking, and above max, everything is dropped / marked. The test is structured as a core file with topology and test code, and three wrappers: one for RED used as a root Qdisc, and two for testing (W)RED under PRIO and ETS. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27selftests: forwarding: lib.sh: Add start_tcp_trafficPetr Machata
Extract a helper __start_traffic() configurable by protocol type. Allow passing through extra mausezahn arguments. Add a wrapper, start_tcp_traffic(). Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27Merge branch 'hinic-BugFixes'David S. Miller
Luo bin says: ==================== hinic: BugFixes the bug fixed in patch #2 has been present since the first commit. the bugs fixed in patch #1 and patch #3 have been present since the following commits: patch #1: 352f58b0d9f2 ("net-next/hinic: Set Rxq irq to specific cpu for NUMA") patch #3: 421e9526288b ("hinic: add rss support") ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27hinic: fix a bug of rss configurationLuo bin
should use real receive queue number to configure hw rss indirect table rather than maximal queue number Signed-off-by: Luo bin <luobin9@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27hinic: fix a bug of setting hw_ioctxtLuo bin
a reserved field is used to signify prime physical function index in the latest firmware version, so we must assign a value to it correctly Signed-off-by: Luo bin <luobin9@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27hinic: fix a irq affinity bugLuo bin
can not use a local variable as an input parameter of irq_set_affinity_hint Signed-off-by: Luo bin <luobin9@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27Merge tag 'docs-5.6-fixes' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation fixes from Jonathan Corbet: "A pair of docs-build fixes" * tag 'docs-5.6-fixes' of git://git.lwn.net/linux: docs: Fix empty parallelism argument docs: remove MPX from the x86 toc
2020-02-27Merge tag 'audit-pr-20200226' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fixes from Paul Moore: "Two fixes for problems found by syzbot: - Moving audit filter structure fields into a union caused some problems in the code which populates that filter structure. We keep the union (that idea is a good one), but we are fixing the code so that it doesn't needlessly set fields in the union and mess up the error handling. - The audit_receive_msg() function wasn't validating user input as well as it should in all cases, we add the necessary checks" * tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: always check the netlink payload length in audit_receive_msg() audit: fix error handling in audit_data_to_entry()
2020-02-26Merge branch 'VLANs-DSA-switches-and-multiple-bridges'David S. Miller
Russell King says: ==================== VLANs, DSA switches and multiple bridges This is a repost of the previously posted RFC back in December, which did not get fully reviewed. I've dropped the RFC tag this time as no one really found anything too problematical in the RFC posting. I've been trying to configure DSA for VLANs and not having much success. The setup is quite simple: - The main network is untagged - The wifi network is a vlan tagged with id $VN running over the main network. I have an Armada 388 Clearfog with a PCIe wifi card which I'm trying to setup to provide wifi access to the vlan $VN network, while the switch is also part of the main network. However, I'm encountering problems: 1) vlan support in DSA has a different behaviour from the Linux software bridge implementation. # bridge vlan port vlan ids lan1 1 PVID Egress Untagged ... shows the default setup - the bridge ports are all configured for vlan 1, untagged egress, and vlan 1 as the port vid. Issuing: # ip li set dev br0 type bridge vlan_filtering 1 with no other vlan configuration commands on a Linux software bridge continues to allow untagged traffic to flow across the bridge. This difference in behaviour is because the MV88E6xxx VTU is completely empty - because net/dsa ignores all vlan settings for a port if br_vlan_enabled(dp->bridge_dev) is false - this reflects the vlan filtering state of the bridge, not whether the bridge is vlan aware. What this means is that attempting to configure the bridge port vlans before enabling vlan filtering works for Linux software bridges, but fails for DSA bridges. 2) Assuming the above is sorted, we move on to the next issue, which is altogether more weird. Let's take a setup where we have a DSA bridge with lan1..6 in a bridge device, br0, with vlan filtering enabled. lan1 is the upstream port, lan2 is a downstream port that also wants to see traffic on vlan id $VN. Both lan1 and lan2 are configured for that: # bridge vlan add vid $VN dev lan1 # bridge vlan add vid $VN dev lan2 # ip li set br0 type bridge vlan_filtering 1 Untagged traffic can now pass between all the six lan ports, and vlan $VN between lan1 and lan2 only. The MV88E6xxx 8021q_mode debugfs file shows all lan ports are in mode "secure" - this is important! /sys/class/net/br0/bridge/vlan_filtering contains 1. tcpdumping from another machine on lan4 shows that no $VN traffic reaches it. Everything seems to be working correctly... In order to further bridge vlan $VN traffic to hostapd's wifi interface, things get a little more complex - we can't add hostapd's wifi interface to br0 directly, because hostapd will bring up the wifi interface and leak the main, untagged traffic onto the wifi. (hostapd does have vlan support, but only as a dynamic per-client thing, and there's no hooks I can see to allow script-based config of the network setup before hostapd up's the wifi interface.) So, what I tried was: # ip li add link br0 name br0.$VN type vlan id $VN # bridge vlan add vid $VN dev br0 self # ip li set dev br0.$VN up So far so good, we get a vlan interface on top of the bridge, and tcpdumping it shows we get traffic. The 8021q_mode file has not changed state. Everything still seems to be correct. # bridge addbr br1 Still nothing has changed. # bridge addif br1 br0.$VN And now the 8021q_mode debugfs file shows that all ports are now in "disabled" mode, but /sys/class/net/br0/bridge/vlan_filtering still contains '1'. In other words, br0 still thinks vlan filtering is enabled, but the hardware has had vlan filtering disabled. Adding some stack traces to an appropriate point indicates that this is because __switchdev_handle_port_attr_set() recurses down through the tree of interfaces, skipping over the vlan interface, applying br1's configuration to br0's ports. This surely can not be right - surely __switchdev_handle_port_attr_set() and similar should stop recursing down through another master bridge device? There are probably other network device classes that switchdev shouldn't recurse down too. I've considered whether switchdev is the right level to do it, and I think it is - as we want the check/set callbacks to be called for the top level device even if it is a master bridge device, but we don't want to recurse through a lower master bridge device. v2: dropped patch 3, since that has an outstanding issue, and my question on it has not been answered. Otherwise, these are the same patches. Maybe we can move forward with just these two? v3: include DSA ports in patch 2 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: dsa: mv88e6xxx: fix duplicate vlan warningRussell King
When setting VLANs on DSA switches, the VLAN is added to both the port concerned as well as the CPU port by dsa_slave_vlan_add(), as well as any DSA ports. If multiple ports are configured with the same VLAN ID, this triggers a warning on the CPU and DSA ports. Avoid this warning for CPU and DSA ports. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: switchdev: do not propagate bridge updates across bridgesRussell King
When configuring a tree of independent bridges, propagating changes from the upper bridge across a bridge master to the lower bridge ports brings surprises. For example, a lower bridge may have vlan filtering enabled. It may have a vlan interface attached to the bridge master, which may then be incorporated into another bridge. As soon as the lower bridge vlan interface is attached to the upper bridge, the lower bridge has vlan filtering disabled. This occurs because switchdev recursively applies its changes to all lower devices no matter what. Reviewed-by: Ido Schimmel <idosch@mellanox.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net/smc: check for valid ib_client_dataKarsten Graul
In smc_ib_remove_dev() check if the provided ib device was actually initialized for SMC before. Reported-by: syzbot+84484ccebdd4e5451d91@syzkaller.appspotmail.com Fixes: a4cf0443c414 ("smc: introduce SMC as an IB-client") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: stmmac: fix notifier registrationAaro Koskinen
We cannot register the same netdev notifier multiple times when probing stmmac devices. Register the notifier only once in module init, and also make debugfs creation/deletion safe against simultaneous notifier call. Fixes: 481a7d154cbb ("stmmac: debugfs entry name is not be changed when udev rename device name.") Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: phy: mscc: fix firmware pathsAntoine Tenart
The firmware paths for the VSC8584 PHYs not not contain the leading 'microchip/' directory, as used in linux-firmware, resulting in an error when probing the driver. This patch fixes it. Fixes: a5afc1678044 ("net: phy: mscc: add support for VSC8584 PHY") Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: qrtr: Fix error pointer vs NULL bugsDan Carpenter
The callers only expect NULL pointers, so returning an error pointer will lead to an Oops. Fixes: 0c2204a4ad71 ("net: qrtr: Migrate nameservice to kernel from userspace") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: phy: mscc: add missing shift for media operation mode selectionAntoine Tenart
This patch adds a missing shift for the media operation mode selection. This does not fix the driver as the current operation mode (copper) has a value of 0, but this wouldn't work for other modes. Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mptcp: add dummy icsk_sync_mss()Paolo Abeni
syzbot noted that the master MPTCP socket lacks the icsk_sync_mss callback, and was able to trigger a null pointer dereference: BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 8e171067 P4D 8e171067 PUD 93fa2067 PMD 0 Oops: 0010 [#1] PREEMPT SMP KASAN CPU: 0 PID: 8984 Comm: syz-executor066 Not tainted 5.6.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ffffc900020b7b80 EFLAGS: 00010246 RAX: 1ffff110124ba600 RBX: 0000000000000000 RCX: ffff88809fefa600 RDX: ffff8880994cdb18 RSI: 0000000000000000 RDI: ffff8880925d3140 RBP: ffffc900020b7bd8 R08: ffffffff870225be R09: fffffbfff140652a R10: fffffbfff140652a R11: 0000000000000000 R12: ffff8880925d35d0 R13: ffff8880925d3140 R14: dffffc0000000000 R15: 1ffff110124ba6ba FS: 0000000001a0b880(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000000a6d6f000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: cipso_v4_sock_setattr+0x34b/0x470 net/ipv4/cipso_ipv4.c:1888 netlbl_sock_setattr+0x2a7/0x310 net/netlabel/netlabel_kapi.c:989 smack_netlabel security/smack/smack_lsm.c:2425 [inline] smack_inode_setsecurity+0x3da/0x4a0 security/smack/smack_lsm.c:2716 security_inode_setsecurity+0xb2/0x140 security/security.c:1364 __vfs_setxattr_noperm+0x16f/0x3e0 fs/xattr.c:197 vfs_setxattr fs/xattr.c:224 [inline] setxattr+0x335/0x430 fs/xattr.c:451 __do_sys_fsetxattr fs/xattr.c:506 [inline] __se_sys_fsetxattr+0x130/0x1b0 fs/xattr.c:495 __x64_sys_fsetxattr+0xbf/0xd0 fs/xattr.c:495 do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x440199 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffcadc19e48 EFLAGS: 00000246 ORIG_RAX: 00000000000000be RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440199 RDX: 0000000020000200 RSI: 00000000200001c0 RDI: 0000000000000003 RBP: 00000000006ca018 R08: 0000000000000003 R09: 00000000004002c8 R10: 0000000000000009 R11: 0000000000000246 R12: 0000000000401a20 R13: 0000000000401ab0 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: CR2: 0000000000000000 Address the issue adding a dummy icsk_sync_mss callback. To properly sync the subflows mss and options list we need some additional infrastructure, which will land to net-next. Reported-by: syzbot+f4dfece964792d80b139@syzkaller.appspotmail.com Fixes: 2303f994b3e1 ("mptcp: Associate MPTCP context with TCP socket") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: ena: fix broken interface between ENA driver and FWArthur Kiyanovski
In this commit we revert the part of commit 1a63443afd70 ("net/amazon: Ensure that driver version is aligned to the linux kernel"), which breaks the interface between the ENA driver and FW. We also replace the use of DRIVER_VERSION with DRIVER_GENERATION when we bring back the deleted constants that are used in interface with ENA device FW. This commit does not change the driver version reported to the user via ethtool, which remains the kernel version. Fixes: 1a63443afd70 ("net/amazon: Ensure that driver version is aligned to the linux kernel") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26Merge branch 'mptcp-update-mptcp-ack-sequence-outside-of-recv-path'David S. Miller
Florian Westphal says: ==================== mptcp: update mptcp ack sequence outside of recv path This series moves mptcp-level ack sequence update outside of the recvmsg path. Current approach has two problems: 1. There is delay between arrival of new data and the time we can ack this data. 2. If userspace doesn't call recv for some time, mptcp ack_seq is not updated at all, even if this data is queued in the subflow socket receive queue. Move skbs from the subflow socket receive queue to the mptcp-level receive queue, updating the mptcp-level ack sequence and have recv take skbs from the mptcp-level receive queue. The first place where we will attempt to update the mptcp level acks is from the subflows' data_ready callback, even before we make userspace aware of new data. Because of possible deadlock (we need to take the mptcp socket lock while already holding the subflow sockets lock), we may still need to defer the mptcp-level ack update. In such case, this work will be either done from work queue or recv path, depending on which runs sooner. In order to avoid pointless scheduling of the work queue, work will be queued from the mptcp sockets lock release callback. This allows to detect when the socket owner did drain the subflow socket receive queue. Please see individual patches for more information. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mptcp: defer work schedule until mptcp lock is releasedPaolo Abeni
Don't schedule the work queue right away, instead defer this to the lock release callback. This has the advantage that it will give recv path a chance to complete -- this might have moved all pending packets from the subflow to the mptcp receive queue, which allows to avoid the schedule_work(). Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mptcp: avoid work queue scheduling if possibleFlorian Westphal
We can't lock_sock() the mptcp socket from the subflow data_ready callback, it would result in ABBA deadlock with the subflow socket lock. We can however grab the spinlock: if that succeeds and the mptcp socket is not owned at the moment, we can process the new skbs right away without deferring this to the work queue. This avoids the schedule_work and hence the small delay until the work item is processed. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mptcp: remove mptcp_read_actorFlorian Westphal
Only used to discard stale data from the subflow, so move it where needed. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mptcp: add rmem queue accountingFlorian Westphal
If userspace never drains the receive buffers we must stop draining the subflow socket(s) at some point. This adds the needed rmem accouting for this. If the threshold is reached, we stop draining the subflows. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mptcp: update mptcp ack sequence from work queueFlorian Westphal
If userspace is not reading data, all the mptcp-level acks contain the ack_seq from the last time userspace read data rather than the most recent in-sequence value. This causes pointless retransmissions for data that is already queued. The reason for this is that all the mptcp protocol level processing happens at mptcp_recv time. This adds work queue to move skbs from the subflow sockets receive queue on the mptcp socket receive queue (which was not used so far). This allows us to announce the correct mptcp ack sequence in a timely fashion, even when the application does not call recv() on the mptcp socket for some time. We still wake userspace tasks waiting for POLLIN immediately: If the mptcp level receive queue is empty (because the work queue is still pending) it can be filled from in-sequence subflow sockets at recv time without a need to wait for the worker. The skb_orphan when moving skbs from subflow to mptcp level is needed, because the destructor (sock_rfree) relies on skb->sk (ssk!) lock being taken. A followup patch will add needed rmem accouting for the moved skbs. Other problem: In case application behaves as expected, and calls recv() as soon as mptcp socket becomes readable, the work queue will only waste cpu cycles. This will also be addressed in followup patches. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mptcp: add work queue skeletonPaolo Abeni
Will be extended with functionality in followup patches. Initial user is moving skbs from subflows receive queue to the mptcp-level receive queue. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mptcp: add and use mptcp_data_ready helperFlorian Westphal
allows us to schedule the work queue to drain the ssk receive queue in a followup patch. This is needed to avoid sending all-to-pessimistic mptcp-level acknowledgements. At this time, the ack_seq is what was last read by userspace instead of the highest in-sequence number queued for reading. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26Merge branch 'mlxsw-Small-driver-update'David S. Miller
Jiri Pirko says: ==================== mlxsw: Small driver update This patchset contains couple of patches not related to each other. They are small optimization and extension changes to the driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mlxsw: spectrum: Add mlxsw_sp_span_ops.buffsize_get for Spectrum-3Petr Machata
The buffer factor on Spectrum-3 is larger than on Spectrum-2. Add a new callback and use it for mlxsw_sp->span_ops on Spectrum-3. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mlxsw: spectrum: Initialize advertised speeds to supported speedsIdo Schimmel
During port initialization the driver instructs the device to only advertise speeds that can be supported by the port's current width. Since the device now returns the supported speeds based on the port's current width, the driver no longer needs to compute the speeds that can be advertised. Simplify port initialization by setting the advertised speeds to the queried supported speeds. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mlxsw: spectrum: Move the ECN-marked packet counter to ethtoolPetr Machata
Spectrum-1 and Spectrum-2 do not have a per-TC counter of number of packets marked by ECN. The value reported currently is the total number of marked packets. Showing this value at individual TC Qdiscs is misleading. Move the counter to ethtool instead. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26mlxsw: spectrum_switchdev: Optimize SFN records processingJiri Pirko
Currently, only one SFN query is done from repetitive work at a time, processing 64 entries. Another work iteration is scheduled in 100ms, that means that the max rate of learned FDB entries is limited to 6400/s. That is slow. Fix this by doing 2 optimizations: 1) Run 10 SFN queries at a time. 2) In case the SFN is not drained, schedule work with 0 delay to allow to continue processing rest of the records. On a testing setup with 500K entries the time to process decreased from 870secs to 10secs. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Tested-by: Alex Kushnarov <alexanderk@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: phy: corrected the return value for genphy_check_and_restart_aneg and ↵Sudheesh Mavila
genphy_c45_check_and_restart_aneg When auto-negotiation is not required, return value should be zero. Changes v1->v2: - improved comments and code as Andrew Lunn and Heiner Kallweit suggestion - fixed issue in genphy_c45_check_and_restart_aneg as Russell King suggestion. Fixes: 2a10ab043ac5 ("net: phy: add genphy_check_and_restart_aneg()") Fixes: 1af9f16840e9 ("net: phy: add genphy_c45_check_and_restart_aneg()") Signed-off-by: Sudheesh Mavila <sudheesh.mavila@amd.com> Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26af_llc: fix if-statement empty body warningRandy Dunlap
When debugging via dprintk() is not enabled, make the dprintk() macro be an empty do-while loop, as is done in <linux/sunrpc/debug.h>. This fixes a gcc warning when -Wextra is set: ../net/llc/af_llc.c:974:51: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] I have verified that there is not object code change (with gcc 7.5.0). Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: netdev@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26slip: not call free_netdev before rtnl_unlock in slip_openyangerkun
As the description before netdev_run_todo, we cannot call free_netdev before rtnl_unlock, fix it by reorder the code. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26Merge tag 'mlx5-updates-2020-02-25' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2020-02-25 The following series provides some misc updates to mlx5 driver: 1) From Maxim, Refactoring for mlx5e netdev channels recreation flow. - Add error handling - Add context to the preactivate hook - Use preactivate hook with context where it can be used and subsequently unify channel recreation flow everywhere. - Fix XPS cpumask to not reset upon channel recreation. 2) From Tariq: - Use indirect calls wrapper on RX. - Check LRO capability bit 3) Multiple small cleanups ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26Merge branch 'net-smc-improve-peer-ID-in-CLC-decline'David S. Miller
Hans Wippel says: ==================== net/smc: improve peer ID in CLC decline The following two patches improve the peer ID in CLC decline messages if RoCE devices are present in the host but no suitable device is found for a connection. The first patch reworks the peer ID initialization. The second patch contains the actual changes of the CLC decline messages. Changes v1 -> v2: * make smc_ib_is_valid_local_systemid() static in first patch * changed if in smc_clc_send_decline() to remove curly braces Changes RFC -> v1: * split the patch into two parts * removed zero assignment to global variable (thanks Leon) Thanks to Leon Romanovsky and Karsten Graul for the feedback! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net/smc: improve peer ID in CLC decline for SMC-RHans Wippel
According to RFC 7609, all CLC messages contain a peer ID that consists of a unique instance ID and the MAC address of one of the host's RoCE devices. But if a SMC-R connection cannot be established, e.g., because no matching pnet table entry is found, the current implementation uses a zero value in the CLC decline message although the host's peer ID is set to a proper value. If no RoCE and no ISM device is usable for a connection, there is no LGR and the LGR check in smc_clc_send_decline() prevents that the peer ID is copied into the CLC decline message for both SMC-D and SMC-R. So, this patch modifies the check to also accept the case of no LGR. Also, only a valid peer ID is copied into the decline message. Signed-off-by: Hans Wippel <ndev@hwipl.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net/smc: rework peer ID handlingHans Wippel
This patch initializes the peer ID to a random instance ID and a zero MAC address. If a RoCE device is in the host, the MAC address part of the peer ID is overwritten with the respective address. Also, a function for checking if the peer ID is valid is added. A peer ID is considered valid if the MAC address part contains a non-zero MAC address. Signed-off-by: Hans Wippel <ndev@hwipl.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26tcp-zerocopy: Update returned getsockopt() optlen.Arjun Roy
TCP receive zerocopy currently does not update the returned optlen for getsockopt() if the user passed in a larger than expected value. Thus, userspace cannot properly determine if all the fields are set in the passed-in struct. This patch sets the optlen for this case before returning, in keeping with the expected operation of getsockopt(). Fixes: c8856c051454 ("tcp-zerocopy: Return inq along with tcp receive zerocopy.") Signed-off-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26ipv6: restrict IPV6_ADDRFORM operationEric Dumazet
IPV6_ADDRFORM is able to transform IPv6 socket to IPv4 one. While this operation sounds illogical, we have to support it. One of the things it does for TCP socket is to switch sk->sk_prot to tcp_prot. We now have other layers playing with sk->sk_prot, so we should make sure to not interfere with them. This patch makes sure sk_prot is the default pointer for TCP IPv6 socket. syzbot reported : BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD a0113067 P4D a0113067 PUD a8771067 PMD 0 Oops: 0010 [#1] PREEMPT SMP KASAN CPU: 0 PID: 10686 Comm: syz-executor.0 Not tainted 5.6.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246 RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40 RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5 R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098 R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000 FS: 00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: inet_release+0x165/0x1c0 net/ipv4/af_inet.c:427 __sock_release net/socket.c:605 [inline] sock_close+0xe1/0x260 net/socket.c:1283 __fput+0x2e4/0x740 fs/file_table.c:280 ____fput+0x15/0x20 fs/file_table.c:313 task_work_run+0x176/0x1b0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_usermode_loop arch/x86/entry/common.c:164 [inline] prepare_exit_to_usermode+0x480/0x5b0 arch/x86/entry/common.c:195 syscall_return_slowpath+0x113/0x4a0 arch/x86/entry/common.c:278 do_syscall_64+0x11f/0x1c0 arch/x86/entry/common.c:304 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x45c429 Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f2ae75dac78 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: 0000000000000000 RBX: 00007f2ae75db6d4 RCX: 000000000045c429 RDX: 0000000000000001 RSI: 000000000000011a RDI: 0000000000000004 RBP: 000000000076bf20 R08: 0000000000000038 R09: 0000000000000000 R10: 0000000020000180 R11: 0000000000000246 R12: 00000000ffffffff R13: 0000000000000a9d R14: 00000000004ccfb4 R15: 000000000076bf2c Modules linked in: CR2: 0000000000000000 ---[ end trace 82567b5207e87bae ]--- RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246 RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40 RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5 R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098 R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000 FS: 00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+1938db17e275e85dc328@syzkaller.appspotmail.com Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>