Age | Commit message | Author
2018-01-10mlxsw: spectrum: Fix typo in firmware upgrade messageIdo Schimmel
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10ipv6: sr: fix TLVs not being copied using setsockoptMathieu Xhonneux
Function ipv6_push_rthdr4 allows adding an IPv6 Segment Routing Header to a socket through setsockopt, but the current implementation doesn't copy possible TLVs at the end of the SRH received from userspace. Therefore, the execution of the following branch if (sr_has_hmac(sr_phdr)) { ... } will never complete since the len and type fields of a possible HMAC TLV are not copied, hence seg6_get_tlv_hmac will return an error, and the HMAC will not be computed. This commit adds a memcpy in case TLVs have been appended to the SRH. Fixes: a149e7c7ce81 ("ipv6: sr: add support for SRH injection through setsockopt") Acked-by: David Lebrun <dlebrun@google.com> Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10ipv6: fix possible mem leaks in ipv6_make_skb()Eric Dumazet
ip6_setup_cork() might return an error, while memory allocations have been done and must be rolled back. Fixes: 6422398c2ab0 ("ipv6: introduce ipv6_make_skb") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Reported-by: Mike Maloney <maloney@google.com> Acked-by: Mike Maloney <maloney@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10Merge branch 'mlxsw-couple-of-fixes'David S. Miller
Jiri Pirko says: ==================== mlxsw: couple of fixes Couple of small fixes for mlxsw driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10mlxsw: spectrum_qdisc: Don't use variable array in mlxsw_sp_tclass_congestion_enable Jiri Pirko
Resolve the sparse warning: "sparse: Variable length array is used." Use 2 arrays for 2 PRM register accesses. Fixes: 96f17e0776c2 ("mlxsw: spectrum: Support RED qdisc offload") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10mlxsw: pci: Wait after reset before accessing HWYuval Mintz
After performing a reset, the driver polls on a HW indication until learning that the reset is done, but immediately after reset the device becomes unresponsive, which might lead to a completion timeout on the first read. Wait for 100ms before starting the polling. Fixes: 233fa44bd67a ("mlxsw: pci: Implement reset done check") Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
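The wait-then-poll pattern described in this commit can be sketched as follows. This is a minimal illustration, not the driver's code: `read_ready`, `timeout` and `interval` are hypothetical stand-ins for the driver's register read and polling parameters; only the 100 ms settle time comes from the commit message.

```python
import time

RESET_SETTLE_SECONDS = 0.1  # the 100 ms wait described in the commit message


def wait_for_reset_done(read_ready, timeout=5.0, interval=0.01,
                        sleep=time.sleep, clock=time.monotonic):
    """Sleep for the settle period, then poll read_ready() until it is True.

    read_ready, timeout and interval are hypothetical; sleep and clock are
    injectable so the function can be exercised without real delays.
    """
    sleep(RESET_SETTLE_SECONDS)  # leave the device alone right after reset
    deadline = clock() + timeout
    while clock() < deadline:
        if read_ready():
            return True
        sleep(interval)
    return False
```

Passing `sleep` and `clock` as parameters is only a testing convenience of this sketch; the point is that the first device access happens after the settle delay rather than immediately after the reset.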
2018-01-10tcp: make local function tcp_recv_timestamp staticWei Yongjun
Fixes the following sparse warning: net/ipv4/tcp.c:1736:6: warning: symbol 'tcp_recv_timestamp' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10net/mlx5e: fix error return code in mlx5e_alloc_rq()Wei Yongjun
Fix to return a negative error code from the xdp_rxq_info_reg() error handling case instead of 0, as done elsewhere in this function. Fixes: 0ddf543226ac ("xdp/mlx5: setup xdp_rxq_info") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10cxgb4vf: Fix SGE FL buffer initialization logic for 64K pagesArjun Vynipadath
We'd come in with SGE_FL_BUFFER_SIZE[0] and [1] both equal to 64KB and the extant logic would flag that as an error. This was already fixed in cxgb4 driver with "92ddcc7 cxgb4: Fix some small bugs in t4_sge_init_soft() when our Page Size is 64KB". Original Work by: Casey Leedom <leedom@chelsio.com> Signed-off-by: Arjun Vynipadath <arjun@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10tuntap: fix for "tuntap: XDP transmission"Stephen Rothwell
Fixes: fc72d1d54dd9 ("tuntap: XDP transmission") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10nfp: always unmask aux interrupts at initJakub Kicinski
The link state and exception interrupts may be masked when we probe. The firmware should in theory prevent sending (and automasking) those interrupts if the device is disabled, but if my reading of the FW code is correct there are firmwares out there with race conditions in this area. The interrupt may also be masked if the previous driver which used the device was malfunctioning and we didn't load the FW (there is no other good way to comprehensively reset the PF). Note that FW unmasks the data interrupts by itself when vNIC is enabled; no such helpful operation is performed for LSC/EXN interrupts. Always unmask the auxiliary interrupts after request_irq(). On the remove path add missing PCI write flush before free_irq(). Fixes: 4c3523623dc0 ("net: add driver for Netronome NFP4000/NFP6000 NIC VFs") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10i40e: track id can be 0Jingjing Wu
track_id == 0 is valid for “read only” profiles when profile does not have any “write” commands. Signed-off-by: Jingjing Wu <jingjing.wu@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40e: change ppp name to ddpJingjing Wu
PPP name was going to be confusing since PPP already means point to point protocol. It is decided to change pipeline personalization profile(ppp) to dynamic device personalization(ddp). Signed-off-by: Jingjing Wu <jingjing.wu@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40evf: Drop i40evf_fire_sw_int as it is prone to racesAlexander Duyck
Having the interrupts firing while we are polling causes extra overhead and isn't needed for most systems out there. If an interrupt is lost, experiencing a 2s latency spike before recovering is still not acceptable and masks the issue. We are better off just identifying systems that lose interrupts and instead enabling workarounds for those systems. To that end I am dropping the code that was strobing the interrupts, as there is a narrow window where having them enabled can actually cause race issues anyway, where a few stray packets might get missed if the interrupt is re-enabled and fires before we call napi_complete. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40evf: Clean-up flags for promisc mode to avoid high polling rateAlexander Duyck
If you enabled and disabled promiscuous mode on a VF you could easily put it into a state where it would start firing interrupts on all queues at a rate of 50+ interrupts per second even though there was no traffic present. The issue seems to have been a stray admin queue feature flag set that was leaving us in a high polling rate for the adminq task. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40evf: Do not clear MSI-X PBA manuallyAlexander Duyck
We should not be clearing the pending bit array for each vector manually. The documentation for the hardware states that when in MSI-X mode the pending bit array will be cleared automatically. Us clearing it ourselves just results in multiple opportunities for us to drop an interrupt. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40e: remove redundant initialization of read_sizeColin Ian King
Variable read_size is initialized and this value is never read, it is instead set inside the do-loop, hence the initialization is redundant and can be removed. Cleans up clang warning: drivers/net/ethernet/intel/i40e/i40e_nvm.c:390:6: warning: Value stored to 'read_size' during its initialization is never read Signed-off-by: Colin Ian King <colin.king@canonical.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40e/i40evf: Bump driver versionsAlice Michael
Bump the i40e driver from 2.1.14 to 2.3.2. Bump the i40evf driver from 3.0.1 to 3.2.2 Signed-off-by: Alice Michael <alice.michael@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40e: add helper conversion function for link_speedJacob Keller
We introduced the virtchnl interface in order to have an interface for talking to a virtual device driver which was host-driver agnostic. This interface has its own definitions, including one for link speed. The host driver has to talk to the virtchnl interface using these new definitions in order to remain compatible. Today, the i40e link_speed enumerations are value-exact matches for the virtchnl interface, so it was originally decided to simply use a typecast. However, this is unsafe, and makes it easier for future drivers to continue this unsafe practice. There is nothing guaranteeing these values are exact, and the type-cast would hide any compiler warning which indicates the problem. Rather than rely on this type cast, introduce a helper function which can convert the AdminQ link speed definition into a virtchnl definition. This can then be used by host driver implementations in order to safely convert to the interface recognized by the virtual functions. If the link speed is not able to be represented by the virtchnl definitions we'll report UNKNOWN which is the safest result. This will ensure that should the driver specific link_speeds actual bit definitions change, we do not report them incorrectly according to the VF. Additionally, this provides a better pattern for future drivers to copy, as it is more likely a future device may not use the exact same bit-wise definition as the current virtchnl interface. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
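The difference between a typecast and an explicit conversion helper can be sketched like this. The enum names and numeric values below are hypothetical and chosen only for illustration; they are not the driver's or the virtchnl interface's real definitions.

```python
from enum import Enum


class AdminqLinkSpeed(Enum):
    """Hypothetical host-driver (AdminQ) link speed values."""
    SPEED_1GB = 0x04
    SPEED_10GB = 0x08
    SPEED_40GB = 0x10
    SPEED_FUTURE = 0x80  # a speed the VF interface has no definition for


class VirtchnlLinkSpeed(Enum):
    """Hypothetical interface-side link speed values."""
    UNKNOWN = 0
    SPEED_1GB = 1
    SPEED_10GB = 2
    SPEED_40GB = 3


# Explicit mapping: nothing relies on the two enums sharing numeric values.
_ADMINQ_TO_VIRTCHNL = {
    AdminqLinkSpeed.SPEED_1GB: VirtchnlLinkSpeed.SPEED_1GB,
    AdminqLinkSpeed.SPEED_10GB: VirtchnlLinkSpeed.SPEED_10GB,
    AdminqLinkSpeed.SPEED_40GB: VirtchnlLinkSpeed.SPEED_40GB,
}


def to_virtchnl_link_speed(speed):
    """Convert a host speed; anything unrepresentable becomes UNKNOWN."""
    return _ADMINQ_TO_VIRTCHNL.get(speed, VirtchnlLinkSpeed.UNKNOWN)
```

With a cast, a change in the host-side values would silently be reported to the VF as a different speed; with the lookup, an unmapped value degrades to UNKNOWN, which is the behavior the commit message describes as the safest result.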
2018-01-10i40e: update VFs of link state after GET_VF_RESOURCESJacob Keller
We currently notify a VF of the link state after ENABLE_QUEUES, which is the last thing a VF does after being configured. Guests may not actually ENABLE_QUEUES until they get configured, and thus between driver load and device configuration the VF may show inaccurate link status. Fix this by also sending the link state after GET_VF_RESOURCES. Although we could remove the message following ENABLE_QUEUES, it's not that significant of a loss, so this patch just keeps both to ensure maximum compatibility with guests on various OSes. Specifically, without this patch guests running FreeBSD will display inaccurate link state until the device is brought up. This is mostly a cosmetic issue but can be confusing to system administrators. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40evf: hold the critical task bit lock while openingJacob Keller
If i40evf_open() is called quickly at the same time as a reset occurs (such as via ethtool) it is possible for the device to attempt to open while a reset is in progress. This occurs because the driver was not holding the critical task bit lock during i40evf_open, nor was it holding it around the call to i40evf_up_complete() in i40evf_reset_task(). We didn't hold the lock previously because calls to i40evf_down() would take the bit lock directly, and this would have caused a deadlock. To avoid this, we'll move the bit lock handling out of i40evf_down() and into the callers of this function. Additionally, we'll now hold the bit lock over the entire set of steps when going up or down, to ensure that we remain consistent. Ultimately this causes us to serialize the transitions between down and up properly, and avoid changing status while we're resetting. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40evf: release bit locks in reverse orderJacob Keller
Although not strictly necessary, it is customary to reverse the order in which we release locks that we acquire. This helps preserve lock ordering during future refactors, which can help avoid potential deadlock situations. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40evf: use spinlock to protect (mac|vlan)_filter_listJacob Keller
Stop overloading the __I40EVF_IN_CRITICAL_TASK bit lock to protect the mac_filter_list and vlan_filter_list. Instead, implement a spinlock to protect these two lists, similar to how we protect the hash in the i40e PF code. Ensure that every place where we access the list uses the spinlock to ensure consistency, and stop holding the critical section around blocks of code which only need access to the macvlan filter lists. This refactor helps simplify the locking behavior, and is necessary as a future refactor to the __I40EVF_IN_CRITICAL_TASK would cause a deadlock otherwise. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40evf: don't rely on netif_running() outside rtnl_lock()Jacob Keller
In i40evf_reset_task we use netif_running() to determine whether or not the device is currently up. This allows us to properly free queue memory and shut down things before we request the hardware reset. It turns out that we have no guarantee that netif_running() returns false until the device is fully up, as the kernel core code sets __LINK_STATE_START prior to calling .ndo_open. Since we're not holding the rtnl_lock(), it's possible that the driver's i40evf_open handler function is currently being called while we're resetting. We can't simply hold the rtnl_lock() while checking netif_running() as this could cause a deadlock with the i40evf_open() function. Additionally, we can't avoid the deadlock by holding the rtnl_lock() over the whole reset path, as this essentially serializes all resets, and can cause massive delays if we have multiple VFs on a system. Instead, let's just check our own internal __I40EVF_RUNNING state field. This allows us to ensure that the state is correct and is only set after we've finished bringing the device up. Without this change we might free data structures about device queues and other memory before they've been fully allocated. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10i40e: display priority_xon and priority_xoff statsAlice Michael
Display some more stats that were already being counted, to help users understand when priority xon/xoff packets are being sent/received Signed-off-by: Alice Michael <alice.michael@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-10net: fix xdp_rxq_info build issue when CONFIG_SYSFS is not setJesper Dangaard Brouer
The commit e817f85652c1 ("xdp: generic XDP handling of xdp_rxq_info") removed some ifdef CONFIG_SYSFS in net/core/dev.c, but forgot to remove the corresponding ifdef's in include/linux/netdevice.h. Fixes: e817f85652c1 ("xdp: generic XDP handling of xdp_rxq_info") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10net: phy: marvell: mv88e6390 temperature sensor readingAndrew Lunn
The internal PHYs in the mv88e6390 switch have a temperature sensor. It uses a different register layout from the other PHYs currently supported. It also has an erratum, in that some reads of the sensor result in bad values. So a number of reads need to be made, and the average taken. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
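The read-several-times-and-average workaround can be sketched as below. The function name, the `read_sensor` callback and the default sample count are assumptions made for illustration; the commit message only states that several reads are made and averaged.

```python
def read_average_temperature(read_sensor, samples=8):
    """Average several raw sensor reads to dampen occasional bad values.

    read_sensor is a hypothetical callable returning one reading; the
    sample count of 8 is an assumption, not taken from the commit message.
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    readings = [read_sensor() for _ in range(samples)]
    return sum(readings) / len(readings)
```

Averaging does not discard a bad read, it only reduces its weight, so the result is a smoothed value rather than an exact one.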
2018-01-108021q: fix a memory leak for VLAN 0 deviceCong Wang
A vlan device with vid 0 is allowed to be created but cannot be fully cleaned up by unregister_vlan_dev(), which checks for vlan_id!=0. Also, VLAN 0 is probably not a valid number and it is kinda "reserved" for HW accelerating devices, but it is probably too late to reject it from creation even if it makes sense. Instead, just remove the check in unregister_vlan_dev(). Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: ad1afb003939 ("vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)") Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10Merge branch 'net-create-dynamic-software-irq-moderation-library'David S. Miller
Andy Gospodarek says: ==================== net: create dynamic software irq moderation library This converts the dynamic interrupt moderation library from the mlx5e driver into a library so it can be used by any driver. The penultimate patch in this set adds support for this new dynamic interrupt moderation library in the bnxt_en driver and the last patch creates an entry in the MAINTAINERS file for this library. The main purpose of this code is to allow an administrator to make sure that default coalesce settings are optimized for low latency, but quickly adapt to handle high throughput/bulk traffic by altering how much time passes before popping an interrupt. For any new driver the following changes would be needed to use this library: - add elements in ring struct to track items needed by this library - create function that can be called to actually set coalesce settings for the driver Credit to Rob Rice and Lee Reed for doing some of the initial proof of concept and testing for this patch and Tal Gilboa and Or Gerlitz for their comments, etc on this set. v4: Fix build breakage for VF representers noticed by kbuild test robot. Thanks for being so courteous, kbuild test robot! v3: bnxt_en fix from Michael Chan, comment suggestion from Vasundhara Volam, and small mlx5e header file fix from Tal Gilboa. v2: Spelling fixes from Stephen Hemminger, bnxt_en suggestions from Michael Chan, spelling and formatting fixes from Or Gerlitz, and spelling and mlx5e changes suggested by Tal Gilboa. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10MAINTAINERS: add entry for Dynamic Interrupt ModerationAndy Gospodarek
Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10bnxt_en: add support for software dynamic interrupt moderationAndy Gospodarek
This implements the changes needed for the bnxt_en driver to add support for dynamic interrupt moderation per ring. This does add additional counters in the receive path, but testing shows that any additional instructions are offset by throughput gain when the default configuration is for low latency. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10net/dim: use struct net_dim_sample as arg to net_dimAndy Gospodarek
Simplify the arguments to net_dim() by formatting them into a struct net_dim_sample before calling the function. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Suggested-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10net/mlx5e: Move dynamic interrupt coalescing code to include/linuxAndy Gospodarek
This move allows drivers to add private structure elements to track the number of packets, bytes, and interrupt events per ring. A driver also defines a workqueue handler to act on this collected data once per poll and modify the coalescing parameters per ring. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10net/mlx5e: Change Mellanox references in DIM codeAndy Gospodarek
Change all appropriate mlx5_am* and MLX5_AM* references to net_dim and NET_DIM, respectively, in code that handles dynamic interrupt moderation. Also change all references from 'am' to 'dim' when used as local variables and add generic profile references. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10net/mlx5e: Move generic functions to new fileAndy Gospodarek
These functions were identified as ones that could be made generic and used by multiple drivers. Most of the contents of en_rx_am.c are moved to net_dim.c. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10net/mlx5e: Move AM logic enumsAndy Gospodarek
More movement to help make this code more generic. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10net/mlx5e: Remove rq references in mlx5e_rx_amAndy Gospodarek
This makes mlx5e_am_sample more generic so that it can be called easily from a driver that does not use the same data structure to store these values in a single structure. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10net/mlx5e: Move interrupt moderation forward declarationsAndy Gospodarek
Move these to newly created file to prepare to move these functions to a library. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10net/mlx5e: Move interrupt moderation structs to new fileAndy Gospodarek
Create new header file to prepare to move code that handles irq moderation to a library that lives in a header file. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10Merge branch 'ipv6-Add-support-for-non-equal-cost-multipath'David S. Miller
Ido Schimmel says: ==================== ipv6: Add support for non-equal-cost multipath This set aims to add support for IPv6 non-equal-cost multipath routes. The first three patches convert multipath selection to use the hash-threshold method (RFC 2992) instead of modulo-N. The same method is employed by the IPv4 routing code since commit 0e884c78ee19 ("ipv4: L3 hash-based multipath"). Unlike modulo-N, with hash-threshold only the flows near the region boundaries are affected when a nexthop is added or removed. In addition, it allows us to easily add support for non-equal-cost multipath in the last patch by sizing the different regions according to the provided weights. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10ipv6: Add support for non-equal-cost multipathIdo Schimmel
The use of hash-threshold instead of modulo-N makes it trivial to add support for non-equal-cost multipath. Instead of dividing the multipath hash function's output space equally between the nexthops, each nexthop is assigned a region size which is proportional to its weight. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
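Sizing regions by weight can be sketched as follows. This is an illustrative sketch, assuming a 31-bit hash output space and round-to-nearest division; the function names are hypothetical and the code is not the kernel implementation.

```python
HASH_SPACE_BITS = 31  # assumed size of the multipath hash output space


def round_div(numerator, denominator):
    """Integer division rounded to the nearest integer."""
    return (numerator + denominator // 2) // denominator


def weighted_upper_bounds(weights):
    """Return the inclusive upper bound of each nexthop's hash region.

    weights is a list of positive integers; each region's size is
    proportional to its weight.
    """
    total = sum(weights)
    bounds = []
    cumulative = 0
    for weight in weights:
        cumulative += weight
        bounds.append(round_div(cumulative << HASH_SPACE_BITS, total) - 1)
    return bounds
```

With weights 1 and 3 the first nexthop owns a quarter of the hash space and the second owns the rest, so about three out of four flows land on the second nexthop. Equal weights reduce to the equal-cost split.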
2018-01-10ipv6: Use hash-threshold instead of modulo-NIdo Schimmel
Now that each nexthop stores its region boundary in the multipath hash function's output space, we can use hash-threshold instead of modulo-N in multipath selection. This reduces the number of checks we need to perform during lookup, as dead and linkdown nexthops are assigned a negative region boundary. In addition, in contrast to modulo-N, only flows near region boundaries are affected when a nexthop is added or removed. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
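Selection by hash-threshold can be sketched as a scan for the first nexthop whose region boundary is at or above the flow's hash. The function name and the example boundaries are hypothetical; the sketch assumes non-negative hash values and boundaries sorted in nexthop order.

```python
def select_nexthop(upper_bounds, flow_hash):
    """Return the index of the first nexthop whose upper bound covers the hash.

    A negative bound marks a dead or linkdown nexthop. Because flow_hash is
    never negative, such a nexthop is skipped by the same comparison and
    needs no separate state check.
    """
    for index, bound in enumerate(upper_bounds):
        if flow_hash <= bound:
            return index
    return None  # no usable nexthop


# Example boundaries: nexthop 1 is inactive, nexthops 0 and 2 split the space.
EXAMPLE_BOUNDS = [2**30 - 1, -1, 2**31 - 1]
```

A flow keeps its nexthop as long as its hash stays inside the same region, which is why only flows near a moved boundary are remapped when the set of nexthops changes.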
2018-01-10ipv6: Use a 31-bit multipath hashIdo Schimmel
The hash thresholds assigned to IPv6 nexthops are in the range of [-1, 2^31 - 1], where a negative value is assigned to nexthops that should not be considered during multipath selection. Therefore, in a similar fashion to IPv4, we need to use the upper 31-bits of the multipath hash for multipath selection. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
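Taking the upper 31 bits of a 32-bit hash amounts to a one-bit right shift, as in this small sketch. The function name is hypothetical and the sketch assumes the input is an unsigned 32-bit value.

```python
def multipath_hash_31(hash32):
    """Keep the upper 31 bits of a 32-bit hash.

    The result lies in [0, 2**31 - 1], so it can be compared against
    signed thresholds in which -1 means "never select this nexthop".
    """
    return (hash32 & 0xFFFFFFFF) >> 1
```

Dropping the lowest bit halves the output space but leaves the distribution of the remaining bits unchanged.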
2018-01-10ipv6: Calculate hash thresholds for IPv6 nexthopsIdo Schimmel
Before we convert IPv6 to use hash-threshold instead of modulo-N, we first need each nexthop to store its region boundary in the hash function's output space. The boundary is calculated by dividing the output space equally between the different active nexthops. That is, nexthops that are not dead or linkdown. The boundaries are rebalanced whenever a nexthop is added or removed to a multipath route and whenever a nexthop becomes active or inactive. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
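The equal split among active nexthops can be sketched as below. This is an illustration under assumptions: a 31-bit hash space, round-to-nearest division, and -1 as the boundary for inactive nexthops; the function names are hypothetical.

```python
HASH_SPACE = 1 << 31  # assumed size of the multipath hash output space


def round_div(numerator, denominator):
    """Integer division rounded to the nearest integer."""
    return (numerator + denominator // 2) // denominator


def equal_upper_bounds(active):
    """Return one inclusive upper bound per nexthop.

    active is a list of booleans (False for dead or linkdown nexthops).
    Active nexthops share the hash space equally; inactive ones get -1.
    """
    total = sum(1 for is_active in active if is_active)
    bounds = []
    seen = 0
    for is_active in active:
        if not is_active:
            bounds.append(-1)
            continue
        seen += 1
        bounds.append(round_div(seen * HASH_SPACE, total) - 1)
    return bounds
```

Rebalancing is then just a matter of calling the function again whenever a nexthop is added, removed, or changes state.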
2018-01-10Merge tag 'wireless-drivers-for-davem-2018-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers David S. Miller
Kalle Valo says: ==================== wireless-drivers fixes for 4.15 Hopefully the last set of fixes for 4.15. iwlwifi * fix DMA mapping regression since v4.14 wcn36xx * fix dynamic power save which has been broken since the driver was committed ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10of_mdio: avoid MDIO bus removal when a PHY is missingMadalin Bucur
If one of the child devices is missing the of_mdiobus_register_phy() call will return -ENODEV. When a missing device is encountered the registration of the remaining PHYs is stopped and the MDIO bus will fail to register. Propagate all errors except ENODEV to avoid it. Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
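The error-handling rule described here, skip a missing PHY but fail on anything else, can be sketched as follows. `register_child_phys` and the `register_phy` callback are hypothetical names; the callback is assumed to return 0 on success or a negative errno value, mirroring the kernel convention.

```python
import errno


def register_child_phys(children, register_phy):
    """Register every child PHY, tolerating only missing devices.

    Returns (0, registered_children) on success, or
    (negative_errno, registered_so_far) on the first real error.
    """
    registered = []
    for child in children:
        rc = register_phy(child)
        if rc == -errno.ENODEV:
            continue  # PHY not present: keep going with the remaining ones
        if rc != 0:
            return rc, registered  # any other error is still propagated
        registered.append(child)
    return 0, registered
```

The bus therefore registers with whatever PHYs are actually present instead of being torn down because one child did not respond.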
2018-01-10caif_usb: use strlcpy() instead of strncpy()Xiongfeng Wang
gcc-8 reports net/caif/caif_usb.c: In function 'cfusbl_device_notify': ./include/linux/string.h:245:9: warning: '__builtin_strncpy' output may be truncated copying 15 bytes from a string of length 15 [-Wstringop-truncation] The compiler requires that the input param 'len' of strncpy() be greater than the length of the src string, so that '\0' is copied as well. We can just use strlcpy() to avoid this warning. Signed-off-by: Xiongfeng Wang <xiongfeng.wang@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10vhost_net: batch used ring update in rxJason Wang
This patch batches used ring updates during RX. This suits the case where the guest is much faster (e.g. a dpdk based backend). In this case, the used ring is almost empty: - we may get serious cache line misses/contention on both the used ring and used idx. - at most 1 packet can be dequeued at a time, so batching in the guest has little effect. Updating the used ring in a batch helps since the guest won't access the used ring until used idx has advanced by several descriptors, and since we advance the used ring every N packets, the guest only needs to access used idx every N packets because it can cache the used idx. To interact better with both batch dequeuing and dpdk batching, VHOST_RX_BATCH is used as the maximum number of descriptors that can be batched. Tests were done between two machines with 2.40GHz Intel(R) Xeon(R) CPU E5-2630 connected back to back through ixgbe. Traffic was generated on one remote ixgbe through MoonGen, and RX pps was measured through testpmd in the guest while doing xdp_redirect_map from the local ixgbe to tap. RX pps increased from 3.05 Mpps to 4.00 Mpps (about 31% improvement). One possible concern is the implications for TCP (especially latency sensitive workloads). Result[1] does not show obvious changes for most of the netperf tests (RR, TX, and RX), and we do get some improvements for RX at some specific sizes.
Guest RX:
size/sessions/+thu%/+normalize%
64/ 1/ +2%/ +2%
64/ 2/ +2%/ -1%
64/ 4/ +1%/ +1%
64/ 8/ 0%/ 0%
256/ 1/ +6%/ -3%
256/ 2/ -3%/ +2%
256/ 4/ +11%/ +11%
256/ 8/ 0%/ 0%
512/ 1/ +4%/ 0%
512/ 2/ +2%/ +2%
512/ 4/ 0%/ -1%
512/ 8/ -8%/ -8%
1024/ 1/ -7%/ -17%
1024/ 2/ -8%/ -7%
1024/ 4/ +1%/ 0%
1024/ 8/ 0%/ 0%
2048/ 1/ +30%/ +14%
2048/ 2/ +46%/ +40%
2048/ 4/ 0%/ 0%
2048/ 8/ 0%/ 0%
4096/ 1/ +23%/ +22%
4096/ 2/ +26%/ +23%
4096/ 4/ 0%/ +1%
4096/ 8/ 0%/ 0%
16384/ 1/ -2%/ -3%
16384/ 2/ +1%/ -4%
16384/ 4/ -1%/ -3%
16384/ 8/ 0%/ -1%
65535/ 1/ +15%/ +7%
65535/ 2/ +4%/ +7%
65535/ 4/ 0%/ +1%
65535/ 8/ 0%/ 0%

TCP_RR:
size/sessions/+thu%/+normalize%
1/ 1/ 0%/ +1%
1/ 25/ +2%/ +1%
1/ 50/ +4%/ +1%
64/ 1/ 0%/ -4%
64/ 25/ +2%/ +1%
64/ 50/ 0%/ -1%
256/ 1/ 0%/ 0%
256/ 25/ 0%/ 0%
256/ 50/ +4%/ +2%

Guest TX:
size/sessions/+thu%/+normalize%
64/ 1/ +4%/ -2%
64/ 2/ -6%/ -5%
64/ 4/ +3%/ +6%
64/ 8/ 0%/ +3%
256/ 1/ +15%/ +16%
256/ 2/ +11%/ +12%
256/ 4/ +1%/ 0%
256/ 8/ +5%/ +5%
512/ 1/ -1%/ -6%
512/ 2/ 0%/ -8%
512/ 4/ -2%/ +4%
512/ 8/ +6%/ +9%
1024/ 1/ +3%/ +1%
1024/ 2/ +3%/ +9%
1024/ 4/ 0%/ +7%
1024/ 8/ 0%/ +7%
2048/ 1/ +8%/ +2%
2048/ 2/ +3%/ -1%
2048/ 4/ -1%/ +11%
2048/ 8/ +3%/ +9%
4096/ 1/ +8%/ +8%
4096/ 2/ 0%/ -7%
4096/ 4/ +4%/ +4%
4096/ 8/ +2%/ +5%
16384/ 1/ -3%/ +1%
16384/ 2/ -1%/ -12%
16384/ 4/ -1%/ +5%
16384/ 8/ 0%/ +1%
65535/ 1/ 0%/ -3%
65535/ 2/ +5%/ +16%
65535/ 4/ +1%/ +2%
65535/ 8/ +1%/ -1%

Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
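The batching idea from the vhost_net commit message above can be sketched with a toy used ring. This is a simplified model, not vhost code: the class, its fields and the batch size used in the example are hypothetical, and real used rings live in shared memory with memory barriers that are not modelled here.

```python
class BatchedUsedRing:
    """Toy model of a used ring whose index is published once per batch."""

    def __init__(self, batch_size):
        self.batch_size = batch_size  # stand-in for a limit like VHOST_RX_BATCH
        self.pending = []             # completed descriptors not yet published
        self.ring = []                # entries visible to the guest
        self.used_idx = 0             # index the guest polls
        self.idx_updates = 0          # how often used_idx was written

    def add_used(self, descriptor):
        """Queue a completed descriptor and publish when the batch is full."""
        self.pending.append(descriptor)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Publish all pending descriptors with a single index update."""
        if not self.pending:
            return
        self.ring.extend(self.pending)
        self.used_idx += len(self.pending)
        self.pending.clear()
        self.idx_updates += 1
```

For ten packets and a batch size of four, the index is written three times (two full batches plus a final flush) instead of ten, which is the reduction in shared cache line traffic the commit message is after.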
2018-01-10doc: clarification about setting SO_ZEROCOPYKornilios Kourtis
Signed-off-by: Kornilios Kourtis <kou@zurich.ibm.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10Merge tag 'mlx5-updates-2018-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux David S. Miller
mlx5-updates-2018-01-08 Four patches from Or that add Hairpin support to mlx5: =========================================================== From: Or Gerlitz <ogerlitz@mellanox.com> We refer to the ability of NIC HW to fwd a packet received on one port to the other port (also from a port to itself) as hairpin. The application API is based on ingress tc/flower rules set on the NIC with the mirred redirect action. Other actions can apply to packets during the redirect. Hairpin allows offloading the data-path of various SW DDoS gateways, load-balancers, etc. to HW. Packets go through all the required processing in HW (header re-write, encap/decap, push/pop vlan) and are then forwarded; the CPU stays at practically zero usage. HW Flow counters are used by the control plane for monitoring and accounting. Hairpin is implemented by pairing a receive queue (RQ) to a send queue (SQ). All the flows that share <recv NIC, mirred NIC> are redirected through the same hairpin pair. Currently, only header-rewrite is supported as a packet modification action. I'd like to thank Elijah Shakkour <elijahs@mellanox.com> for implementing this functionality on the HW simulator before it was available in the FW, so the driver code could be tested early. =========================================================== From Feras, three patches that provide very small changes that allow IPoIB to support RX timestamping for child interfaces, simply by hooking the mlx5e timestamping PTP ioctl to the IPoIB child interface netdev profile. One patch from Gal to fix a spelling mistake. Two patches from Eugenia add drop counters to VF statistics, to be reported as part of VF statistics in netlink (iproute2), and implement them in mlx5 eswitch. Signed-off-by: David S. Miller <davem@davemloft.net>