summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-02-26net/smc: fix cleanup for linkgroup setup failuresUrsula Braun
If an SMC connection to a certain peer is setup the first time, a new linkgroup is created. In case of setup failures, such a linkgroup is unusable and should disappear. As a first step the linkgroup is removed from the linkgroup list in smc_lgr_forget(). There are 2 problems: smc_listen_decline() might be called before linkgroup creation resulting in a crash due to calling smc_lgr_forget() with parameter NULL. If a setup failure occurs after linkgroup creation, the connection is never unregistered from the linkgroup, preventing linkgroup freeing. This patch introduces an enhanced smc_lgr_cleanup_early() function which * contains a linkgroup check for early smc_listen_decline() invocations * invokes smc_conn_free() to guarantee unregistering of the connection. * schedules fast linkgroup removal of the unusable linkgroup And the unused function smcd_conn_free() is removed from smc_core.h. Fixes: 3b2dec2603d5b ("net/smc: restructure client and server code in af_smc") Fixes: 2a0674fffb6bc ("net/smc: improve abnormal termination of link groups") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26Merge branch 'net-fix-sysfs-permssions-when-device-changes-network'David S. Miller
Christian Brauner says: ==================== net: fix sysfs permssions when device changes network /* v7 */ This is v7 with a build warning fixup that slipped past me the last time. It removes to unused variables in sysfs_group_change_owner(). I observed no warning when building just now. /* v6 */ This is v6 with two small fixups. I missed adapting the commit message to reflect the renamed helper for changing the owner of sysfs files and I also forgot to make the new dpm helper static inline. /* v5 */ This is v5 with a small fixup requested by Rafael. /* v4 */ This is v4 with more documentation and other fixes that Greg requested. /* v3 */ This is v3 with explicit uid and gid parameters added to functions that change sysfs object ownership as Greg requested. (I've tagged this with net-next since it's triggered by a bug for network device files but it also touches driver core aspects so it's not clear-cut. I can of course split this series into separate patchsets.) We have been struggling with a bug surrounding the ownership of network device sysfs files when moving network devices between network namespaces owned by different user namespaces reported by multiple users. Currently, when moving network devices between network namespaces the ownership of the corresponding sysfs entries is not changed. This leads to problems when tools try to operate on the corresponding sysfs files. I also causes a bug when creating a network device in a network namespaces owned by a user namespace and moving that network device back to the host network namespaces. Because when a network device is created in a network namespaces it will be owned by the root user of the user namespace and all its associated sysfs files will also be owned by the root user of the corresponding user namespace. If such a network device has to be moved back to the host network namespace the permissions will still be set to the root user of the owning user namespaces of the originating network namespace. This means unprivileged users can e.g. re-trigger uevents for such incorrectly owned devices on the host or in other network namespaces. They can also modify the settings of the device itself through sysfs when they wouldn't be able to do the same through netlink. Both of these things are unwanted. For example, quite a few workloads will create network devices in the host network namespace. Other tools will then proceed to move such devices between network namespaces owner by other user namespaces. While the ownership of the device itself is updated in net/core/net-sysfs.c:dev_change_net_namespace() the corresponding sysfs entry for the device is not. Below you'll find that moving a network device (here a veth device) from a network namespace into another network namespaces owned by a different user namespace with a different id mapping. As you can see the permissions are wrong even though it is owned by the userns root user after it has been moved and can be interacted with through netlink: drwxr-xr-x 5 nobody nobody 0 Jan 25 18:08 . drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 .. -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_assign_type -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_len -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 address -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 broadcast -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_changes -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_down_count -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_up_count -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_id -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_port -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dormant -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 duplex -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 flags -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 gro_flush_timeout -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifalias -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifindex -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 iflink -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 link_mode -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 mtu -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 name_assign_type -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 netdev_group -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 operstate -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_id -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_name -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_switch_id drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 power -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 proto_down drwxr-xr-x 4 nobody nobody 0 Jan 25 18:09 queues -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 speed drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 statistics lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:08 subsystem -> ../../../../class/net -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 tx_queue_len -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 type -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:08 uevent Constrast this with creating a device of the same type in the network namespace directly. In this case the device's sysfs permissions will be correctly updated. (Please also note, that in a lot of workloads this strategy of creating the network device directly in the network device to workaround this issue can not be used. Either because the network device is dedicated after it has been created or because it used by a process that is heavily sandboxed and couldn't create network devices itself.): drwxr-xr-x 5 root root 0 Jan 25 18:12 . drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 .. -r--r--r-- 1 root root 4096 Jan 25 18:12 addr_assign_type -r--r--r-- 1 root root 4096 Jan 25 18:12 addr_len -r--r--r-- 1 root root 4096 Jan 25 18:12 address -r--r--r-- 1 root root 4096 Jan 25 18:12 broadcast -rw-r--r-- 1 root root 4096 Jan 25 18:12 carrier -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_changes -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_down_count -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_up_count -r--r--r-- 1 root root 4096 Jan 25 18:12 dev_id -r--r--r-- 1 root root 4096 Jan 25 18:12 dev_port -r--r--r-- 1 root root 4096 Jan 25 18:12 dormant -r--r--r-- 1 root root 4096 Jan 25 18:12 duplex -rw-r--r-- 1 root root 4096 Jan 25 18:12 flags -rw-r--r-- 1 root root 4096 Jan 25 18:12 gro_flush_timeout -rw-r--r-- 1 root root 4096 Jan 25 18:12 ifalias -r--r--r-- 1 root root 4096 Jan 25 18:12 ifindex -r--r--r-- 1 root root 4096 Jan 25 18:12 iflink -r--r--r-- 1 root root 4096 Jan 25 18:12 link_mode -rw-r--r-- 1 root root 4096 Jan 25 18:12 mtu -r--r--r-- 1 root root 4096 Jan 25 18:12 name_assign_type -rw-r--r-- 1 root root 4096 Jan 25 18:12 netdev_group -r--r--r-- 1 root root 4096 Jan 25 18:12 operstate -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_id -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_name -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_switch_id drwxr-xr-x 2 root root 0 Jan 25 18:12 power -rw-r--r-- 1 root root 4096 Jan 25 18:12 proto_down drwxr-xr-x 4 root root 0 Jan 25 18:12 queues -r--r--r-- 1 root root 4096 Jan 25 18:12 speed drwxr-xr-x 2 root root 0 Jan 25 18:12 statistics lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:12 subsystem -> ../../../../class/net -rw-r--r-- 1 root root 4096 Jan 25 18:12 tx_queue_len -r--r--r-- 1 root root 4096 Jan 25 18:12 type -rw-r--r-- 1 root root 4096 Jan 25 18:12 uevent Now, when creating a network device in a network namespace owned by a user namespace and moving it to the host the permissions will be set to the id that the user namespace root user has been mapped to on the host leading to all sorts of permission issues mentioned above: 458752 drwxr-xr-x 5 458752 458752 0 Jan 25 18:12 . drwxr-xr-x 9 root root 0 Jan 25 18:08 .. -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_assign_type -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_len -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 address -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 broadcast -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_changes -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_down_count -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_up_count -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_id -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_port -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dormant -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 duplex -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 flags -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 gro_flush_timeout -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 ifalias -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 ifindex -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 iflink -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 link_mode -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 mtu -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 name_assign_type -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 netdev_group -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 operstate -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_id -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_name -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_switch_id drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 power -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 proto_down drwxr-xr-x 4 458752 458752 0 Jan 25 18:12 queues -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 speed drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 statistics lrwxrwxrwx 1 root root 0 Jan 25 18:12 subsystem -> ../../../../class/net -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 tx_queue_len -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 type -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 uevent Fix this by changing the basic sysfs files associated with network devices when moving them between network namespaces. To this end we add some infrastructure to sysfs. The patchset takes care to only do this when the owning user namespaces changes and the kids differ. So there's only a performance overhead, when the owning user namespace of the network namespace is different __and__ the kid mappings for the root user are different for the two user namespaces: Assume we have a netdev eth0 which we create in netns1 owned by userns1. userns1 has an id mapping of 0 100000 100000. Now we move eth0 into netns2 which is owned by userns2 which also defines an id mapping of 0 100000 100000. In this case sysfs doesn't need updating. The patch will handle this case and not do any needless work. Now assume eth0 is moved into netns3 which is owned by userns3 which defines an id mapping of 0 123456 65536. In this case the root user in each namespace corresponds to different kid and sysfs needs updating. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: fix sysfs permssions when device changes network namespaceChristian Brauner
Now that we moved all the helpers in place and make use netdev_change_owner() to fixup the permissions when moving network devices between network namespaces. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net-sysfs: add queue_change_owner()Christian Brauner
Add a function to change the owner of the queue entries for a network device when it is moved between network namespaces. Currently, when moving network devices between network namespaces the ownership of the corresponding queue sysfs entries are not changed. This leads to problems when tools try to operate on the corresponding sysfs files. Fix this. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net-sysfs: add netdev_change_owner()Christian Brauner
Add a function to change the owner of a network device when it is moved between network namespaces. Currently, when moving network devices between network namespaces the ownership of the corresponding sysfs entries is not changed. This leads to problems when tools try to operate on the corresponding sysfs files. This leads to a bug whereby a network device that is created in a network namespaces owned by a user namespace will have its corresponding sysfs entry owned by the root user of the corresponding user namespace. If such a network device has to be moved back to the host network namespace the permissions will still be set to the user namespaces. This means unprivileged users can e.g. trigger uevents for such incorrectly owned devices. They can also modify the settings of the device itself. Both of these things are unwanted. For example, workloads will create network devices in the host network namespace. Other tools will then proceed to move such devices between network namespaces owner by other user namespaces. While the ownership of the device itself is updated in net/core/net-sysfs.c:dev_change_net_namespace() the corresponding sysfs entry for the device is not: drwxr-xr-x 5 nobody nobody 0 Jan 25 18:08 . drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 .. -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_assign_type -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_len -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 address -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 broadcast -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_changes -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_down_count -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_up_count -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_id -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_port -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dormant -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 duplex -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 flags -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 gro_flush_timeout -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifalias -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifindex -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 iflink -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 link_mode -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 mtu -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 name_assign_type -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 netdev_group -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 operstate -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_id -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_name -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_switch_id drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 power -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 proto_down drwxr-xr-x 4 nobody nobody 0 Jan 25 18:09 queues -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 speed drwxr-xr-x 2 nobody nobody 0 Jan 25 18:09 statistics lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:08 subsystem -> ../../../../class/net -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 tx_queue_len -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 type -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:08 uevent However, if a device is created directly in the network namespace then the device's sysfs permissions will be correctly updated: drwxr-xr-x 5 root root 0 Jan 25 18:12 . drwxr-xr-x 9 nobody nobody 0 Jan 25 18:08 .. -r--r--r-- 1 root root 4096 Jan 25 18:12 addr_assign_type -r--r--r-- 1 root root 4096 Jan 25 18:12 addr_len -r--r--r-- 1 root root 4096 Jan 25 18:12 address -r--r--r-- 1 root root 4096 Jan 25 18:12 broadcast -rw-r--r-- 1 root root 4096 Jan 25 18:12 carrier -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_changes -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_down_count -r--r--r-- 1 root root 4096 Jan 25 18:12 carrier_up_count -r--r--r-- 1 root root 4096 Jan 25 18:12 dev_id -r--r--r-- 1 root root 4096 Jan 25 18:12 dev_port -r--r--r-- 1 root root 4096 Jan 25 18:12 dormant -r--r--r-- 1 root root 4096 Jan 25 18:12 duplex -rw-r--r-- 1 root root 4096 Jan 25 18:12 flags -rw-r--r-- 1 root root 4096 Jan 25 18:12 gro_flush_timeout -rw-r--r-- 1 root root 4096 Jan 25 18:12 ifalias -r--r--r-- 1 root root 4096 Jan 25 18:12 ifindex -r--r--r-- 1 root root 4096 Jan 25 18:12 iflink -r--r--r-- 1 root root 4096 Jan 25 18:12 link_mode -rw-r--r-- 1 root root 4096 Jan 25 18:12 mtu -r--r--r-- 1 root root 4096 Jan 25 18:12 name_assign_type -rw-r--r-- 1 root root 4096 Jan 25 18:12 netdev_group -r--r--r-- 1 root root 4096 Jan 25 18:12 operstate -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_id -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_port_name -r--r--r-- 1 root root 4096 Jan 25 18:12 phys_switch_id drwxr-xr-x 2 root root 0 Jan 25 18:12 power -rw-r--r-- 1 root root 4096 Jan 25 18:12 proto_down drwxr-xr-x 4 root root 0 Jan 25 18:12 queues -r--r--r-- 1 root root 4096 Jan 25 18:12 speed drwxr-xr-x 2 root root 0 Jan 25 18:12 statistics lrwxrwxrwx 1 nobody nobody 0 Jan 25 18:12 subsystem -> ../../../../class/net -rw-r--r-- 1 root root 4096 Jan 25 18:12 tx_queue_len -r--r--r-- 1 root root 4096 Jan 25 18:12 type -rw-r--r-- 1 root root 4096 Jan 25 18:12 uevent Now, when creating a network device in a network namespace owned by a user namespace and moving it to the host the permissions will be set to the id that the user namespace root user has been mapped to on the host leading to all sorts of permission issues: 458752 drwxr-xr-x 5 458752 458752 0 Jan 25 18:12 . drwxr-xr-x 9 root root 0 Jan 25 18:08 .. -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_assign_type -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 addr_len -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 address -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 broadcast -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_changes -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_down_count -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 carrier_up_count -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_id -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dev_port -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 dormant -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 duplex -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 flags -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 gro_flush_timeout -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 ifalias -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 ifindex -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 iflink -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 link_mode -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 mtu -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 name_assign_type -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 netdev_group -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 operstate -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_id -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_port_name -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 phys_switch_id drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 power -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 proto_down drwxr-xr-x 4 458752 458752 0 Jan 25 18:12 queues -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 speed drwxr-xr-x 2 458752 458752 0 Jan 25 18:12 statistics lrwxrwxrwx 1 root root 0 Jan 25 18:12 subsystem -> ../../../../class/net -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 tx_queue_len -r--r--r-- 1 458752 458752 4096 Jan 25 18:12 type -rw-r--r-- 1 458752 458752 4096 Jan 25 18:12 uevent Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26drivers/base/power: add dpm_sysfs_change_owner()Christian Brauner
Add a helper to change the owner of a device's power entries. This needs to happen when the ownership of a device is changed, e.g. when moving network devices between network namespaces. This function will be used to correctly account for ownership changes, e.g. when moving network devices between network namespaces. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26device: add device_change_owner()Christian Brauner
Add a helper to change the owner of a device's sysfs entries. This needs to happen when the ownership of a device is changed, e.g. when moving network devices between network namespaces. This function will be used to correctly account for ownership changes, e.g. when moving network devices between network namespaces. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26sysfs: add sysfs_change_owner()Christian Brauner
Add a helper to change the owner of sysfs objects. This function will be used to correctly account for kobject ownership changes, e.g. when moving network devices between network namespaces. This mirrors how a kobject is added through driver core which in its guts is done via kobject_add_internal() which in summary creates the main directory via create_dir(), populates that directory with the groups associated with the ktype of the kobject (if any) and populates the directory with the basic attributes associated with the ktype of the kobject (if any). These are the basic steps that are associated with adding a kobject in sysfs. Any additional properties are added by the specific subsystem itself (not by driver core) after it has registered the device. So for the example of network devices, a network device will e.g. register a queue subdirectory under the basic sysfs directory for the network device and than further subdirectories within that queues subdirectory. But that is all specific to network devices and they call the corresponding sysfs functions to do that directly when they create those queue objects. So anything that a subsystem adds outside of what driver core does must also be changed by it (That's already true for removal of files it created outside of driver core.) and it's the same for ownership changes. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26sysfs: add sysfs_group{s}_change_owner()Christian Brauner
Add helpers to change the owner of sysfs groups. This function will be used to correctly account for kobject ownership changes, e.g. when moving network devices between network namespaces. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26sysfs: add sysfs_link_change_owner()Christian Brauner
Add a helper to change the owner of a sysfs link. This function will be used to correctly account for kobject ownership changes, e.g. when moving network devices between network namespaces. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26sysfs: add sysfs_file_change_owner()Christian Brauner
Add helpers to change the owner of a sysfs files. This function will be used to correctly account for kobject ownership changes, e.g. when moving network devices between network namespaces. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: bcmgenet: Clear ID_MODE_DIS in EXT_RGMII_OOB_CTRL when not neededNicolas Saenz Julienne
Outdated Raspberry Pi 4 firmware might configure the external PHY as rgmii although the kernel currently sets it as rgmii-rxid. This makes connections unreliable as ID_MODE_DIS is left enabled. To avoid this, explicitly clear that bit whenever we don't need it. Fixes: da38802211cc ("net: bcmgenet: Add RGMII_RXID support") Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26sched: act: count in the size of action flags bitfieldJiri Pirko
The put of the flags was added by the commit referenced in fixes tag, however the size of the message was not extended accordingly. Fix this by adding size of the flags bitfield to the message size. Fixes: e38226786022 ("net: sched: update action implementations to support flags") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27kbuild: get rid of trailing slash from subdir- exampleMasahiro Yamada
obj-* needs a trailing slash for a directory, but subdir-* does not. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-02-26net: core: devlink.c: Use built-in RCU list checkingMadhuparna Bhowmik
list_for_each_entry_rcu() has built-in RCU and lock checking. Pass cond argument to list_for_each_entry_rcu() to silence false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled. The devlink->lock is held when devlink_dpipe_table_find() is called in non RCU read side section. Therefore, pass struct devlink to devlink_dpipe_table_find() for lockdep checking. Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: cisco: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] Lastly, fix the following checkpatch warning: CHECK: Prefer kernel type 'u32' over 'u_int32_t' #61: FILE: drivers/net/ethernet/cisco/enic/vnic_devcmd.h:653: + u_int32_t val[]; This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: marvell: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: hns: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26sfc: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Martin Habets <mhabets@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26qlogic: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was detected with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: dsa: bcm_sf2: Forcibly configure IMP port for 1Gb/secFlorian Fainelli
We are still experiencing some packet loss with the existing advanced congestion buffering (ACB) settings with the IMP port configured for 2Gb/sec, so revert to conservative link speeds that do not produce packet loss until this is resolved. Fixes: 8f1880cbe8d0 ("net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec") Fixes: de34d7084edd ("net: dsa: bcm_sf2: Only 7278 supports 2Gb/sec IMP port") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26Revert "net: dsa: bcm_sf2: Also configure Port 5 for 2Gb/sec on 7278"Florian Fainelli
This reverts commit 7458bd540fa0a90220b9e8c349d910d9dde9caf8 ("net: dsa: bcm_sf2: Also configure Port 5 for 2Gb/sec on 7278") as it causes advanced congestion buffering issues with 7278 switch devices when using their internal Giabit PHY. While this is being debugged, continue with conservative defaults that work and do not cause packet loss. Fixes: 7458bd540fa0 ("net: dsa: bcm_sf2: Also configure Port 5 for 2Gb/sec on 7278") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes: 1) Perform garbage collection from workqueue to fix rcu detected stall in ipset hash set types, from Jozsef Kadlecsik. 2) Fix the forceadd evaluation path, also from Jozsef. 3) Fix nft_set_pipapo selftest, from Stefano Brivio. 4) Crash when add-flush-add element in pipapo set, also from Stefano. Add test to cover this crash. 5) Remove sysctl entry under mutex in hashlimit, from Cong Wang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26Merge tag 'tag-chrome-platform-fixes-for-v5.6-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux Pull chrome platform fix from Benson Leung: "Fix a build warning" * tag 'tag-chrome-platform-fixes-for-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux: platform/chrome: wilco_ec: Include asm/unaligned instead of linux/ path
2020-02-26bnxt_en: add newline to netdev_*() format stringsJonathan Lemon
Add missing newlines to netdev_* format strings so the lines aren't buffered by the printk subsystem. Nitpicked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26netfilter: xt_hashlimit: unregister proc file before releasing mutexCong Wang
Before releasing the global mutex, we only unlink the hashtable from the hash list, its proc file is still not unregistered at this point. So syzbot could trigger a race condition where a parallel htable_create() could register the same file immediately after the mutex is released. Move htable_remove_proc_entry() back to mutex protection to fix this. And, fold htable_destroy() into htable_put() to make the code slightly easier to understand. Reported-and-tested-by: syzbot+d195fd3b9a364ddd6731@syzkaller.appspotmail.com Fixes: c4a3922d2d20 ("netfilter: xt_hashlimit: reduce hashlimit_mutex scope for htable_put()") Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-26ethtool: limit bitset sizeMichal Kubecek
Syzbot reported that ethnl_compact_sanity_checks() can be tricked into reading past the end of ETHTOOL_A_BITSET_VALUE and ETHTOOL_A_BITSET_MASK attributes and even the message by passing a value between (u32)(-31) and (u32)(-1) as ETHTOOL_A_BITSET_SIZE. The problem is that DIV_ROUND_UP(attr_nbits, 32) is 0 for such values so that zero length ETHTOOL_A_BITSET_VALUE will pass the length check but ethnl_bitmap32_not_zero() check would try to access up to 512 MB of attribute "payload". Prevent this overflow byt limiting the bitset size. Technically, compact bitset format would allow bitset sizes up to almost 2^18 (so that the nest size does not exceed U16_MAX) but bitsets used by ethtool are much shorter. S16_MAX, the largest value which can be directly used as an upper limit in policy, should be a reasonable compromise. Fixes: 10b518d4e6dd ("ethtool: netlink bitset handling") Reported-by: syzbot+7fd4ed5b4234ab1fdccd@syzkaller.appspotmail.com Reported-by: syzbot+709b7a64d57978247e44@syzkaller.appspotmail.com Reported-by: syzbot+983cb8fb2d17a7af549d@syzkaller.appspotmail.com Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26net: Fix Tx hash bound checkingAmritha Nambiar
Fixes the lower and upper bounds when there are multiple TCs and traffic is on the the same TC on the same device. The lower bound is represented by 'qoffset' and the upper limit for hash value is 'qcount + qoffset'. This gives a clean Rx to Tx queue mapping when there are multiple TCs, as the queue indices for upper TCs will be offset by 'qoffset'. v2: Fixed commit description based on comments. Fixes: 1b837d489e06 ("net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash") Fixes: eadec877ce9c ("net: Add support for subordinate traffic classes to netdev_pick_tx") Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26Merge tag 'trace-v5.6-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing and bootconfig updates: "Fixes and changes to bootconfig before it goes live in a release. Change in API of bootconfig (before it comes live in a release): - Have a magic value "BOOTCONFIG" in initrd to know a bootconfig exists - Set CONFIG_BOOT_CONFIG to 'n' by default - Show error if "bootconfig" on cmdline but not compiled in - Prevent redefining the same value - Have a way to append values - Added a SELECT BLK_DEV_INITRD to fix a build failure Synthetic event fixes: - Switch to raw_smp_processor_id() for recording CPU value in preempt section. (No care for what the value actually is) - Fix samples always recording u64 values - Fix endianess - Check number of values matches number of fields - Fix a printing bug Fix of trace_printk() breaking postponed start up tests Make a function static that is only used in a single file" * tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue bootconfig: Add append value operator support bootconfig: Prohibit re-defining value on same key bootconfig: Print array as multiple commands for legacy command line bootconfig: Reject subkey and value on same parent key tools/bootconfig: Remove unneeded error message silencer bootconfig: Add bootconfig magic word for indicating bootconfig explicitly bootconfig: Set CONFIG_BOOT_CONFIG=n by default tracing: Clear trace_state when starting trace bootconfig: Mark boot_config_checksum() static tracing: Disable trace_printk() on post poned tests tracing: Have synthetic event test use raw_smp_processor_id() tracing: Fix number printing bug in print_synth_event() tracing: Check that number of vals matches number of synth event fields tracing: Make synth_event trace functions endian-correct tracing: Make sure synth_event_trace() example always uses u64
2020-02-26Merge tag 'linux-kselftest-kunit-5.6-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull Kunit fixes from Shuah Khan: "This Kselftest kunit update consists of fixes to documentation and the run-time tool from Brendan Higgins and Heidi Fahim" * tag 'linux-kselftest-kunit-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit: run kunit_tool from any directory kunit: test: Improve error messages for kunit_tool when kunitconfig is invalid Documentation: kunit: fixed sphinx error in code block
2020-02-26Merge tag 'linux-kselftest-5.6-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull Kselftest fixes from Shuah Khan: - fixes to TIMEOUT failures and out-of-tree compilation compilation errors from Michael Ellerman. - declutter git status fix from Christophe Leroy * tag 'linux-kselftest-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests/rseq: Fix out-of-tree compilation selftests: Install settings files to fix TIMEOUT failures selftest/lkdtm: Don't pollute 'git status'
2020-02-26Revert "KVM: x86: enable -Werror"Christoph Hellwig
This reverts commit ead68df94d248c80fdbae220ae5425eb5af2e753. Using the -Werror flag breaks the build for me due to mostly harmless KASAN or similar warnings: arch/x86/kvm/x86.c: In function ‘kvm_timer_init’: arch/x86/kvm/x86.c:7209:1: error: the frame size of 1112 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] Feel free to add a CONFIG_WERROR if you care strong enough, but don't break peoples builds for absolutely no good reason. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-26signal: avoid double atomic counter increments for user accountingLinus Torvalds
When queueing a signal, we increment both the users count of pending signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount of the user struct itself (because we keep a reference to the user in the signal structure in order to correctly account for it when freeing). That turns out to be fairly expensive, because both of them are atomic updates, and particularly under extreme signal handling pressure on big machines, you can get a lot of cache contention on the user struct. That can then cause horrid cacheline ping-pong when you do these multiple accesses. So change the reference counting to only pin the user for the _first_ pending signal, and to unpin it when the last pending signal is dequeued. That means that when a user sees a lot of concurrent signal queuing - which is the only situation when this matters - the only atomic access needed is generally the 'sigpending' count update. This was noticed because of a particularly odd timing artifact on a dual-socket 96C/192T Cascade Lake platform: when you get into bad contention, on that machine for some reason seems to be much worse when the contention happens in the upper 32-byte half of the cacheline. As a result, the kernel test robot will-it-scale 'signal1' benchmark had an odd performance regression simply due to random alignment of the 'struct user_struct' (and pointed to a completely unrelated and apparently nonsensical commit for the regression). Avoiding the double increments (and decrements on the dequeueing side, of course) makes for much less contention and hugely improved performance on that will-it-scale microbenchmark. Quoting Feng Tang: "It makes a big difference, that the performance score is tripled! bump from original 17000 to 54000. Also the gap between 5.0-rc6 and 5.0-rc6+Jiri's patch is reduced to around 2%" [ The "2% gap" is the odd cacheline placement difference on that platform: under the extreme contention case, the effect of which half of the cacheline was hot was 5%, so with the reduced contention the odd timing artifact is reduced too ] It does help in the non-contended case too, but is not nearly as noticeable. Reported-and-tested-by: Feng Tang <feng.tang@intel.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Huang, Ying <ying.huang@intel.com> Cc: Philip Li <philip.li@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-26Merge branch 'bpf-bpftool-probes'Daniel Borkmann
Michal Rostecki says: ==================== Feature probes in bpftool related to bpf_probe_write_user and bpf_trace_printk helpers emit dmesg warnings which might be confusing for people running bpftool on production environments. This patch series addresses that by filtering them out by default and introducing the new positional argument "full" which enables all available probes. The main motivation behind those changes is ability the fact that some probes (for example those related to "trace" or "write_user" helpers) emit dmesg messages which might be confusing for people who are running on production environments. For details see the Cilium issue[0]. v1 -> v2: - Do not expose regex filters to users, keep filtering logic internal, expose only the "full" option for including probes which emit dmesg warnings. v2 -> v3: - Do not use regex for filtering out probes, use function IDs directly. - Fix bash completion - in v2 only "prefix" was proposed after "macros", "dev" and "kernel" were not. - Rephrase the man page paragraph, highlight helper function names. - Remove tests which parse the plain output of bpftool (except the header/macros test), focus on testing JSON output instead. - Add test which compares the output with and without "full" option. v3 -> v4: - Use enum to check for helper functions. - Make selftests compatible with older versions of Python 3.x than 3.7. [0] https://github.com/cilium/cilium/issues/10048 ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2020-02-26selftests/bpf: Add test for "bpftool feature" commandMichal Rostecki
Add Python module with tests for "bpftool feature" command, which mainly checks whether the "full" option is working properly. Signed-off-by: Michal Rostecki <mrostecki@opensuse.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200226165941.6379-6-mrostecki@opensuse.org
2020-02-26bpftool: Update bash completion for "bpftool feature" commandMichal Rostecki
Update bash completion for "bpftool feature" command with the new argument: "full". Signed-off-by: Michal Rostecki <mrostecki@opensuse.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200226165941.6379-5-mrostecki@opensuse.org
2020-02-26bpftool: Update documentation of "bpftool feature" commandMichal Rostecki
Update documentation of "bpftool feature" command with information about new arguments: "full". Signed-off-by: Michal Rostecki <mrostecki@opensuse.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200226165941.6379-4-mrostecki@opensuse.org
2020-02-26bpftool: Make probes which emit dmesg warnings optionalMichal Rostecki
Probes related to bpf_probe_write_user and bpf_trace_printk helpers emit dmesg warnings which might be confusing for people running bpftool on production environments. This change filters them out by default and introduces the new positional argument "full" which enables all available probes. Signed-off-by: Michal Rostecki <mrostecki@opensuse.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200226165941.6379-3-mrostecki@opensuse.org
2020-02-26bpftool: Move out sections to separate functionsMichal Rostecki
Remove all calls of print_end_then_start_section function and for loops out from the do_probe function. Instead, provide separate functions for each section (like i.e. section_helpers) which are called in do_probe. This change is motivated by better readability. Signed-off-by: Michal Rostecki <mrostecki@opensuse.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20200226165941.6379-2-mrostecki@opensuse.org
2020-02-27kbuild: add dt_binding_check to PHONY in a correct placeMasahiro Yamada
The dt_binding_check is added to PHONY, but it is invisible when $(dtstree) is empty. So, it is not specified as phony for ARCH=x86 etc. Add it to PHONY outside the ifneq ... endif block. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Rob Herring <robh@kernel.org>
2020-02-27kbuild: add dtbs_check to PHONYMasahiro Yamada
The dtbs_check should be a phony target, but currently it is not specified so. 'make dtbs_check' works even if a file named 'dtbs_check' exists because it depends on another phony target, scripts_dtc, but we should not rely on it. Add dtbs_check to PHONY. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Rob Herring <robh@kernel.org>
2020-02-27kbuild: remove unneeded semicolon at the end of cmd_dtb_checkMasahiro Yamada
This trailing semicolon is unneeded. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Rob Herring <robh@kernel.org>
2020-02-27kbuild: fix DT binding schema rule to detect command line changesMasahiro Yamada
This if_change_rule is not working properly; it cannot detect any command line change. The reason is because cmd-check in scripts/Kbuild.include compares $(cmd_$@) and $(cmd_$1), but cmd_dtc_dt_yaml does not exist here. For if_change_rule to work properly, the stem part of cmd_* and rule_* must match. Because this cmd_and_fixdep invokes cmd_dtc, this rule must be named rule_dtc. Fixes: 4f0e3a57d6eb ("kbuild: Add support for DT binding schema checks") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Acked-by: Rob Herring <robh@kernel.org>
2020-02-27kbuild: remove wrong documentation about mandatory-yMasahiro Yamada
This sentence does not make sense in the section about mandatory-y. This seems to be a copy-paste mistake of commit fcc8487d477a ("uapi: export all headers under uapi directories"). The correct description would be "The convention is to list one mandatory-y per line ...". I just removed it instead of fixing it. If such information is needed, it could be commented in include/asm-generic/Kbuild and include/uapi/asm-generic/Kbuild. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-02-27kbuild: add comment for V=2 modeRandy Dunlap
Complete the comments for valid values of KBUILD_VERBOSE, specifically for KBUILD_VERBOSE=2. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-02-26iavf: use tc_cls_can_offload_and_chain0() instead of chain checkJiri Pirko
Looks like the iavf code actually experienced a race condition, when a developer took code before the check for chain 0 was put to helper. So use tc_cls_can_offload_and_chain0() helper instead of direct check and move the check to _cb() so this is similar to i40e code. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26bpftool: Support struct_ops, tracing, ext prog typesAndrey Ignatov
Add support for prog types that were added to kernel but not present in bpftool yet: struct_ops, tracing, ext prog types and corresponding section names. Before: # bpftool p l ... 184: type 26 name test_subprog3 tag dda135a7dc0daf54 gpl loaded_at 2020-02-25T13:28:33-0800 uid 0 xlated 112B jited 103B memlock 4096B map_ids 136 btf_id 85 185: type 28 name new_get_skb_len tag d2de5b87d8e5dc49 gpl loaded_at 2020-02-25T13:28:33-0800 uid 0 xlated 72B jited 69B memlock 4096B map_ids 136 btf_id 85 After: # bpftool p l ... 184: tracing name test_subprog3 tag dda135a7dc0daf54 gpl loaded_at 2020-02-25T13:28:33-0800 uid 0 xlated 112B jited 103B memlock 4096B map_ids 136 btf_id 85 185: ext name new_get_skb_len tag d2de5b87d8e5dc49 gpl loaded_at 2020-02-25T13:28:33-0800 uid 0 xlated 72B jited 69B memlock 4096B map_ids 136 btf_id 85 Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200225223441.689109-1-rdna@fb.com
2020-02-26scripts/bpf: Switch to more portable python3 shebangScott Branden
Change "/usr/bin/python3" to "/usr/bin/env python3" for more portable solution in bpf_helpers_doc.py. Signed-off-by: Scott Branden <scott.branden@broadcom.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200225205426.6975-1-scott.branden@broadcom.com
2020-02-26selftests: nft_concat_range: Add test for reported add/flush/add issueStefano Brivio
Add a specific test for the crash reported by Phil Sutter and addressed in the previous patch. The test cases that, in my intention, should have covered these cases, that is, the ones from the 'concurrency' section, don't run these sequences tightly enough and spectacularly failed to catch this. While at it, define a convenient way to add these kind of tests, by adding a "reported issues" test section. It's more convenient, for this particular test, to execute the set setup in its own function. However, future test cases like this one might need to call setup functions, and will typically need no tools other than nft, so allow for this in check_tools(). The original form of the reproducer used here was provided by Phil. Reported-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-26nft_set_pipapo: Actually fetch key data in nft_pipapo_remove()Stefano Brivio
Phil reports that adding elements, flushing and re-adding them right away: nft add table t '{ set s { type ipv4_addr . inet_service; flags interval; }; }' nft add element t s '{ 10.0.0.1 . 22-25, 10.0.0.1 . 10-20 }' nft flush set t s nft add element t s '{ 10.0.0.1 . 10-20, 10.0.0.1 . 22-25 }' triggers, almost reliably, a crash like this one: [ 71.319848] general protection fault, probably for non-canonical address 0x6f6b6e696c2e756e: 0000 [#1] PREEMPT SMP PTI [ 71.321540] CPU: 3 PID: 1201 Comm: kworker/3:2 Not tainted 5.6.0-rc1-00377-g2bb07f4e1d861 #192 [ 71.322746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014 [ 71.324430] Workqueue: events nf_tables_trans_destroy_work [nf_tables] [ 71.325387] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables] [ 71.326164] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b [ 71.328423] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282 [ 71.329225] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0 [ 71.330365] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a [ 71.331473] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000 [ 71.332627] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2 [ 71.333615] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0 [ 71.334596] FS: 0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000 [ 71.335780] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 71.336577] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0 [ 71.337533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 71.338557] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 71.339718] Call Trace: [ 71.340093] nft_pipapo_destroy+0x7a/0x170 [nf_tables_set] [ 71.340973] nft_set_destroy+0x20/0x50 [nf_tables] [ 71.341879] nf_tables_trans_destroy_work+0x246/0x260 [nf_tables] [ 71.342916] process_one_work+0x1d5/0x3c0 [ 71.343601] worker_thread+0x4a/0x3c0 [ 71.344229] kthread+0xfb/0x130 [ 71.344780] ? process_one_work+0x3c0/0x3c0 [ 71.345477] ? kthread_park+0x90/0x90 [ 71.346129] ret_from_fork+0x35/0x40 [ 71.346748] Modules linked in: nf_tables_set nf_tables nfnetlink 8021q [last unloaded: nfnetlink] [ 71.348153] ---[ end trace 2eaa8149ca759bcc ]--- [ 71.349066] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables] [ 71.350016] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b [ 71.350017] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282 [ 71.350019] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0 [ 71.350019] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a [ 71.350020] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000 [ 71.350021] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2 [ 71.350022] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0 [ 71.350025] FS: 0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000 [ 71.350026] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 71.350027] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0 [ 71.350028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 71.350028] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 71.350030] Kernel panic - not syncing: Fatal exception [ 71.350412] Kernel Offset: disabled [ 71.365922] ---[ end Kernel panic - not syncing: Fatal exception ]--- which is caused by dangling elements that have been deactivated, but never removed. On a flush operation, nft_pipapo_walk() walks through all the elements in the mapping table, which are then deactivated by nft_flush_set(), one by one, and added to the commit list for removal. Element data is then freed. On transaction commit, nft_pipapo_remove() is called, and failed to remove these elements, leading to the stale references in the mapping. The first symptom of this, revealed by KASan, is a one-byte use-after-free in subsequent calls to nft_pipapo_walk(), which is usually not enough to trigger a panic. When stale elements are used more heavily, though, such as double-free via nft_pipapo_destroy() as in Phil's case, the problem becomes more noticeable. The issue comes from that fact that, on a flush operation, nft_pipapo_remove() won't get the actual key data via elem->key, elements to be deleted upon commit won't be found by the lookup via pipapo_get(), and removal will be skipped. Key data should be fetched via nft_set_ext_key(), instead. Reported-by: Phil Sutter <phil@nwl.cc> Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>