Age | Commit message (Collapse) | Author |
|
Breno Leitao says:
====================
net: collect tstats automatically
The commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf") added a field in struct_netdevice, which tells what
type of statistics the driver supports.
That field is used primarily to allocate stats structures automatically,
but, it also could leveraged to simplify the drivers even further, such
as, if the driver relies in the default stats collection, then it
doesn't need to assign to .ndo_get_stats64. That means that drivers only
assign functions to .ndo_get_stats64 if they are using something
special.
I started to move some of these drivers[1][2][3] to use the core
allocation, and with this change in, I just need to touch the driver
once, and be able to simplify the whole stats allocation and collection
for generic case.
There are 44 devices today that could benefit from this simplification.
# grep -r .ndo_get_stats64 | grep dev_get_tstats64 | wc -l
44
As of today, netnext only has the `sit` driver fully ported to core
stats allocation, hence the second patch.
Links:
[1] https://lore.kernel.org/all/20240227182338.2739884-1-leitao@debian.org/
[2] https://lore.kernel.org/all/20240222144117.1370101-1-leitao@debian.org/
[3] https://lore.kernel.org/all/20240223115839.3572852-1-leitao@debian.org/
====================
Link: https://lore.kernel.org/r/20240228113125.3473685-1-leitao@debian.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
If the driver is using the network core allocation mechanism, by setting
NETDEV_PCPU_STAT_TSTATS, as this driver is, then, it doesn't need to set
the dev_get_tstats64() generic .ndo_get_stats64 function pointer. Since
the network core calls it automatically, and .ndo_get_stats64 should
only be set if the driver needs special treatment.
This simplifies the driver, since all the generic statistics is now
handled by core.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
If the network driver is relying in the net core to do stats allocation,
then we want to dev_get_tstats64() instead of netdev_stats_to_stats64(),
since there are per-cpu stats that needs to be taken in consideration.
This will also simplify the drivers in regard to statistics. Once the
driver sets NETDEV_PCPU_STAT_TSTATS, it doesn't not need to allocate the
stacks, neither it needs to set `.ndo_get_stats64 = dev_get_tstats64`
for the generic stats collection function anymore.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This is just a trivial fix for a typo in a comment, no functional
changes.
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Link: https://lore.kernel.org/r/20240228112447.1490926-1-siyanteng@loongson.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
Patch #1 restores NFPROTO_INET with nft_compat, from Ignat Korchagin.
Patch #2 fixes an issue with bridge netfilter and broadcast/multicast
packets.
There is a day 0 bug in br_netfilter when used with connection tracking.
Conntrack assumes that an nf_conn structure that is not yet added to
hash table ("unconfirmed"), is only visible by the current cpu that is
processing the sk_buff.
For bridge this isn't true, sk_buff can get cloned in between, and
clones can be processed in parallel on different cpu.
This patch disables NAT and conntrack helpers for multicast packets.
Patch #3 adds a selftest to cover for the br_netfilter bug.
netfilter pull request 24-02-29
* tag 'nf-24-02-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: add bridge conntrack + multicast test case
netfilter: bridge: confirm multicast packets before passing them up the stack
netfilter: nf_tables: allow NFPROTO_INET in nft_(match/target)_validate()
====================
Link: https://lore.kernel.org/r/20240229000135.8780-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Current HSR implementation uses following supervisory frame (even for
HSRv1 the HSR tag is not is not present):
00000000: 01 15 4e 00 01 2d XX YY ZZ 94 77 10 88 fb 00 01
00000010: 7e 1c 17 06 XX YY ZZ 94 77 10 1e 06 XX YY ZZ 94
00000020: 77 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 00 00 00 00 00 00
The current code adds extra two bytes (i.e. sizeof(struct hsr_sup_tlv))
when offset for skb_pull() is calculated.
This is wrong, as both 'struct hsrv1_ethhdr_sp' and 'hsrv0_ethhdr_sp'
already have 'struct hsr_sup_tag' defined in them, so there is no need
for adding extra two bytes.
This code was working correctly as with no RedBox support, the check for
HSR_TLV_EOT (0x00) was off by two bytes, which were corresponding to
zeroed padded bytes for minimal packet size.
Fixes: eafaa88b3eb7 ("net: hsr: Add support for redbox supervision frames")
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240228085644.3618044-1-lukma@denx.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The input parameter 'level' in do_raw_set/getsockopt() is not used.
Therefore, remove it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240228072505.640550-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Robert Marko says:
====================
net: dsa: mv88e6xxx: add Amethyst specific SMI GPIO function
Amethyst family (MV88E6191X/6193X/6393X) has a simplified SMI GPIO setting
via the Scratch and Misc register so it requires family specific function.
In the v1 review, Andrew pointed out that it would make sense to rename the
existing mv88e6xxx_g2_scratch_gpio_set_smi as it only works on the MV6390
family.
Changes in v2:
* Add rename of mv88e6xxx_g2_scratch_gpio_set_smi to
mv88e6390_g2_scratch_gpio_set_smi
====================
Link: https://lore.kernel.org/r/20240227175457.2766628-1-robimarko@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The existing mv88e6390_g2_scratch_gpio_set_smi() cannot be used on the
88E6393X as it requires certain P0_MODE, it also checks the CPU mode
as it impacts the bit setting value.
This is all irrelevant for Amethyst (MV88E6191X/6193X/6393X) as only
the default value of the SMI_PHY Config bit is set to CPU_MGD bootstrap
pin value but it can be changed without restrictions so that GPIO pins
9 and 10 are used as SMI pins.
So, introduce Amethyst specific function and call that if the Amethyst
family wants to setup the external PHY.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Robert Marko <robimarko@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The name mv88e6xxx_g2_scratch_gpio_set_smi is a bit ambiguous as it appears
to only be applicable to the 6390 family, so lets rename it to
mv88e6390_g2_scratch_gpio_set_smi to make it more obvious.
Signed-off-by: Robert Marko <robimarko@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
I missed that inet6_dump_addr() is calling in6_dump_addrs()
from two points.
First one under RTNL protection, and second one under rcu_read_lock().
Since we want to remove RTNL use from inet6_dump_addr() very soon,
no longer assume in6_dump_addrs() is protected by RTNL (even
if this is still the case).
Use rcu_read_lock() earlier to fix this lockdep splat:
WARNING: suspicious RCU usage
6.8.0-rc5-syzkaller-01618-gf8cbf6bde4c8 #0 Not tainted
net/ipv6/addrconf.c:5317 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
3 locks held by syz-executor.2/8834:
#0: ffff88802f554678 (nlk_cb_mutex-ROUTE){+.+.}-{3:3}, at: __netlink_dump_start+0x119/0x780 net/netlink/af_netlink.c:2338
#1: ffffffff8f377a88 (rtnl_mutex){+.+.}-{3:3}, at: netlink_dump+0x676/0xda0 net/netlink/af_netlink.c:2265
#2: ffff88807e5f0580 (&ndev->lock){++--}-{2:2}, at: in6_dump_addrs+0xb8/0x1de0 net/ipv6/addrconf.c:5279
stack backtrace:
CPU: 1 PID: 8834 Comm: syz-executor.2 Not tainted 6.8.0-rc5-syzkaller-01618-gf8cbf6bde4c8 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
lockdep_rcu_suspicious+0x220/0x340 kernel/locking/lockdep.c:6712
in6_dump_addrs+0x1b47/0x1de0 net/ipv6/addrconf.c:5317
inet6_dump_addr+0x1597/0x1690 net/ipv6/addrconf.c:5428
netlink_dump+0x6a6/0xda0 net/netlink/af_netlink.c:2266
__netlink_dump_start+0x59d/0x780 net/netlink/af_netlink.c:2374
netlink_dump_start include/linux/netlink.h:340 [inline]
rtnetlink_rcv_msg+0xcf7/0x10d0 net/core/rtnetlink.c:6555
netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2547
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8e0/0xcb0 net/netlink/af_netlink.c:1902
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
___sys_sendmsg net/socket.c:2638 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
Fixes: c3718936ec47 ("ipv6: anycast: complete RCU handling of struct ifacaddr6")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227222259.4081489-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
skb_defer_free_flush() is currently called from net_rx_action()
and napi_threaded_poll().
We should also call it from __napi_busy_loop() otherwise
there is the risk the percpu queue can grow until an IPI
is forced from skb_attempt_defer_free() adding a latency spike.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227210105.3815474-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
By moving some fields around, this patch shrinks
holes size from 56 to 32, saving 24 bytes on 64bit arches.
After the patch pahole gives the following for 'struct tcp_sock':
/* size: 2304, cachelines: 36, members: 162 */
/* sum members: 2234, holes: 6, sum holes: 32 */
/* sum bitfield members: 34 bits, bit holes: 5, sum bit holes: 14 bits */
/* padding: 32 */
/* paddings: 3, sum paddings: 10 */
/* forced alignments: 1, forced holes: 1, sum forced holes: 12 */
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227192721.3558982-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The i211 requires the same PTP timestamp adjustments as the i210,
according to its datasheet. To ensure consistent timestamping across
different platforms, this change extends the existing adjustments to
include the i211.
The adjustment result are tested and comparable for i210 and i211 based
systems.
Fixes: 3f544d2a4d5c ("igb: adjust PTP timestamps for Tx/Rx latency")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240227184942.362710-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If br_multicast_init_stats() fails, there is no need to set lockdep
classes. Just return from the error path.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20240227182338.2739884-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.
Remove the allocation in the bridge driver and leverage the network
core allocation.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20240227182338.2739884-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rename some test cases to avoid overlapping test names which is
problematic for the kernel test robot. No changes in the test's logic.
Suggested-by: Yujie Liu <yujie.liu@intel.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20240227170418.491442-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In the commit d73ef2d69c0d ("rtnetlink: let rtnl_bridge_setlink checks
IFLA_BRIDGE_MODE length"), an adjustment was made to the old loop logic
in the function `rtnl_bridge_setlink` to enable the loop to also check
the length of the IFLA_BRIDGE_MODE attribute. However, this adjustment
removed the `break` statement and led to an error logic of the flags
writing back at the end of this function.
if (have_flags)
memcpy(nla_data(attr), &flags, sizeof(flags));
// attr should point to IFLA_BRIDGE_FLAGS NLA !!!
Before the mentioned commit, the `attr` is granted to be IFLA_BRIDGE_FLAGS.
However, this is not necessarily true fow now as the updated loop will let
the attr point to the last NLA, even an invalid NLA which could cause
overflow writes.
This patch introduces a new variable `br_flag` to save the NLA pointer
that points to IFLA_BRIDGE_FLAGS and uses it to resolve the mentioned
error logic.
Fixes: d73ef2d69c0d ("rtnetlink: let rtnl_bridge_setlink checks IFLA_BRIDGE_MODE length")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20240227121128.608110-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since commit 446fda4f2682 ("[NetLabel]: CIPSOv4 engine"), *bitmap_walk
function only returns -1. Nearly 18 years have passed, -2 scenes never
come up, so there's no need to consider it.
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/20240227093604.3574241-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Eric Dumazet says:
====================
inet: implement lockless RTM_GETNETCONF ops
This series removes RTNL use for RTM_GETNETCONF operations on AF_INET.
- Annotate data-races to avoid possible KCSAN splats.
- "ip -4 netconf show dev XXX" can be implemented without RTNL [1]
- "ip -4 netconf" dumps can be implemented using RCU instead of RTNL [1]
[1] This only refers to RTM_GETNETCONF operation, "ip" command
also uses RTM_GETLINK dumps which are using RTNL at this moment.
====================
Link: https://lore.kernel.org/r/20240227092411.2315725-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
1) inet_netconf_dump_devconf() can run under RCU protection
instead of RTNL.
2) properly return 0 at the end of a dump, avoiding an
an extra recvmsg() system call.
3) Do not use inet_base_seq() anymore, for_each_netdev_dump()
has nice properties. Restarting a GETDEVCONF dump if a device has
been added/removed or if net->ipv4.dev_addr_genid has changed is moot.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
"ip -4 netconf show dev XXXX" no longer acquires RTNL.
Return -ENODEV instead of -EINVAL if no netdev or idev can be found.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add READ_ONCE() in ipv4_devconf_get() and corresponding
WRITE_ONCE() in ipv4_devconf_set()
Add IPV4_DEVCONF_RO() and IPV4_DEVCONF_ALL_RO() macros,
and use them when reading devconf fields.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240227092411.2315725-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit d1d77120bc28 ("net: phy: dp83826: support TX data voltage tuning")
introduced a regression in that WOL is not disabled by default for DP83826.
WOL should normally be enabled through ethtool.
Fixes: d1d77120bc28 ("net: phy: dp83826: support TX data voltage tuning")
Signed-off-by: Catalin Popescu <catalin.popescu@leica-geosystems.com>
Link: https://lore.kernel.org/r/20240226162339.696461-1-catalin.popescu@leica-geosystems.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240228030658.3512782-1-chengming.zhou@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Jakub Kicinski says:
====================
tools: ynl: stop using libmnl
There is no strong reason to stop using libmnl in ynl but there
are a few small ones which add up.
First (as I remembered immediately after hitting send on v1),
C++ compilers do not like the libmnl for_each macros.
I haven't tried it myself, but having all the code directly
in YNL makes it easier for folks porting to C++ to modify them
and/or make YNL more C++ friendly.
Second, we do much more advanced netlink level parsing in ynl
than libmnl so it's hard to say that libmnl abstracts much from us.
The fact that this series, removing the libmnl dependency, only
adds <300 LoC shows that code savings aren't huge.
OTOH when new types are added (e.g. auto-int) we need to add
compatibility to deal with older version of libmnl (in fact,
even tho patches have been sent months ago, auto-ints are still
not supported in libmnl.git).
Thrid, the dependency makes ynl less self contained, and harder
to vendor in. Whether vendoring libraries into projects is a good
idea is a separate discussion, nonetheless, people want to do it.
Fourth, there are small annoyances with the libmnl APIs which
are hard to fix in backward-compatible ways. See the last patch
for example.
All in all, libmnl is a great library, but with all the code
generation and structured parsing, ynl is better served by going
its own way.
v1: https://lore.kernel.org/all/20240222235614.180876-1-kuba@kernel.org/
====================
Link: https://lore.kernel.org/r/20240227223032.1835527-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
To stick to libmnl wrappers in the past we had to use poll()
to check if there are any outstanding notifications on the socket.
This is no longer necessary, we can use MSG_DONTWAIT.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-16-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We don't use libmnl any more.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-15-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Most libmnl socket helpers can be replaced by direct calls to
the underlying libc API. We need portid, the netlink manpage
suggests we bind() address of zero.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-14-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Create a local version of the MNL_CB_* parser control values.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-13-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
All YNL parsing callbacks take struct ynl_parse_arg as the argument.
Make that official by using a local callback type instead of mnl_cb_t.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There's only one set of callbacks in YNL, for netlink control
messages, and most of them are trivial. So implement the message
walking directly without depending on mnl_cb_run2().
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
ynl_recv_ack() is simple and it's the only user of mnl_cb_run().
Now that ynl_sock_read_msgs() exists it's actually less code
to use ynl_sock_read_msgs() instead of being special.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
All callers to mnl_cb_run2() call mnl_socket_recvfrom() right before.
Wrap the two in a helper, take typed arguments (struct ynl_parse_arg),
instead of hoping that all callers remember that parser error handling
requires yarg.
In case of ynl_sock_read_family() we will no longer check for kernel
returning no data, but that would be a kernel bug, not worth complicating
the code to catch this. Calling mnl_cb_run2() on an empty buffer
is legal and results in STOP (1).
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit f2ba1e5e2208 ("tools: ynl-gen: stop generating common notification handlers")
removed the last caller of the parse_cb_run() helper.
We no longer need to export ynl_cb_array.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
All YNL parsing code expects a pointer to struct ynl_parse_arg AKA yarg.
For dump was pass in struct ynl_dump_state, which works fine, because
struct ynl_dump_state and struct ynl_parse_arg have identical layout
for the members that matter.. but it's a bit hacky.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
libc doesn't have an ARRAY_SIZE() create one locally.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Create helpers for accessing payloads of struct nlmsg.
Use them instead of the libmnl ones.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Create ynl_attr_for_each*() iteration helpers.
Use them instead of the mnl ones.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Don't use mnl attr helpers, we're trying to remove the libmnl
dependency. Create both signed and unsigned helpers, libmnl
had unsigned helpers, so code generator no longer needs
the mnl_type() hack.
The new helpers are written from first principles, but are
hopefully not too buggy.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The temporary auto-int helpers are not really correct.
We can't treat signed and unsigned ints the same when
determining whether we need full 8B. I realized this
before sending the patch to add support in libmnl.
Unfortunately, that patch has not been merged,
so time to fix our local helpers. Use the mnl* name
for now, subsequent patches will address that.
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240227223032.1835527-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We never increment the group number iterator, so all groups
get recorded into index 0 of the mcast_groups[] array.
As a result YNL can only handle using the last group.
For example using the "netdev" sample on kernel with
page pool commands results in:
$ ./samples/netdev
YNL: Multicast group 'mgmt' not found
Most families have only one multicast group, so this hasn't
been noticed. Plus perhaps developers usually test the last
group which would have worked.
Fixes: 86878f14d71a ("tools: ynl: user space helpers")
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20240226214019.1255242-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since commit 7c59c9c8f202 ("tools: ynl: generate code for ovs families")
we need relatively recent OvS headers to get YNL to compile.
Add the direct include workaround to fix compilation on less
up-to-date OSes like CentOS 9.
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240226225806.1301152-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add test case for multicast packet confirm race.
Without preceding patch, this should result in:
WARNING: CPU: 0 PID: 38 at net/netfilter/nf_conntrack_core.c:1198 __nf_conntrack_confirm+0x3ed/0x5f0
Workqueue: events_unbound macvlan_process_broadcast
RIP: 0010:__nf_conntrack_confirm+0x3ed/0x5f0
? __nf_conntrack_confirm+0x3ed/0x5f0
nf_confirm+0x2ad/0x2d0
nf_hook_slow+0x36/0xd0
ip_local_deliver+0xce/0x110
__netif_receive_skb_one_core+0x4f/0x70
process_backlog+0x8c/0x130
[..]
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
conntrack nf_confirm logic cannot handle cloned skbs referencing
the same nf_conn entry, which will happen for multicast (broadcast)
frames on bridges.
Example:
macvlan0
|
br0
/ \
ethX ethY
ethX (or Y) receives a L2 multicast or broadcast packet containing
an IP packet, flow is not yet in conntrack table.
1. skb passes through bridge and fake-ip (br_netfilter)Prerouting.
-> skb->_nfct now references a unconfirmed entry
2. skb is broad/mcast packet. bridge now passes clones out on each bridge
interface.
3. skb gets passed up the stack.
4. In macvlan case, macvlan driver retains clone(s) of the mcast skb
and schedules a work queue to send them out on the lower devices.
The clone skb->_nfct is not a copy, it is the same entry as the
original skb. The macvlan rx handler then returns RX_HANDLER_PASS.
5. Normal conntrack hooks (in NF_INET_LOCAL_IN) confirm the orig skb.
The Macvlan broadcast worker and normal confirm path will race.
This race will not happen if step 2 already confirmed a clone. In that
case later steps perform skb_clone() with skb->_nfct already confirmed (in
hash table). This works fine.
But such confirmation won't happen when eb/ip/nftables rules dropped the
packets before they reached the nf_confirm step in postrouting.
Pablo points out that nf_conntrack_bridge doesn't allow use of stateful
nat, so we can safely discard the nf_conn entry and let inet call
conntrack again.
This doesn't work for bridge netfilter: skb could have a nat
transformation. Also bridge nf prevents re-invocation of inet prerouting
via 'sabotage_in' hook.
Work around this problem by explicit confirmation of the entry at LOCAL_IN
time, before upper layer has a chance to clone the unconfirmed entry.
The downside is that this disables NAT and conntrack helpers.
Alternative fix would be to add locking to all code parts that deal with
unconfirmed packets, but even if that could be done in a sane way this
opens up other problems, for example:
-m physdev --physdev-out eth0 -j SNAT --snat-to 1.2.3.4
-m physdev --physdev-out eth1 -j SNAT --snat-to 1.2.3.5
For multicast case, only one of such conflicting mappings will be
created, conntrack only handles 1:1 NAT mappings.
Users should set create a setup that explicitly marks such traffic
NOTRACK (conntrack bypass) to avoid this, but we cannot auto-bypass
them, ruleset might have accept rules for untracked traffic already,
so user-visible behaviour would change.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217777
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Commit d0009effa886 ("netfilter: nf_tables: validate NFPROTO_* family") added
some validation of NFPROTO_* families in the nft_compat module, but it broke
the ability to use legacy iptables modules in dual-stack nftables.
While with legacy iptables one had to independently manage IPv4 and IPv6
tables, with nftables it is possible to have dual-stack tables sharing the
rules. Moreover, it was possible to use rules based on legacy iptables
match/target modules in dual-stack nftables.
As an example, the program from [2] creates an INET dual-stack family table
using an xt_bpf based rule, which looks like the following (the actual output
was generated with a patched nft tool as the current nft tool does not parse
dual stack tables with legacy match rules, so consider it for illustrative
purposes only):
table inet testfw {
chain input {
type filter hook prerouting priority filter; policy accept;
bytecode counter packets 0 bytes 0 accept
}
}
After d0009effa886 ("netfilter: nf_tables: validate NFPROTO_* family") we get
EOPNOTSUPP for the above program.
Fix this by allowing NFPROTO_INET for nft_(match/target)_validate(), but also
restrict the functions to classic iptables hooks.
Changes in v3:
* clarify that upstream nft will not display such configuration properly and
that the output was generated with a patched nft tool
* remove example program from commit description and link to it instead
* no code changes otherwise
Changes in v2:
* restrict nft_(match/target)_validate() to classic iptables hooks
* rewrite example program to use unmodified libnftnl
Fixes: d0009effa886 ("netfilter: nf_tables: validate NFPROTO_* family")
Link: https://lore.kernel.org/all/Zc1PfoWN38UuFJRI@calendula/T/#mc947262582c90fec044c7a3398cc92fac7afea72 [1]
Link: https://lore.kernel.org/all/20240220145509.53357-1-ignat@cloudflare.com/ [2]
Reported-by: Jordan Griege <jgriege@cloudflare.com>
Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Puranjay Mohan says:
====================
bpf, arm64: use BPF prog pack allocator in BPF JIT
Changes in V8 => V9:
V8: https://lore.kernel.org/bpf/20240221145106.105995-1-puranjay12@gmail.com/
1. Rebased on bpf-next/master
2. Added Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Changes in V7 => V8:
V7: https://lore.kernel.org/bpf/20240125133159.85086-1-puranjay12@gmail.com/
1. Rebase on bpf-next/master
2. Fix __text_poke() by removing usage of 'ret' that was never set.
Changes in V6 => V7:
V6: https://lore.kernel.org/all/20240124164917.119997-1-puranjay12@gmail.com/
1. Rebase on bpf-next/master.
Changes in V5 => V6:
V5: https://lore.kernel.org/all/20230908144320.2474-1-puranjay12@gmail.com/
1. Implement a text poke api to reduce code repeatition.
2. Use flush_icache_range() in place of caches_clean_inval_pou() in the
functions that modify code.
3. Optimize the bpf_jit_free() by not copying the all instructions on
the rw image to the ro_image
Changes in V4 => v5:
1. Remove the patch for making prog pack allocator portable as it will come
through the RISCV tree[1].
2. Add a new function aarch64_insn_set() to be used in
bpf_arch_text_invalidate() for putting illegal instructions after a
program is removed. The earlier implementation of bpf_arch_text_invalidate()
was calling aarch64_insn_patch_text_nosync() in a loop and making it slow
because each call invalidated the cache.
Here is test_tag now:
[root@ip-172-31-6-176 bpf]# time ./test_tag
test_tag: OK (40945 tests)
real 0m19.695s
user 0m1.514s
sys 0m17.841s
test_tag without these patches:
[root@ip-172-31-6-176 bpf]# time ./test_tag
test_tag: OK (40945 tests)
real 0m21.487s
user 0m1.647s
sys 0m19.106s
test_tag in the previous version was really slow > 2 minutes. see [2]
3. Add cache invalidation in aarch64_insn_copy() so other users can call the
function without worrying about the cache. Currently only bpf_arch_text_copy()
is using it, but there might be more users in the future.
Chanes in V3 => V4: Changes only in 3rd patch
1. Fix the I-cache maintenance: Clean the data cache and invalidate the i-Cache
only *after* the instructions have been copied to the ROX region.
Chanes in V2 => V3: Changes only in 3rd patch
1. Set prog = orig_prog; in the failure path of bpf_jit_binary_pack_finalize()
call.
2. Add comments explaining the usage of the offsets in the exception table.
Changes in v1 => v2:
1. Make the naming consistent in the 3rd patch:
ro_image and image
ro_header and header
ro_image_ptr and image_ptr
2. Use names dst/src in place of addr/opcode in second patch.
3. Add Acked-by: Song Liu <song@kernel.org> in 1st and 2nd patch.
BPF programs currently consume a page each on ARM64. For systems with many BPF
programs, this adds significant pressure to instruction TLB. High iTLB pressure
usually causes slow down for the whole system.
Song Liu introduced the BPF prog pack allocator[3] to mitigate the above issue.
It packs multiple BPF programs into a single huge page. It is currently only
enabled for the x86_64 BPF JIT.
This patch series enables the BPF prog pack allocator for the ARM64 BPF JIT.
====================================================
Performance Analysis of prog pack allocator on ARM64
====================================================
To test the performance of the BPF prog pack allocator on ARM64, a stresser
tool[4] was built. This tool loads 8 BPF programs on the system and triggers
5 of them in an infinite loop by doing system calls.
The runner script starts 20 instances of the above which loads 8*20=160 BPF
programs on the system, 5*20=100 of which are being constantly triggered.
In the above environment we try to build Python-3.8.4 and try to find different
iTLB metrics for the compilation done by gcc-12.2.0.
The source code[5] is configured with the following command:
./configure --enable-optimizations --with-ensurepip=install
Then the runner script is executed with the following command:
./run.sh "perf stat -e ITLB_WALK,L1I_TLB,INST_RETIRED,iTLB-load-misses -a make -j32"
This builds Python while 160 BPF programs are loaded and 100 are being constantly
triggered and measures iTLB related metrics.
The output of the above command is discussed below before and after enabling the
BPF prog pack allocator.
The tests were run on qemu-system-aarch64 with 32 cpus, 4G memory, -machine virt,
-cpu host, and -enable-kvm.
Results
-------
Before enabling prog pack allocator:
------------------------------------
Performance counter stats for 'system wide':
333278635 ITLB_WALK
6762692976558 L1I_TLB
25359571423901 INST_RETIRED
15824054789 iTLB-load-misses
189.029769053 seconds time elapsed
After enabling prog pack allocator:
-----------------------------------
Performance counter stats for 'system wide':
190333544 ITLB_WALK
6712712386528 L1I_TLB
25278233304411 INST_RETIRED
5716757866 iTLB-load-misses
185.392650561 seconds time elapsed
Improvements in metrics
-----------------------
Compilation time ---> 1.92% faster
iTLB-load-misses/Sec (Less is better) ---> 63.16% decrease
ITLB_WALK/1000 INST_RETIRED (Less is better) ---> 42.71% decrease
ITLB_Walk/L1I_TLB (Less is better) ---> 42.47% decrease
[1] https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/commit/?h=for-next&id=20e490adea279d49d57b800475938f5b67926d98
[2] https://lore.kernel.org/all/CANk7y0gcP3dF2mESLp5JN1+9iDfgtiWRFGqLkCgZD6wby1kQOw@mail.gmail.com/
[3] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/
[4] https://github.com/puranjaymohan/BPF-Allocator-Bench
[5] https://www.python.org/ftp/python/3.8.4/Python-3.8.4.tgz
====================
Link: https://lore.kernel.org/r/20240228141824.119877-1-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Use bpf_jit_binary_pack_alloc for memory management of JIT binaries in
ARM64 BPF JIT. The bpf_jit_binary_pack_alloc creates a pair of RW and RX
buffers. The JIT writes the program into the RW buffer. When the JIT is
done, the program is copied to the final RX buffer
with bpf_jit_binary_pack_finalize.
Implement bpf_arch_text_copy() and bpf_arch_text_invalidate() for ARM64
JIT as these functions are required by bpf_jit_binary_pack allocator.
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240228141824.119877-3-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The text_poke API is used to implement functions like memcpy() and
memset() for instruction memory (RO+X). The implementation is similar to
the x86 version.
This will be used by the BPF JIT to write and modify BPF programs. There
could be more users of this in the future.
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240228141824.119877-2-puranjay12@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Revert a recent EC driver change that introduced an unexpected and
undesirable user-visible difference in behavior (Rafael Wysocki)"
* tag 'acpi-6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "ACPI: EC: Use a spin lock without disabing interrupts"
|