summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-04-27SUNRPC: Be even lazier about releasing pagesChuck Lever
A single RPC transaction that touches only a couple of pages means rq_pvec will not be even close to full in svc_xpt_release(). This is a common case. Instead, just leave the pages in rq_pvec until it is completely full. This improves the efficiency of the batch release mechanism on workloads that involve small RPC messages. The rq_pvec is also fully emptied just before thread exit. Reviewed-by: Calum Mackay <calum.mackay@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26SUNRPC: Convert svc_xprt_release() to the release_pages() APIChuck Lever
Instead of invoking put_page() one-at-a-time, pass the "response" portion of rq_pages directly to release_pages() to reduce the number of times each nfsd thread invokes a page allocator API. Since svc_xprt_release() is not invoked while a client is waiting for an RPC Reply, this is not expected to directly impact mean request latencies on a lightly or moderately loaded server. However as workload intensity increases, I expect somewhat better scalability: the same number of server threads should be able to handle more work. Reviewed-by: Calum Mackay <calum.mackay@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26SUNRPC: Relocate svc_free_res_pages()Chuck Lever
Clean-up: There doesn't seem to be a reason why this function is stuck in a header. One thing it prevents is the convenient addition of tracing. Moving it to a source file also makes the rq_respages clean-up logic easier to find. Reviewed-by: Calum Mackay <calum.mackay@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26nfsd: simplify the delayed disposal list codeJeff Layton
When queueing a dispose list to the appropriate "freeme" lists, it pointlessly queues the objects one at a time to an intermediate list. Remove a few helpers and just open code a list_move to make it more clear and efficient. Better document the resulting functions with kerneldoc comments. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26SUNRPC: Ignore return value of ->xpo_sendtoChuck Lever
Clean up: All callers of svc_process() ignore its return value, so svc_process() can safely be converted to return void. Ditto for svc_send(). The return value of ->xpo_sendto() is now used only as part of a trace event. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26SUNRPC: Ensure server-side sockets have a sock->fileChuck Lever
The TLS handshake upcall mechanism requires a non-NULL sock->file on the socket it hands to user space. svc_sock_free() already releases sock->file properly if one exists. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()Chuck Lever
There have been several bugs over the years where the NFSD splice actor has attempted to write outside the rq_pages array. This is a "should never happen" condition, but if for some reason the pipe splice actor should attempt to walk past the end of rq_pages, it needs to terminate the READ operation to prevent corruption of the pointer addresses in the fields just beyond the array. A server crash is thus prevented. Since the code is not behaving, the READ operation returns -EIO to the client. None of the READ payload data can be trusted if the splice actor isn't operating as expected. Suggested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2023-04-26sunrpc: simplify two-level sysctl registration for svcrdma_parm_tableLuis Chamberlain
There is no need to declare two tables to just create directories, this can be easily be done with a prefix path with register_sysctl(). Simplify this registration. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26SUNRPC: return proper error from get_expiry()NeilBrown
The get_expiry() function currently returns a timestamp, and uses the special return value of 0 to indicate an error. Unfortunately this causes a problem when 0 is the correct return value. On a system with no RTC it is possible that the boot time will be seen to be "3". When exportfs probes to see if a particular filesystem supports NFS export it tries to cache information with an expiry time of "3". The intention is for this to be "long in the past". Even with no RTC it will not be far in the future (at most a second or two) so this is harmless. But if the boot time happens to have been calculated to be "3", then get_expiry will fail incorrectly as it converts the number to "seconds since bootime" - 0. To avoid this problem we change get_expiry() to report the error quite separately from the expiry time. The error is now the return value. The expiry time is reported through a by-reference parameter. Reported-by: Jerry Zhang <jerry@skydio.com> Tested-by: Jerry Zhang <jerry@skydio.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26lockd: add some client-side tracepointsJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26nfs: move nfs_fhandle_hash to common include fileJeff Layton
lockd needs to be able to hash filehandles for tracepoints. Move the nfs_fhandle_hash() helper to a common nfs include file. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26lockd: server should unlock lock if client rejects the grantJeff Layton
Currently lockd just dequeues the block and ignores it if the client sends a GRANT_RES with a status of nlm_lck_denied. That status is an indicator that the client has rejected the lock, so the right thing to do is to unlock the lock we were trying to grant. Reported-by: Yongcheng Yang <yoyang@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26lockd: fix races in client GRANTED_MSG wait logicJeff Layton
After the wait for a grant is done (for whatever reason), nlmclnt_block updates the status of the nlm_rqst with the status of the block. At the point it does this, however, the block is still queued its status could change at any time. This is particularly a problem when the waiting task is signaled during the wait. We can end up giving up on the lock just before the GRANTED_MSG callback comes in, and accept it even though the lock request gets back an error, leaving a dangling lock on the server. Since the nlm_wait never lives beyond the end of nlmclnt_lock, put it on the stack and add functions to allow us to enqueue and dequeue the block. Enqueue it just before the lock/wait loop, and dequeue it just after we exit the loop instead of waiting until the end of the function. Also, scrape the status at the time that we dequeue it to ensure that it's final. Reported-by: Yongcheng Yang <yoyang@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26lockd: move struct nlm_wait to lockd.hJeff Layton
The next patch needs struct nlm_wait in fs/lockd/clntproc.c, so move the definition to a shared header file. As an added clean-up, drop the unused b_reclaim field. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26lockd: remove 2 unused helper functionsJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26lockd: purge resources held on behalf of nlm clients when shutting downJeff Layton
It's easily possible for the server to have an outstanding lock when we go to shut down. When that happens, we often get a warning like this in the kernel log: lockd: couldn't shutdown host module for net f0000000! This is because the shutdown procedures skip removing any hosts that still have outstanding resources (locks). Eventually, things seem to get cleaned up anyway, but the log message is unsettling, and server shutdown doesn't seem to be working the way it was intended. Ensure that we tear down any resources held on behalf of a client when tearing one down for server shutdown. Reported-by: Yongcheng Yang <yoyang@redhat.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26NFSD: Convert filecache to rhltableChuck Lever
While we were converting the nfs4_file hashtable to use the kernel's resizable hashtable data structure, Neil Brown observed that the list variant (rhltable) would be better for managing nfsd_file items as well. The nfsd_file hash table will contain multiple entries for the same inode -- these should be kept together on a list. And, it could be possible for exotic or malicious client behavior to cause the hash table to resize itself on every insertion. A nice simplification is that rhltable_lookup() can return a list that contains only nfsd_file items that match a given inode, which enables us to eliminate specialized hash table helper functions and use the default functions provided by the rhashtable implementation). Since we are now storing nfsd_file items for the same inode on a single list, that effectively reduces the number of hash entries that have to be tracked in the hash table. The mininum bucket count is therefore lowered. Light testing with fstests generic/531 show no regressions. Suggested-by: Neil Brown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26nfsd: allow reaping files still under writebackJeff Layton
On most filesystems, there is no reason to delay reaping an nfsd_file just because its underlying inode is still under writeback. nfsd just relies on client activity or the local flusher threads to do writeback. The main exception is NFS, which flushes all of its dirty data on last close. Add a new EXPORT_OP_FLUSH_ON_CLOSE flag to allow filesystems to signal that they do this, and only skip closing files under writeback on such filesystems. Also, remove a redundant NULL file pointer check in nfsd_file_check_writeback, and clean up nfs's export op flag definitions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26nfsd: update comment over __nfsd_file_cache_purgeJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26nfsd: don't take/put an extra reference when putting a fileJeff Layton
The last thing that filp_close does is an fput, so don't bother taking and putting the extra reference. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26nfsd: add some comments to nfsd_file_do_acquireJeff Layton
David Howells mentioned that he found this bit of code confusing, so sprinkle in some comments to clarify. Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26nfsd: don't kill nfsd_files because of lease break errorJeff Layton
An error from break_lease is non-fatal, so we needn't destroy the nfsd_file in that case. Just put the reference like we normally would and return the error. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26nfsd: simplify test_bit return in NFSD_FILE_KEY_FULL comparatorJeff Layton
test_bit returns bool, so we can just compare the result of that to the key->gc value without the "!!". Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26nfsd: NFSD_FILE_KEY_INODE only needs to find GC'ed entriesJeff Layton
Since v4 files are expected to be long-lived, there's little value in closing them out of the cache when there is conflicting access. Change the comparator to also match the gc value in the key. Change both of the current users of that key to set the gc value in the key to "true". Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26nfsd: don't open-code clear_and_wake_up_bitJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26net: phy: hide the PHYLIB_LEDS knobPaolo Abeni
commit 4bb7aac70b5d ("net: phy: fix circular LEDS_CLASS dependencies") solved a build failure, but introduces a new config knob with a default 'y' value: PHYLIB_LEDS. The latter is against the current new config policy. The exception was raised to allow the user to catch bad configurations without led support. Anyway the current definition of PHYLIB_LEDS does not fit the above goal: if LEDS_CLASS is disabled, the new config will be available only with PHYLIB disabled, too. Hide the mentioned config, to preserve the randconfig testing done so far, while respecting the mentioned policy. Suggested-by: Andrew Lunn <andrew@lunn.ch> Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/d82489be8ed911c383c3447e9abf469995ccf39a.1682496488.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
No conflicts. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-25net: phy: marvell-88x2222: remove unnecessary (void*) conversionswuych
Pointer variables of void * type do not require type cast. Signed-off-by: wuych <yunchuan@nfschina.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-25tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.Kuniyuki Iwashima
syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY skbs. We can reproduce the problem with these sequences: sk = socket(AF_INET, SOCK_DGRAM, 0) sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE) sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1) sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53)) sk.close() sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct ubuf_info_msgzc indirectly holds a refcnt of the socket. When the skb is sent, __skb_tstamp_tx() clones it and puts the clone into the socket's error queue with the TX timestamp. When the original skb is received locally, skb_copy_ubufs() calls skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt. This additional count is decremented while freeing the skb, but struct ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is not called. The last refcnt is not released unless we retrieve the TX timestamped skb by recvmsg(). Since we clear the error queue in inet_sock_destruct() after the socket's refcnt reaches 0, there is a circular dependency. If we close() the socket holding such skbs, we never call sock_put() and leak the count, sk, and skb. TCP has the same problem, and commit e0c8bccd40fc ("net: stream: purge sk_error_queue in sk_stream_kill_queues()") tried to fix it by calling skb_queue_purge() during close(). However, there is a small chance that skb queued in a qdisc or device could be put into the error queue after the skb_queue_purge() call. In __skb_tstamp_tx(), the cloned skb should not have a reference to the ubuf to remove the circular dependency, but skb_clone() does not call skb_copy_ubufs() for zerocopy skb. So, we need to call skb_orphan_frags_rx() for the cloned skb to call skb_copy_ubufs(). [0]: BUG: memory leak unreferenced object 0xffff88800c6d2d00 (size 1152): comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00 ................ 02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............ backtrace: [<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024 [<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083 [<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline] [<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245 [<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515 [<00000000b9b11231>] sock_create net/socket.c:1566 [inline] [<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline] [<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline] [<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636 [<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline] [<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline] [<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647 [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd BUG: memory leak unreferenced object 0xffff888017633a00 (size 240): comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff .........-m..... backtrace: [<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497 [<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline] [<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596 [<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline] [<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370 [<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037 [<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652 [<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253 [<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819 [<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline] [<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734 [<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117 [<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline] [<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline] [<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125 [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY") Fixes: b5947e5d1e71 ("udp: msg_zerocopy") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-25net: amd: Fix link leak when verifying config failedGencen Gan
After failing to verify configuration, it returns directly without releasing link, which may cause memory leak. Paolo Abeni thinks that the whole code of this driver is quite "suboptimal" and looks unmainatained since at least ~15y, so he suggests that we could simply remove the whole driver, please take it into consideration. Simon Horman suggests that the fix label should be set to "Linux-2.6.12-rc2" considering that the problem has existed since the driver was introduced and the commit above doesn't seem to exist in net/net-next. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Gan Gecen <gangecen@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-24net: phy: marvell: Fix inconsistent indenting in led_blink_setChristian Marangi
Fix inconsistent indeinting in m88e1318_led_blink_set reported by kernel test robot, probably done by the presence of an if condition dropped in later revision of the same code. Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202304240007.0VEX8QYG-lkp@intel.com/ Fixes: ea9e86485dec ("net: phy: marvell: Implement led_blink_set()") Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230423172800.3470-1-ansuelsmth@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24lan966x: Don't use xdp_frame when action is XDP_TXHoratiu Vultur
When the action of an xdp program was XDP_TX, lan966x was creating a xdp_frame and use this one to send the frame back. But it is also possible to send back the frame without needing a xdp_frame, because it is possible to send it back using the page. And then once the frame is transmitted is possible to use directly page_pool_recycle_direct as lan966x is using page pools. This would save some CPU usage on this path, which results in higher number of transmitted frames. Bellow are the statistics: Frame size: Improvement: 64 ~8% 256 ~11% 512 ~8% 1000 ~0% 1500 ~0% Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/r/20230422142344.3630602-1-horatiu.vultur@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-04-24 We've added 5 non-merge commits during the last 3 day(s) which contain a total of 7 files changed, 87 insertions(+), 44 deletions(-). The main changes are: 1) Workaround for bpf iter selftest due to lack of subprog support in precision tracking, from Andrii. 2) Disable bpf_refcount_acquire kfunc until races are fixed, from Dave. 3) One more test_verifier test converted from asm macro to asm in C, from Eduard. 4) Fix build with NETFILTER=y INET=n config, from Florian. 5) Add __rcu_read_{lock,unlock} into deny list, from Yafang. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: avoid mark_all_scalars_precise() trigger in one of iter tests bpf: Add __rcu_read_{lock,unlock} into btf id deny list bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed selftests/bpf: verifier/prevent_map_lookup converted to inline assembly bpf: fix link failure with NETFILTER=y INET=n ==================== Link: https://lore.kernel.org/r/20230425005648.86714-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24Merge branch 'tsnep-xdp-socket-zero-copy-support'Jakub Kicinski
Gerhard Engleder says: ==================== tsnep: XDP socket zero-copy support Implement XDP socket zero-copy support for tsnep driver. I tried to follow existing drivers like igc as far as possible. But one main difference is that tsnep does not need any reconfiguration for XDP BPF program setup. So I decided to keep this behavior no matter if a XSK pool is used or not. As a result, tsnep starts using the XSK pool even if no XDP BPF program is available. Another difference is that I tried to prevent potentially failing allocations during XSK pool setup. E.g. both memory models for page pool and XSK pool are registered all the time. Thus, XSK pool setup cannot end up with not working queues. Some prework is done to reduce the last two XSK commits to actual XSK changes. ==================== Link: https://lore.kernel.org/r/20230421194656.48063-1-gerhard@engleder-embedded.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24tsnep: Add XDP socket zero-copy TX supportGerhard Engleder
Send and complete XSK pool frames within TX NAPI context. NAPI context is triggered by ndo_xsk_wakeup. Test results with A53 1.2GHz: xdpsock txonly copy mode, 64 byte frames: pps pkts 1.00 tx 284,409 11,398,144 Two CPUs with 100% and 10% utilization. xdpsock txonly zero-copy mode, 64 byte frames: pps pkts 1.00 tx 511,929 5,890,368 Two CPUs with 100% and 1% utilization. xdpsock l2fwd copy mode, 64 byte frames: pps pkts 1.00 rx 248,985 7,315,885 tx 248,921 7,315,885 Two CPUs with 100% and 10% utilization. xdpsock l2fwd zero-copy mode, 64 byte frames: pps pkts 1.00 rx 254,735 3,039,456 tx 254,735 3,039,456 Two CPUs with 100% and 4% utilization. Packet rate increases and CPU utilization is reduced in both cases. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24tsnep: Add XDP socket zero-copy RX supportGerhard Engleder
Add support for XSK zero-copy to RX path. The setup of the XSK pool can be done at runtime. If the netdev is running, then the queue must be disabled and enabled during reconfiguration. This can be done easily with functions introduced in previous commits. A more important property is that, if the netdev is running, then the setup of the XSK pool shall not stop the netdev in case of errors. A broken netdev after a failed XSK pool setup is bad behavior. Therefore, the allocation and setup of resources during XSK pool setup is done only before any queue is disabled. Additionally, freeing and later allocation of resources is eliminated in some cases. Page pool entries are kept for later use. Two memory models are registered in parallel. As a result, the XSK pool setup cannot fail during queue reconfiguration. In contrast to other drivers, XSK pool setup and XDP BPF program setup are separate actions. XSK pool setup can be done without any XDP BPF program. The XDP BPF program can be added, removed or changed without any reconfiguration of the XSK pool. Test results with A53 1.2GHz: xdpsock rxdrop copy mode, 64 byte frames: pps pkts 1.00 rx 856,054 10,625,775 Two CPUs with both 100% utilization. xdpsock rxdrop zero-copy mode, 64 byte frames: pps pkts 1.00 rx 889,388 4,615,284 Two CPUs with 100% and 20% utilization. Packet rate increases and CPU utilization is reduced. 100% CPU load seems to the base load. This load is consumed by ksoftirqd just for dropping the generated packets without xdpsock running. Using batch API reduced CPU utilization slightly, but measurements are not stable enough to provide meaningful numbers. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24tsnep: Move skb receive action to separate functionGerhard Engleder
The function tsnep_rx_poll() is already pretty long and the skb receive action can be reused for XSK zero-copy support. Move page based skb receive to separate function. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24tsnep: Add functions for queue enable/disableGerhard Engleder
Move queue enable and disable code to separate functions. This way the activation and deactivation of the queues are defined actions, which can be used in future execution paths. This functions will be used for the queue reconfiguration at runtime, which is necessary for XSK zero-copy support. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24tsnep: Rework TX/RX queue initializationGerhard Engleder
Make initialization of TX and RX queues less dynamic by moving some initialization from netdev open/close to device probing. Additionally, move some initialization code to separate functions to enable future use in other execution paths. This is done as preparation for queue reconfigure at runtime, which is necessary for XSK zero-copy support. Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24tsnep: Replace modulo operation with maskGerhard Engleder
TX/RX ring size is static and power of 2 to enable compiler to optimize modulo operation to mask operation. Make this optimization already in the code and don't rely on the compiler. CPU utilisation during high packet rate has not changed. So no performance improvement has been measured. But it is best practice to prevent modulo operations. Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24net: phy: dp83867: Add led_brightness_set supportAlexander Stein
Up to 4 LEDs can be attached to the PHY, add support for setting brightness manually. Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230424134625.303957-1-alexander.stein@ew.tq-group.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24net: phy: Fix reading LED reg propertyAlexander Stein
'reg' is always encoded in 32 bits, thus it has to be read using the function with the corresponding bit width. Fixes: 01e5b728e9e4 ("net: phy: Add a binding for PHY LEDs") Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20230424141648.317944-1-alexander.stein@ew.tq-group.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24drivers: nfc: nfcsim: remove return value check of `dev_dir`Jianuo Kuang
Smatch complains that: nfcsim_debugfs_init_dev() warn: 'dev_dir' is an error pointer or valid According to the documentation of the debugfs_create_dir() function, there is no need to check the return value of this function. Just delete the dead code. Signed-off-by: Jianuo Kuang <u202110722@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20230424024140.34607-1-u202110722@hust.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24net: phy: dp83867: Remove unnecessary (void*) conversionswuych
Pointer variables of void * type do not require type cast. Signed-off-by: wuych <yunchuan@nfschina.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230424101550.664319-1-yunchuan@nfschina.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24net: ethtool: coalesce: try to make user settings stick twiceJakub Kicinski
SET_COALESCE may change operation mode and parameters in one call. Changing operation mode may cause the driver to reset the parameter values to what is a reasonable default for new operation mode. Since driver does not know which parameters come from user and which are echoed back from ->get, driver may ignore the parameters when switching operation modes. This used to be inevitable for ioctl() but in netlink we know which parameters are actually specified by the user. We could inform which parameters were set by the user but this would lead to a lot of code duplication in the drivers. Instead try to call the drivers twice if both mode and params are changed. The set method already checks if any params need updating so in case the driver did the right thing the first time around - there will be no second call to it's ->set method (only an extra call to ->get()). For mlx5 for example before this patch we'd see: # ethtool -C eth0 adaptive-rx on adaptive-tx on # ethtool -C eth0 adaptive-rx off adaptive-tx off \ tx-usecs 123 rx-usecs 123 Adaptive RX: off TX: off rx-usecs: 3 rx-frames: 32 tx-usecs: 16 tx-frames: 32 [...] After the change: # ethtool -C eth0 adaptive-rx on adaptive-tx on # ethtool -C eth0 adaptive-rx off adaptive-tx off \ tx-usecs 123 rx-usecs 123 Adaptive RX: off TX: off rx-usecs: 123 rx-frames: 32 tx-usecs: 123 tx-frames: 32 [...] This only works for netlink, so it's a small discrepancy between netlink and ioctl(). Since we anticipate most users to move to netlink I believe it's worth making their lives easier. Link: https://lore.kernel.org/r/20230420233302.944382-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24Merge branch 'update-coding-style-and-check-alloc_frag'Jakub Kicinski
Haiyang Zhang says: ==================== Update coding style and check alloc_frag Follow up patches for the jumbo frame support. As suggested by Jakub Kicinski, update coding style, and check napi_alloc_frag for possible fallback to single pages. ==================== Link: https://lore.kernel.org/r/1682096818-30056-1-git-send-email-haiyangz@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24net: mana: Check if netdev/napi_alloc_frag returns single pageHaiyang Zhang
netdev/napi_alloc_frag() may fall back to single page which is smaller than the requested size. Add error checking to avoid memory overwritten. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24net: mana: Rename mana_refill_rxoob and remove some empty linesHaiyang Zhang
Rename mana_refill_rxoob for naming consistency. And remove some empty lines between function call and error checking. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24Merge branch 'add-page_pool-support-for-page-recycling-in-veth-driver'Jakub Kicinski
Lorenzo Bianconi says: ==================== add page_pool support for page recycling in veth driver Introduce page_pool support in veth driver in order to recycle pages in veth_convert_skb_to_xdp_buff routine and avoid reallocating the skb through the page allocator when we run a xdp program on the device and we receive skbs from the stack. ==================== Link: https://lore.kernel.org/r/cover.1682188837.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-24net: veth: add page_pool statsLorenzo Bianconi
Introduce page_pool stats support to report info about local page_pool through ethtool Tested-by: Maryam Tahhan <mtahhan@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>