path: root/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
Age | Commit message | Author
2023-06-18 | svcrdma: Fix stale comment | Chuck Lever
Commit 7d81ee8722d6 ("svcrdma: Single-stage RDMA Read") changed the behavior of svc_rdma_recvfrom() but neglected to update the documenting comment. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 | svcrdma: Prevent page release when nothing was received | Chuck Lever
I noticed that svc_rqst_release_pages() was still unnecessarily releasing a page when svc_rdma_recvfrom() returns zero. Fixes: a53d5cb0646a ("svcrdma: Avoid releasing a page in svc_xprt_release()") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 | svcrdma: Clean up allocation of svc_rdma_recv_ctxt | Chuck Lever
The physical device's favored NUMA node ID is available when allocating a recv_ctxt. Use that value instead of relying on the assumption that the memory allocation happens to be running on a node close to the device. This cleanup eliminates the hack of destroying recv_ctxts that were not created by the receive CQ thread -- recv_ctxts are now always allocated on a "good" node. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-14 | SUNRPC: always free ctxt when freeing deferred request | NeilBrown
Since the ->xprt_ctxt pointer was added to svc_deferred_req, it has not been sufficient to use kfree() to free a deferred request. We may need to free the ctxt as well. As freeing the ctxt is all that ->xpo_release_rqst() does, we repurpose it to explicitly do that even when the ctxt is not stored in an rqst. So we now have ->xpo_release_ctxt() which is given an xprt and a ctxt, which may have been taken either from an rqst or from a dreq. The caller is now responsible for clearing that pointer after the call to ->xpo_release_ctxt. We also clear dr->xprt_ctxt when the ctxt is moved into a new rqst when revisiting a deferred request. This ensures there is only one pointer to the ctxt, so the risk of double freeing in future is reduced. The new code in svc_xprt_release which releases both the ctxt and any rq_deferred depends on this. Fixes: 773f91b2cf3f ("SUNRPC: Fix NFSD's request deferral on RDMA transports") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
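The ownership rule described above (one release hook, and the caller clears its own pointer afterwards) can be sketched in userspace C. Every name prefixed with sketch_ is a hypothetical stand-in, not the kernel's definition, and the structures are reduced to the one field that matters here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for svc_xprt, svc_rqst and svc_deferred_req. */
struct sketch_xprt { int released; };          /* counts ctxts freed */
struct sketch_rqst { void *rq_xprt_ctxt; };
struct sketch_dreq { void *xprt_ctxt; };

/* Models ->xpo_release_ctxt(): frees a ctxt regardless of where it was stored. */
static void sketch_release_ctxt(struct sketch_xprt *xprt, void *ctxt)
{
	assert(xprt != NULL);
	if (!ctxt)
		return;
	free(ctxt);
	xprt->released++;
}

/* The caller, not the hook, clears the pointer after the release. */
static void sketch_release_rqst(struct sketch_xprt *xprt, struct sketch_rqst *rqstp)
{
	sketch_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
	rqstp->rq_xprt_ctxt = NULL;
}

/* Freeing a deferred request also releases any ctxt it still owns. */
static void sketch_free_deferred(struct sketch_xprt *xprt, struct sketch_dreq *dr)
{
	sketch_release_ctxt(xprt, dr->xprt_ctxt);
	dr->xprt_ctxt = NULL;
	free(dr);
}

/* Revisiting moves the ctxt into the rqst, leaving exactly one owner. */
static void sketch_revisit(struct sketch_rqst *rqstp, struct sketch_dreq *dr)
{
	rqstp->rq_xprt_ctxt = dr->xprt_ctxt;
	dr->xprt_ctxt = NULL;
}
```

Because sketch_revisit() clears the deferred request's pointer, a later sketch_free_deferred() cannot free the ctxt a second time.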
2023-02-20 | SUNRPC: Remove ->xpo_secure_port() | Chuck Lever
There's no need for the cost of this extra virtual function call during every RPC transaction: the RQ_SECURE bit can be set properly in ->xpo_recvfrom() instead. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-05-19 | SUNRPC: Remove svc_rqst::rq_xprt_hlen | Chuck Lever
Clean up: This field is now always set to zero. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-04-06 | SUNRPC: Fix NFSD's request deferral on RDMA transports | Chuck Lever
Trond Myklebust reports an NFSD crash in svc_rdma_sendto(). Further investigation shows that the crash occurred while NFSD was handling a deferred request. This patch addresses two inter-related issues that prevent request deferral from working correctly for RPC/RDMA requests:
1. Prevent the crash by ensuring that the original svc_rqst::rq_xprt_ctxt value is available when the request is revisited. Otherwise svc_rdma_sendto() does not have a Receive context available with which to construct its reply.
2. Possibly since before commit 71641d99ce03 ("svcrdma: Properly compute .len and .buflen for received RPC Calls"), svc_rdma_recvfrom() did not include the transport header in the returned xdr_buf. There should have been no need for svc_defer() and friends to save and restore that header, as of that commit. This issue is addressed in a backport-friendly way by simply having svc_rdma_recvfrom() set rq_xprt_hlen to zero unconditionally, just as svc_tcp_recvfrom() does. This enables svc_deferred_recv() to correctly reconstruct an RPC message received via RPC/RDMA.
Reported-by: Trond Myklebust <trondmy@hammerspace.com> Link: https://lore.kernel.org/linux-nfs/82662b7190f26fb304eb0ab1bb04279072439d4e.camel@hammerspace.com/ Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org>
2021-10-04 | svcrdma: Split the svcrdma_wc_receive() tracepoint | Chuck Lever
There are currently three separate purposes being served by a single tracepoint here. They need to be split up.
svcrdma_wc_recv:
- status is always zero, so there's no value in recording it.
- vendor_err is meaningless unless status is not zero, so there's no value in recording it.
- This tracepoint is needed only when developing modifications, so it should be left disabled most of the time.
svcrdma_wc_recv_flush:
- As above, needed only rarely, and not an error.
svcrdma_wc_recv_err:
- received is always zero, so there's no value in recording it.
- This tracepoint can be left enabled because completion errors are run-time problems (except for FLUSHED_ERR).
- Tracepoint name now ends in _err to reflect its purpose.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-03-31 | svcrdma: Clean up dto_q critical section in svc_rdma_recvfrom() | Chuck Lever
This, to me, seems less cluttered and less redundant. I was hoping it could help reduce lock contention on the dto_q lock by reducing the size of the critical section, but alas, the only improvement is readability. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-31 | svcrdma: Remove svc_rdma_recv_ctxt::rc_pages and ::rc_arg | Chuck Lever
These fields are no longer used. The size of struct svc_rdma_recv_ctxt is now less than 300 bytes on x86_64, down from 2440 bytes. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-31 | svcrdma: Remove sc_read_complete_q | Chuck Lever
Now that svc_rdma_recvfrom() waits for Read completion, sc_read_complete_q is no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-31 | svcrdma: Single-stage RDMA Read | Chuck Lever
Currently the generic RPC server layer calls svc_rdma_recvfrom() twice to retrieve an RPC message that uses Read chunks. I'm not exactly sure why this design was chosen originally. Instead, let's wait for the Read chunk completion inline in the first call to svc_rdma_recvfrom(). The goal is to eliminate some page allocator churn. rdma_read_complete() replaces pages in the second svc_rqst by calling put_page() repeatedly while the upper layer waits for the request to be constructed, which adds unnecessary NFS WRITE round-trip latency. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Tom Talpey <tom@talpey.com>
2021-03-22 | SUNRPC: Move svc_xprt_received() call sites | Chuck Lever
Currently, XPT_BUSY is not cleared until xpo_recvfrom returns. That effectively blocks the receipt and handling of the next RPC message until the current one has been taken off the transport. This strict ordering is a requirement for socket transports. For our kernel RPC/RDMA transport implementation, however, dequeuing an ingress message is nothing more than a list_del(). The transport can safely be marked un-busy as soon as that is done. To keep the changes simpler, this patch just moves the svc_xprt_received() call site from svc_handle_xprt() into the transports, so that the actual optimization can be done in a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-22 | svcrdma: Add a "deferred close" helper | Chuck Lever
Refactor a bit of commonly used logic so that every site that wants a close deferred to an nfsd thread does all the right things (set_bit(XPT_CLOSE) then enqueue). Also, once XPT_CLOSE is set on a transport, it is never cleared. If XPT_CLOSE is already set, then the close is already being handled and the enqueue can be skipped. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
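The set-then-enqueue rule above can be sketched with C11 atomics. This is an illustration, not the kernel helper: sketch_xprt, SKETCH_XPT_CLOSE and sketch_enqueue() are hypothetical names, and the kernel's bit operations are approximated with atomic_fetch_or().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_XPT_CLOSE (1UL << 0)   /* hypothetical flag bit */

struct sketch_xprt {
	atomic_ulong flags;
	int enqueued;                 /* counts calls to the enqueue stand-in */
};

static void sketch_xprt_init(struct sketch_xprt *xprt)
{
	atomic_init(&xprt->flags, 0UL);
	xprt->enqueued = 0;
}

/* Stand-in for handing the transport to a service thread. */
static void sketch_enqueue(struct sketch_xprt *xprt)
{
	xprt->enqueued++;
}

/*
 * Set the close flag and enqueue the transport, but only for the caller
 * that actually flipped the bit. The flag is never cleared, so a later
 * caller sees it already set and skips the enqueue.
 */
static bool sketch_deferred_close(struct sketch_xprt *xprt)
{
	unsigned long old;

	assert(xprt != NULL);
	old = atomic_fetch_or(&xprt->flags, SKETCH_XPT_CLOSE);
	if (old & SKETCH_XPT_CLOSE)
		return false;         /* close already being handled */
	sketch_enqueue(xprt);
	return true;
}
```

Using a single atomic read-modify-write means two racing callers cannot both observe the flag as clear, so at most one of them enqueues.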
2021-03-22 | svcrdma: Maintain a Receive water mark | Chuck Lever
Post more Receives when the number of pending Receives drops below a water mark. The batch mechanism is disabled if the underlying device cannot support a reasonably-sized Receive Queue. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
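A minimal sketch of a water-mark refill policy, assuming a simple counter model. The field names, the exact threshold comparison and the "batch of one means batching is disabled" convention are assumptions made for illustration; the commit message does not spell them out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a Receive Queue's bookkeeping. */
struct sketch_rq {
	unsigned int pending;     /* Receives currently posted */
	unsigned int watermark;   /* refill when pending drops below this */
	unsigned int batch;       /* Receives per refill; <= 1 disables batching */
	unsigned int posts;       /* number of post operations issued */
};

/*
 * Called once per Receive completion. Returns how many new Receives
 * were posted in response (possibly zero).
 */
static unsigned int sketch_refresh_recvs(struct sketch_rq *rq)
{
	unsigned int n;

	assert(rq != NULL && rq->pending > 0);
	rq->pending--;                    /* one Receive just completed */

	if (rq->batch <= 1)
		n = 1;                    /* no batching: replace one for one */
	else if (rq->pending < rq->watermark)
		n = rq->batch;            /* below the mark: post a whole batch */
	else
		return 0;                 /* enough Receives are still posted */

	rq->pending += n;
	rq->posts++;                      /* one post operation covers the batch */
	return n;
}
```

With batching enabled, several completions are absorbed before one post operation replenishes the queue, which is the point of the water mark.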
2021-03-22 | svcrdma: Use svc_rdma_refresh_recvs() in wc_receive | Chuck Lever
Replace svc_rdma_post_recv() with the new batch receive mechanism. For the moment it is posting just a single Receive WR at a time, so no change in behavior is expected. Since svc_rdma_wc_receive() was the last call site for svc_rdma_post_recv(), it is removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-22 | svcrdma: Add a batch Receive posting mechanism | Chuck Lever
Introduce a server-side mechanism similar to commit e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)") to post Receive WRs in batch. Its first consumer is svc_rdma_post_recvs(), which posts the initial set of Receive WRs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-22 | svcrdma: Remove stale comment for svc_rdma_wc_receive() | Chuck Lever
xprt pinning was removed in commit 365e9992b90f ("svcrdma: Remove transport reference counting"), but this comment was not updated to reflect that change. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-22 | svcrdma: RPCDBG_FACILITY is no longer used | Chuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-03-11 | svcrdma: Revert "svcrdma: Reduce Receive doorbell rate" | Chuck Lever
I tested commit 43042b90cae1 ("svcrdma: Reduce Receive doorbell rate") with mlx4 (IB) and software iWARP and didn't find any issues. However, I recently got my hardware iWARP setup back on line (FastLinQ) and it's crashing hard on this commit (confirmed via bisect). The failure mode is complex.
- After a connection is established, the first Receive completes normally.
- But the second and third Receives have garbage in their Receive buffers. The server responds with ERR_VERS as a result.
- When the client tears down the connection to retry, a couple of posted Receives flush twice, and that corrupts the recv_ctxt free list.
- __svc_rdma_free then faults or loops infinitely while destroying the xprt's recv_ctxts.
Since 43042b90cae1 ("svcrdma: Reduce Receive doorbell rate") does not fix a bug but is a scalability enhancement, it's safe and appropriate to revert it while working on a replacement. Fixes: 43042b90cae1 ("svcrdma: Reduce Receive doorbell rate") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25 | svcrdma: DMA-sync the receive buffer in svc_rdma_recvfrom() | Chuck Lever
The Receive completion handler doesn't look at the contents of the Receive buffer. The DMA sync isn't terribly expensive but it's one less thing that needs to be done by the Receive completion handler, which is single-threaded (per svc_xprt). This helps scalability. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-01-25 | svcrdma: Reduce Receive doorbell rate | Chuck Lever
This is similar to commit e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)") which added Receive batching to the client. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-25 | svcrdma: Convert rdma_stat_recv to a per-CPU counter | Chuck Lever
Receives are frequent events. Avoid the overhead of a memory bus lock cycle for counting a value that is hardly ever used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
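The idea behind a per-CPU counter can be sketched as one slot per CPU, incremented without any locked instruction and summed only on the rare read. This is a userspace model, not the kernel's percpu_counter API: SKETCH_NR_CPUS and the explicit cpu argument are assumptions, and a real implementation keeps each slot in that CPU's own memory so that slots do not share a cache line.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4   /* hypothetical CPU count for the sketch */

struct sketch_percpu_counter {
	unsigned long count[SKETCH_NR_CPUS];   /* one slot per CPU */
};

/* Hot path: a plain increment of the caller's own slot, no atomic RMW. */
static void sketch_counter_inc(struct sketch_percpu_counter *c, unsigned int cpu)
{
	assert(c != NULL && cpu < SKETCH_NR_CPUS);
	c->count[cpu]++;
}

/* Cold path: fold all slots together when someone reads the statistic. */
static unsigned long sketch_counter_sum(const struct sketch_percpu_counter *c)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < SKETCH_NR_CPUS; i++)
		sum += c->count[i];
	return sum;
}
```

The cost moves from every Receive (the frequent event) to the occasional reader of the statistic.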
2020-11-30 | svcrdma: Use the new parsed chunk list when pulling Read chunks | Chuck Lever
As a pre-requisite for handling multiple Read chunks in each Read list, convert svc_rdma_recv_read_chunk() to use the new parsed Read chunk list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 | svcrdma: Remove chunk list pointers | Chuck Lever
Clean up: These pointers are no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 | svcrdma: Support multiple write chunks when pulling up | Chuck Lever
When counting the number of SGEs needed to construct a Send request, do not count result payloads. And, when copying the Reply message into the pull-up buffer, result payloads are not to be copied to the Send buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 | svcrdma: Use parsed chunk lists to construct RDMA Writes | Chuck Lever
Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing RDMA Writes, use the new parsed chunk lists for the Write list and Reply chunk, which are version-agnostic and already XDR-decoded. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 | svcrdma: Use parsed chunk lists to detect reverse direction replies | Chuck Lever
Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Note that the XID sanity test is also removed. The XID is already looked up by the cb handler, and is rejected if it's not recognized. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 | svcrdma: Use parsed chunk lists to derive the inv_rkey | Chuck Lever
Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 | svcrdma: Add a "parsed chunk list" data structure | Chuck Lever
This simple data structure binds the location of each data payload inside of an RPC message to the chunk that will be used to push it to or pull it from the client. There are several benefits to this small additional overhead:
* It enables support for more than one chunk in incoming Read and Write lists.
* It translates the version-specific on-the-wire format into a generic in-memory structure, enabling support for multiple versions of the RPC/RDMA transport protocol.
* It enables the server to re-organize a chunk list if it needs to adjust where Read chunk data lands in server memory without altering the contents of the XDR-encoded Receive buffer.
Construction of these lists is done while sanity checking each incoming RPC/RDMA header. Subsequent patches will make use of the generated data structures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-28 | svcrdma: Remove transport reference counting | Chuck Lever
Jason tells me that a ULP cannot rely on getting an ESTABLISHED and DISCONNECTED event pair for each connection, so transport reference counting in the CM event handler will never be reliable. Now that we have ib_drain_qp(), svcrdma should no longer need to hold transport references while Sends and Receives are posted. So remove the get/put call sites in the CM event handlers. This eliminates a significant source of locked memory bus traffic. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-28 | svcrdma: Fix another Receive buffer leak | Chuck Lever
During a connection tear down, the Receive queue is flushed before the device resources are freed. Typically, all the Receives flush with IB_WR_FLUSH_ERR. However, any pending successful Receives flush with IB_WR_SUCCESS, and the server automatically posts a fresh Receive to replace the completing one. This happens even after the connection has closed and the RQ is drained. Receives that are posted after the RQ is drained appear never to complete, causing a Receive resource leak. The leaked Receive buffer is left DMA-mapped. To prevent these late-posted recv_ctxt's from leaking, block new Receive posting after XPT_CLOSE is set. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-13 | svcrdma: Record Receive completion ID in svc_rdma_decode_rqst | Chuck Lever
When recording a trace event in the Receive path, tie decoding results and errors to an incoming Receive completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-13 | svcrdma: Introduce Receive completion IDs | Chuck Lever
Set up a completion ID in each svc_rdma_recv_ctxt. The ID is used to match an incoming Receive completion to a transport and to a previous ib_post_recv(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-13 | svcrdma: Add common XDR decoders for RDMA and Read segments | Chuck Lever
Clean up: De-duplicate some code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-13 | SUNRPC: Add helpers for decoding list discriminators symbolically | Chuck Lever
Use these helpers in a few spots to demonstrate their use. The remaining open-coded discriminator checks in rpcrdma will be addressed in subsequent patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-13 | svcrdma: Consolidate send_error helper functions | Chuck Lever
Final refactor: Replace internals of svc_rdma_send_error() with a simple call to svc_rdma_send_error_msg(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-07-13 | svcrdma: Add @rctxt parameter to svc_rdma_send_error() functions | Chuck Lever
Another step towards making svc_rdma_send_error_msg() and svc_rdma_send_error() similar enough to eliminate one of them. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 | svcrdma: Rename tracepoints that record header decoding errors | Chuck Lever
Clean up: Use a consistent naming convention so that these trace points can be enabled quickly via a glob. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 | svcrdma: Fix backchannel return code | Chuck Lever
Way back when I was writing the RPC/RDMA server-side backchannel code, I misread the TCP backchannel reply handler logic. When svc_tcp_recvfrom() successfully receives a backchannel reply, it does not return -EAGAIN. It sets XPT_DATA and returns zero. Update svc_rdma_recvfrom() to return zero. Here, XPT_DATA doesn't need to be set again: it is set whenever a new message is received, behind a spin lock in a single threaded context. Also, if handling the cb reply is not successful, the message is simply dropped. There's no special message framing to deal with as there is in the TCP case. Now that the handle_bc_reply() return value is ignored, I've removed the dprintk call sites in the error exit of handle_bc_reply() in favor of trace points in other areas that already report the error cases. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-17 | svcrdma: Fix leak of svc_rdma_recv_ctxt objects | Chuck Lever
Utilize the xpo_release_rqst transport method to ensure that each rqstp's svc_rdma_recv_ctxt object is released even when the server cannot return a Reply for that rqstp. Without this fix, each RPC whose Reply cannot be sent leaks one svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped Receive buffer, and any pages that might be part of the Reply message. The leak is infrequent unless the network fabric is unreliable or Kerberos is in use, as GSS sequence window overruns, which result in connection loss, are more common on fast transports. Fixes: 3a88092ee319 ("svcrdma: Preserve Receive buffer until svc_rdma_sendto") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 | svcrdma: Fix double sync of transport header buffer | Chuck Lever
Performance optimization: Avoid syncing the transport buffer twice when Reply buffer pull-up is necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 | svcrdma: Refactor chunk list encoders | Chuck Lever
Same idea as the receive-side changes I did a while back: use xdr_stream helpers rather than open-coding the XDR chunk list encoders. This builds the Reply transport header from beginning to end without backtracking. As additional clean-ups, fill in documenting comments for the XDR encoders and sprinkle some trace points in the new encoding functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 | svcrdma: De-duplicate code that locates Write and Reply chunks | Chuck Lever
Cache the locations of the Requester-provided Write list and Reply chunk so that the Send path doesn't need to parse the Call header again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 | svcrdma: Use struct xdr_stream to decode ingress transport headers | Chuck Lever
The logic that checks incoming network headers has to be scrupulous. De-duplicate: replace open-coded buffer overflow checks with the use of xdr_stream helpers that are used most everywhere else XDR decoding is done. One minor change to the sanity checks: instead of checking the length of individual segments, cap the length of the whole chunk to be sure it can fit in the set of pages available in rq_pages. This should be a better test of whether the server can handle the chunks in each request. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
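The change to the sanity check (cap the whole chunk rather than each segment) can be sketched as follows. The segment layout, SKETCH_PAGE_SIZE and the function name are hypothetical; the sketch only assumes that a chunk is an array of segments and that the request has a known number of pages available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size for the sketch */

/* Hypothetical decoded RDMA segment: handle, length, offset. */
struct sketch_segment {
	uint32_t handle;
	uint32_t length;
	uint64_t offset;
};

/*
 * Accept a chunk only if the sum of its segment lengths fits in the
 * pages available to the request. Individual segment lengths are not
 * checked; only the running total matters.
 */
static bool sketch_chunk_fits(const struct sketch_segment *segs, size_t nsegs,
			      size_t npages)
{
	uint64_t cap = (uint64_t)npages * SKETCH_PAGE_SIZE;
	uint64_t total = 0;
	size_t i;

	assert(segs != NULL || nsegs == 0);
	for (i = 0; i < nsegs; i++) {
		total += segs[i].length;   /* 64-bit sum of 32-bit lengths */
		if (total > cap)
			return false;
	}
	return true;
}
```

A chunk made of many small segments is rejected once their total exceeds the available pages, which a per-segment limit alone would not catch.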
2020-03-16 | nfsd: Fix NFSv4 READ on RDMA when using readv | Chuck Lever
svcrdma expects that the payload falls precisely into the xdr_buf page vector. This does not seem to be the case for nfsd4_encode_readv(). This code is called only when fops->splice_read is missing or when RQ_SPLICE_OK is clear, so it's not a noticeable problem in many common cases. Add new transport method: ->xpo_read_payload so that when a READ payload does not fit exactly in rq_res's page vector, the XDR encoder can inform the RPC transport exactly where that payload is, without the payload's XDR pad. That way, when a Write chunk is present, the transport knows what byte range in the Reply message is supposed to be matched with the chunk. Note that the Linux NFS server implementation of NFS/RDMA can currently handle only one Write chunk per RPC-over-RDMA message. This simplifies the implementation of this fix. Fixes: b04209806384 ("nfsd4: allow exotic read compounds") Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053 Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2019-08-19 | svcrdma: Use llist for managing cache of recv_ctxts | Chuck Lever
Use a wait-free mechanism for managing the svc_rdma_recv_ctxts free list. Subsequently, sc_recv_lock can be eliminated. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
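A lock-less free list of this kind can be sketched as a singly linked stack updated with compare-and-swap. This is a userspace approximation using C11 atomics, not the kernel's llist code, and all sketch_ names are hypothetical. The pop side is only safe with one consumer at a time, because concurrent poppers could hit the ABA problem.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_node {
	struct sketch_node *next;
};

struct sketch_head {
	_Atomic(struct sketch_node *) first;
};

static void sketch_list_init(struct sketch_head *h)
{
	atomic_init(&h->first, NULL);
}

/* Push without a lock: retry the CAS until the head has not moved. */
static void sketch_push(struct sketch_head *h, struct sketch_node *n)
{
	struct sketch_node *old;

	assert(h != NULL && n != NULL);
	old = atomic_load_explicit(&h->first, memory_order_relaxed);
	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&h->first, &old, n,
							memory_order_release,
							memory_order_relaxed));
}

/* Detach the first node, or return NULL if the list is empty. */
static struct sketch_node *sketch_pop(struct sketch_head *h)
{
	struct sketch_node *old;

	old = atomic_load_explicit(&h->first, memory_order_acquire);
	while (old &&
	       !atomic_compare_exchange_weak_explicit(&h->first, &old, old->next,
						      memory_order_acquire,
						      memory_order_acquire))
		;   /* a failed CAS reloads old; try again */
	return old;
}
```

Since neither operation takes a lock, a spin lock that previously guarded the free list is no longer needed for these two operations.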
2019-02-06 | svcrdma: Remove syslog warnings in work completion handlers | Chuck Lever
These can result in a lot of log noise, and are able to be triggered by client misbehavior. Since there are trace points in these handlers now, there's no need to spam the log. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-02-06 | svcrpc: fix unlikely races preventing queueing of sockets | J. Bruce Fields
In the rpc server, when something happens that might be reason to wake up a thread to do something, what we do is
- modify xpt_flags, sk_sock->flags, xpt_reserved, or xpt_nr_rqsts to indicate the new situation
- call svc_xprt_enqueue() to decide whether to wake up a thread.
svc_xprt_enqueue may require multiple conditions to be true before queueing up a thread to handle the xprt. In the SMP case, one of the other CPUs may have set another required condition, and in that case, although both CPUs run svc_xprt_enqueue(), it's possible that neither call sees the writes done by the other CPU in time, and neither one recognizes that all the required conditions have been set. A socket could therefore be ignored indefinitely. Add memory barriers to ensure that any svc_xprt_enqueue() call will always see the conditions changed by other CPUs before deciding to ignore a socket. I've never seen this race reported. In the unlikely event it happens, another event will usually come along and the problem will fix itself. So I don't think this is worth backporting to stable. Chuck tried this patch and said "I don't see any performance regressions, but my server has only a single last-level CPU cache." Tested-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
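The race described above is the classic store-then-load pattern: each CPU stores its own condition and then loads the other's. A sketch with C11 atomics shows where a full fence goes. The two condition flags and all sketch_ names are hypothetical, the exact barrier placement in the kernel patch is not reproduced here, and the real code's protection against double enqueue is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transport with two conditions required before enqueueing. */
struct sketch_xprt {
	atomic_bool has_data;
	atomic_bool has_space;
	atomic_int enqueued;
};

static void sketch_xprt_init(struct sketch_xprt *x)
{
	atomic_init(&x->has_data, false);
	atomic_init(&x->has_space, false);
	atomic_init(&x->enqueued, 0);
}

/* Decide whether to enqueue; returns true if all conditions were seen. */
static bool sketch_enqueue(struct sketch_xprt *x)
{
	assert(x != NULL);
	/*
	 * Full fence: the caller's earlier store is ordered before the
	 * loads below. With a fence on both CPUs, at least one of two
	 * racing callers is guaranteed to observe both conditions.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	if (!atomic_load_explicit(&x->has_data, memory_order_relaxed) ||
	    !atomic_load_explicit(&x->has_space, memory_order_relaxed))
		return false;
	atomic_fetch_add(&x->enqueued, 1);
	return true;
}

/* Event on one CPU: record the new situation, then try to enqueue. */
static bool sketch_data_ready(struct sketch_xprt *x)
{
	atomic_store_explicit(&x->has_data, true, memory_order_relaxed);
	return sketch_enqueue(x);
}

/* Event on another CPU: same pattern for the second condition. */
static bool sketch_space_ready(struct sketch_xprt *x)
{
	atomic_store_explicit(&x->has_space, true, memory_order_relaxed);
	return sketch_enqueue(x);
}
```

Without the fence, both callers could read a stale value of the other's flag and both could return false, leaving the transport unserviced until some later event.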
2018-11-28 | svcrdma: Optimize the logic that selects the R_key to invalidate | Chuck Lever
o Select the R_key to invalidate while the CPU cache still contains the received RPC Call transport header, rather than waiting until we're about to send the RPC Reply.
o Choose Send With Invalidate if there is exactly one distinct R_key in the received transport header. If there's more than one, the client will have to perform local invalidation after it has already waited for remote invalidation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
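The "exactly one distinct R_key" rule can be sketched as a scan over the R_keys found in the transport header. The function name, the flat array input and the use of 0 as a "do not invalidate remotely" sentinel are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the R_key to use for Send With Invalidate, or 0 (assumed
 * sentinel) when a plain Send should be used instead. Remote
 * invalidation is chosen only if every segment carries the same R_key.
 */
static uint32_t sketch_select_inv_rkey(const uint32_t *rkeys, size_t n)
{
	size_t i;

	assert(rkeys != NULL || n == 0);
	if (n == 0)
		return 0;             /* no chunks: nothing to invalidate */
	for (i = 1; i < n; i++)
		if (rkeys[i] != rkeys[0])
			return 0;     /* more than one distinct R_key */
	return rkeys[0];
}
```

Running this while the header is being parsed matches the first bullet above: the decision is made while the header is still hot in the CPU cache.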