summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-11-20bpf: extract __check_reg_arg() utility functionEduard Zingerman
Split check_reg_arg() into two utility functions: - check_reg_arg() operating on registers from current verifier state; - __check_reg_arg() operating on a specific set of registers passed as a parameter; The __check_reg_arg() function would be used by a follow-up change for callbacks handling. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231121020701.26440-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20selftests/bpf: fix bpf_loop_bench for new callback verification schemeEduard Zingerman
This is a preparatory change. A follow-up patch "bpf: verify callbacks as if they are called unknown number of times" changes logic for callbacks handling. While previously callbacks were verified as a single function call, new scheme takes into account that callbacks could be executed unknown number of times. This has dire implications for bpf_loop_bench: SEC("fentry/" SYS_PREFIX "sys_getpgid") int benchmark(void *ctx) { for (int i = 0; i < 1000; i++) { bpf_loop(nr_loops, empty_callback, NULL, 0); __sync_add_and_fetch(&hits, nr_loops); } return 0; } W/o callbacks change verifier sees it as a 1000 calls to empty_callback(). However, with callbacks change things become exponential: - i=0: state exploring empty_callback is scheduled with i=0 (a); - i=1: state exploring empty_callback is scheduled with i=1; ... - i=999: state exploring empty_callback is scheduled with i=999; - state (a) is popped from stack; - i=1: state exploring empty_callback is scheduled with i=1; ... Avoid this issue by rewriting outer loop as bpf_loop(). Unfortunately, this adds a function call to a loop at runtime, which negatively affects performance: throughput latency before: 149.919 ± 0.168 M ops/s, 6.670 ns/op after : 137.040 ± 0.187 M ops/s, 7.297 ns/op Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231121020701.26440-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20selftests/bpf: track string payload offset as scalar in strobemetaEduard Zingerman
This change prepares strobemeta for update in callbacks verification logic. To allow bpf_loop() verification converge when multiple callback iterations are considered: - track offset inside strobemeta_payload->payload directly as scalar value; - at each iteration make sure that remaining strobemeta_payload->payload capacity is sufficient for execution of read_{map,str}_var functions; - make sure that offset is tracked as unbound scalar between iterations, otherwise verifier won't be able infer that bpf_loop callback reaches identical states. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231121020701.26440-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20selftests/bpf: track tcp payload offset as scalar in xdp_synproxyEduard Zingerman
This change prepares syncookie_{tc,xdp} for update in callbakcs verification logic. To allow bpf_loop() verification converge when multiple callback itreations are considered: - track offset inside TCP payload explicitly, not as a part of the pointer; - make sure that offset does not exceed MAX_PACKET_OFF enforced by verifier; - make sure that offset is tracked as unbound scalar between iterations, otherwise verifier won't be able infer that bpf_loop callback reaches identical states. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231121020701.26440-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20eventfs: Do not invalidate dentry in create_file/dir_dentry()Steven Rostedt (Google)
With the call to simple_recursive_removal() on the entire eventfs sub system when the directory is removed, it performs the d_invalidate on all the dentries when it is removed. There's no need to do clean ups when a dentry is being created while the directory is being deleted. As dentries are cleaned up by the simpler_recursive_removal(), trying to do d_invalidate() in these functions will cause the dentry to be invalidated twice, and crash the kernel. Link: https://lore.kernel.org/all/20231116123016.140576-1-naresh.kamboju@linaro.org/ Link: https://lkml.kernel.org/r/20231120235154.422970988@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 407c6726ca71 ("eventfs: Use simple_recursive_removal() to clean up dentries") Reported-by: Mark Rutland <mark.rutland@arm.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-11-20eventfs: Remove expectation that ei->is_freed means ei->dentry == NULLSteven Rostedt (Google)
The logic to free the eventfs_inode (ei) use to set is_freed and clear the "dentry" field under the eventfs_mutex. But that changed when a race was found where the ei->dentry needed to be cleared when the last dput() was called on it. But there was still logic that checked if ei->dentry was not NULL and is_freed is set, and would warn if it was. But since that situation was changed and the ei->dentry isn't cleared until the last dput() is called on it while the ei->is_freed is set, do not test for that condition anymore, and change the comments to reflect that. Link: https://lkml.kernel.org/r/20231120235154.265826243@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 020010fbfa20 ("eventfs: Delete eventfs_inode when the last dentry is freed") Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-11-20io_uring: fix off-by one bvec indexKeith Busch
If the offset equals the bv_len of the first registered bvec, then the request does not include any of that first bvec. Skip it so that drivers don't have to deal with a zero length bvec, which was observed to break NVMe's PRP list creation. Cc: stable@vger.kernel.org Fixes: bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed buffers") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231120221831.2646460-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20s390/dasd: protect device queue against concurrent accessJan Höppner
In dasd_profile_start() the amount of requests on the device queue are counted. The access to the device queue is unprotected against concurrent access. With a lot of parallel I/O, especially with alias devices enabled, the device queue can change while dasd_profile_start() is accessing the queue. In the worst case this leads to a kernel panic due to incorrect pointer accesses. Fix this by taking the device lock before accessing the queue and counting the requests. Additionally the check for a valid profile data pointer can be done earlier to avoid unnecessary locking in a hot path. Cc: <stable@vger.kernel.org> Fixes: 4fa52aa7a82f ("[S390] dasd: add enhanced DASD statistics interface") Reviewed-by: Stefan Haberland <sth@linux.ibm.com> Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20231025132437.1223363-3-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20s390/dasd: resolve spelling mistakeMuhammad Muzammil
resolve typing mistake from pimary to primary Signed-off-by: Muhammad Muzammil <m.muzzammilashraf@gmail.com> Link: https://lore.kernel.org/r/20231010043140.28416-1-m.muzzammilashraf@gmail.com Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20231025132437.1223363-2-sth@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20Merge branch 'bpf_redirect_peer fixes'Martin KaFai Lau
Daniel Borkmann says: ==================== This fixes bpf_redirect_peer stats accounting for veth and netkit, and adds tstats in the first place for the latter. Utilise indirect call wrapper for bpf_redirect_peer, and improve test coverage of the latter also for netkit devices. Details in the patches, thanks! The series was targeted at bpf originally, and is done here as well, so it can trigger BPF CI. Jakub, if you think directly going via net is better since the majority of the diff touches net anyway, that is fine, too. Thanks! v2 -> v3: - Add kdoc for pcpu_stat_type (Simon) - Reject invalid type value in netdev_do_alloc_pcpu_stats (Simon) - Add Reviewed-by tags from list v1 -> v2: - Move stats allocation/freeing into net core (Jakub) - As prepwork for the above, move vrf's dstats over into the core - Add a check into stats alloc to enforce tstats upon implementing ndo_get_peer_dev - Add Acked-by tags from list Daniel Borkmann (6): net, vrf: Move dstats structure to core net: Move {l,t,d}stats allocation to core and convert veth & vrf netkit: Add tstats per-CPU traffic counters bpf, netkit: Add indirect call wrapper for fetching peer dev selftests/bpf: De-veth-ize the tc_redirect test case selftests/bpf: Add netkit to tc_redirect selftest ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20selftests/bpf: Add netkit to tc_redirect selftestDaniel Borkmann
Extend the existing tc_redirect selftest to also cover netkit devices for exercising the bpf_redirect_peer() code paths, so that we have both veth as well as netkit covered, all tests still pass after this change. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20231114004220.6495-9-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20selftests/bpf: De-veth-ize the tc_redirect test caseDaniel Borkmann
No functional changes to the test case, but just renaming various functions, variables, etc, to remove veth part of their name for making it more generic and reusable later on (e.g. for netkit). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20231114004220.6495-8-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20bpf, netkit: Add indirect call wrapper for fetching peer devDaniel Borkmann
ndo_get_peer_dev is used in tcx BPF fast path, therefore make use of indirect call wrapper and therefore optimize the bpf_redirect_peer() internal handling a bit. Add a small skb_get_peer_dev() wrapper which utilizes the INDIRECT_CALL_1() macro instead of open coding. Future work could potentially add a peer pointer directly into struct net_device in future and convert veth and netkit over to use it so that eventually ndo_get_peer_dev can be removed. Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20231114004220.6495-7-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20bpf: Fix dev's rx stats for bpf_redirect_peer trafficPeilin Ye
Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium) is not accounted for in the RX stats of supported devices (that is, veth and netkit), confusing user space metrics collectors such as cAdvisor [0], as reported by Youlun. Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the @tstats per-CPU counters (instead of @lstats, or @dstats). To make this more fool-proof, error out when ndo_get_peer_dev is set but @tstats are not selected. [0] Specifically, the "container_network_receive_{byte,packet}s_total" counters are affected. Fixes: 9aa1206e8f48 ("bpf: Add redirect_peer helper") Reported-by: Youlun Zhang <zhangyoulun@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20231114004220.6495-6-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20veth: Use tstats per-CPU traffic countersPeilin Ye
Currently veth devices use the lstats per-CPU traffic counters, which only cover TX traffic. veth_get_stats64() actually populates RX stats of a veth device from its peer's TX counters, based on the assumption that a veth device can _only_ receive packets from its peer, which is no longer true: For example, recent CNIs (like Cilium) can use the bpf_redirect_peer() BPF helper to redirect traffic from NIC's tc ingress to veth's tc ingress (in a different netns), skipping veth's peer device. Unfortunately, this kind of traffic isn't currently accounted for in veth's RX stats. In preparation for the fix, use tstats (instead of lstats) to maintain both RX and TX counters for each veth device. We'll use RX counters for bpf_redirect_peer() traffic, and keep using TX counters for the usual "peer-to-peer" traffic. In veth_get_stats64(), calculate RX stats by _adding_ RX count to peer's TX count, in order to cover both kinds of traffic. veth_stats_rx() might need a name change (perhaps to "veth_stats_xdp()") for less confusion, but let's leave it to another patch to keep the fix minimal. Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20231114004220.6495-5-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20netkit: Add tstats per-CPU traffic countersDaniel Borkmann
Add dev->tstats traffic accounting to netkit. The latter contains per-CPU RX and TX counters. The dev's TX counters are bumped upon pass/unspec as well as redirect verdicts, in other words, on everything except for drops. The dev's RX counters are bumped upon successful __netif_rx(), as well as from skb_do_redirect() (not part of this commit here). Using dev->lstats with having just a single packets/bytes counter and inferring one another's RX counters from the peer dev's lstats is not possible given skb_do_redirect() can also bump the device's stats. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20231114004220.6495-4-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20net: Move {l,t,d}stats allocation to core and convert veth & vrfDaniel Borkmann
Move {l,t,d}stats allocation to the core and let netdevs pick the stats type they need. That way the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc) - all happening in the core. Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20net, vrf: Move dstats structure to coreDaniel Borkmann
Just move struct pcpu_dstats out of the vrf into the core, and streamline the field names slightly, so they better align with the {t,l}stats ones. No functional change otherwise. A conversion of the u64s to u64_stats_t could be done at a separate point in future. This move is needed as we are moving the {t,l,d}stats allocation/freeing to the core. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Cc: Jakub Kicinski <kuba@kernel.org> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231114004220.6495-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20block/null_blk: Fix double blk_mq_start_request() warningChengming Zhou
When CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION is enabled, null_queue_rq() would return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE for the request, which has been marked as MQ_RQ_IN_FLIGHT by blk_mq_start_request(). Then null_queue_rqs() put these requests in the rqlist, return back to the block layer core, which would try to queue them individually again, so the warning in blk_mq_start_request() triggered. Fix it by splitting the null_queue_rq() into two parts: the first is the preparation of request, the second is the handling of request. We put the blk_mq_start_request() after the preparation part, which may fail and return back to the block layer core. The throttling also belongs to the preparation part, so move it before blk_mq_start_request(). And change the return type of null_handle_cmd() to void, since it always return BLK_STS_OK now. Reported-by: <syzbot+fcc47ba2476570cbbeb0@syzkaller.appspotmail.com> Closes: https://lore.kernel.org/all/0000000000000e6aac06098aee0c@google.com/ Fixes: d78bfa1346ab ("block/null_blk: add queue_rqs() support") Suggested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/r/20231120032521.1012037-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20nvmet-tcp: always initialize tls_handshake_tmo_workHannes Reinecke
The TLS handshake timeout work item should always be initialized to avoid a crash when cancelling the workqueue. Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall") Suggested-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvmet: nul-terminate the NQNs passed in the connect commandChristoph Hellwig
The host and subsystem NQNs are passed in the connect command payload and interpreted as nul-terminated strings. Ensure they actually are nul-terminated before using them. Fixes: a07b4970f464 "nvmet: add a generic NVMe target") Reported-by: Alon Zahavi <zahavi.alon@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme: blank out authentication fabrics options if not configuredHannes Reinecke
If the config option NVME_HOST_AUTH is not selected we should not accept the corresponding fabrics options. This allows userspace to detect if NVMe authentication has been enabled for the kernel. Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: f50fff73d620 ("nvme: implement In-Band authentication") Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme: catch errors from nvme_configure_metadata()Hannes Reinecke
nvme_configure_metadata() is issuing I/O, so we might incur an I/O error which will cause the connection to be reset. But in that case any further probing will race with reset and cause UAF errors. So return a status from nvme_configure_metadata() and abort probing if there was an I/O error. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme-tcp: only evaluate 'tls' option if TLS is selectedHannes Reinecke
We only need to evaluate the 'tls' connect option if TLS is enabled; otherwise we might be getting a link error. Fixes: 706add13676d ("nvme: keyring: fix conditional compilation") Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/r/202311140426.0eHrTXBr-lkp@intel.com/ Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme-auth: set explanation code for failure2 msgsMark O'Donovan
Some error cases were not setting an auth-failure-reason-code-explanation. This means an AUTH_Failure2 message will be sent with an explanation value of 0 which is a reserved value. Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme-auth: unlock mutex in one place onlyMark O'Donovan
Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20block: Remove blk_set_runtime_active()Damien Le Moal
The function blk_set_runtime_active() is called only from blk_post_runtime_resume(), so there is no need for that function to be exported. Open-code this function directly in blk_post_runtime_resume() and remove it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20231120070611.33951-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20nbd: fix null-ptr-dereference while accessing 'nbd->config'Li Nan
Memory reordering may occur in nbd_genl_connect(), causing config_refs to be set to 1 while nbd->config is still empty. Opening nbd at this time will cause null-ptr-dereference. T1 T2 nbd_open nbd_get_config_unlocked nbd_genl_connect nbd_alloc_and_init_config //memory reordered refcount_set(&nbd->config_refs, 1) // 2 nbd->config ->null point nbd->config = config // 1 Fix it by adding smp barrier to guarantee the execution sequence. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20231116162316.1740402-4-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20nbd: factor out a helper to get nbd_config without holding 'config_lock'Li Nan
There are no functional changes, just to make code cleaner and prepare to fix null-ptr-dereference while accessing 'nbd->config'. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20231116162316.1740402-3-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20nbd: fold nbd config initialization into nbd_alloc_config()Li Nan
There are no functional changes, make the code cleaner and prepare to fix null-ptr-dereference while accessing 'nbd->config'. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20231116162316.1740402-2-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20PM: tools: Fix sleepgraph syntax errorDavid Woodhouse
The sleepgraph tool currently fails: File "/usr/bin/sleepgraph", line 4155 or re.match('psci: CPU(?P<cpu>[0-9]*) killed.*', msg)): ^ SyntaxError: unmatched ')' Fixes: 34ea427e01ea ("PM: tools: sleepgraph: Recognize "CPU killed" messages") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Wolfram Sang <wsa@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-20Merge tag 'md-fixes-20231120' of ↵Jens Axboe
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.7 Pull MD fix from Song. * tag 'md-fixes-20231120' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: fix bi_status reporting in md_end_clone_io
2023-11-20ACPI: resource: Skip IRQ override on ASUS ExpertBook B1402CVAHans de Goede
Like various other ASUS ExpertBook-s, the ASUS ExpertBook B1402CVA has an ACPI DSDT table that describes IRQ 1 as ActiveLow while the kernel overrides it to EdgeHigh. This prevents the keyboard from working. To fix this issue, add this laptop to the skip_override_table so that the kernel does not override IRQ 1. Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218114 Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-20ACPI: video: Use acpi_device_fix_up_power_children()Hans de Goede
Commit 89c290ea7589 ("ACPI: video: Put ACPI video and its child devices into D0 on boot") introduced calling acpi_device_fix_up_power_extended() on the video card for which the ACPI video bus is the companion device. This unnecessarily touches the power-state of the GPU itself, while the issue it tries to address only requires calling _PS0 on the child devices. Touching the power-state of the GPU itself is causing suspend / resume issues on e.g. a Lenovo ThinkPad W530. Instead use acpi_device_fix_up_power_children(), which only touches the child devices, to fix this. Fixes: 89c290ea7589 ("ACPI: video: Put ACPI video and its child devices into D0 on boot") Reported-by: Owen T. Heisler <writer@owenh.net> Closes: https://lore.kernel.org/regressions/9f36fb06-64c4-4264-aaeb-4e1289e764c4@owenh.net/ Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/273 Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218124 Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Tested-by: Owen T. Heisler <writer@owenh.net> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Cc: 6.6+ <stable@vger.kernel.org> # 6.6+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-20ACPI: PM: Add acpi_device_fix_up_power_children() functionHans de Goede
In some cases it is necessary to fix-up the power-state of an ACPI device's children without touching the ACPI device itself add a new acpi_device_fix_up_power_children() function for this. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Cc: 6.6+ <stable@vger.kernel.org> # 6.6+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-20ACPI: processor_idle: use raw_safe_halt() in acpi_idle_play_dead()David Woodhouse
Xen HVM guests were observed taking triple-faults when attempting to online a previously offlined vCPU. Investigation showed that the fault was coming from a failing call to lockdep_assert_irqs_disabled(), in load_current_idt() which was too early in the CPU bringup to actually catch the exception and report the failure cleanly. This was a false positive, caused by acpi_idle_play_dead() setting the per-cpu hardirqs_enabled flag by calling safe_halt(). Switch it to use raw_safe_halt() instead, which doesn't do so. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: 6.6+ <stable@vger.kernel.org> # 6.6+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-11-20bcache: avoid NULL checking to c->root in run_cache_set()Coly Li
In run_cache_set() after c->root returned from bch_btree_node_get(), it is checked by IS_ERR_OR_NULL(). Indeed it is unncessary to check NULL because bch_btree_node_get() will not return NULL pointer to caller. This patch replaces IS_ERR_OR_NULL() by IS_ERR() for the above reason. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-11-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc()Coly Li
This patch adds code comments to bch_btree_node_get() and __bch_btree_node_alloc() that NULL pointer will not be returned and it is unnecessary to check NULL pointer by the callers of these routines. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-10-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce()Coly Li
Commit 028ddcac477b ("bcache: Remove unnecessary NULL point check in node allocations") do the following change inside btree_gc_coalesce(), 31 @@ -1340,7 +1340,7 @@ static int btree_gc_coalesce( 32 memset(new_nodes, 0, sizeof(new_nodes)); 33 closure_init_stack(&cl); 34 35 - while (nodes < GC_MERGE_NODES && !IS_ERR_OR_NULL(r[nodes].b)) 36 + while (nodes < GC_MERGE_NODES && !IS_ERR(r[nodes].b)) 37 keys += r[nodes++].keys; 38 39 blocks = btree_default_blocks(b->c) * 2 / 3; At line 35 the original r[nodes].b is not always allocatored from __bch_btree_node_alloc(), and possibly initialized as NULL pointer by caller of btree_gc_coalesce(). Therefore the change at line 36 is not correct. This patch replaces the mistaken IS_ERR() by IS_ERR_OR_NULL() to avoid potential issue. Fixes: 028ddcac477b ("bcache: Remove unnecessary NULL point check in node allocations") Cc: <stable@vger.kernel.org> # 6.5+ Cc: Zheng Wang <zyytlz.wz@163.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-9-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up raceMingzhe Zou
We get a kernel crash about "unable to handle kernel paging request": ```dmesg [368033.032005] BUG: unable to handle kernel paging request at ffffffffad9ae4b5 [368033.032007] PGD fc3a0d067 P4D fc3a0d067 PUD fc3a0e063 PMD 8000000fc38000e1 [368033.032012] Oops: 0003 [#1] SMP PTI [368033.032015] CPU: 23 PID: 55090 Comm: bch_dirtcnt[0] Kdump: loaded Tainted: G OE --------- - - 4.18.0-147.5.1.es8_24.x86_64 #1 [368033.032017] Hardware name: Tsinghua Tongfang THTF Chaoqiang Server/072T6D, BIOS 2.4.3 01/17/2017 [368033.032027] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1d0 [368033.032029] Code: 8b 02 48 85 c0 74 f6 48 89 c1 eb d0 c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 c0 3d 02 00 48 03 04 cd 60 68 93 ad <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 02 [368033.032031] RSP: 0018:ffffbb48852abe00 EFLAGS: 00010082 [368033.032032] RAX: ffffffffad9ae4b5 RBX: 0000000000000246 RCX: 0000000000003bf3 [368033.032033] RDX: ffff97b0ff8e3dc0 RSI: 0000000000600000 RDI: ffffbb4884743c68 [368033.032034] RBP: 0000000000000001 R08: 0000000000000000 R09: 000007ffffffffff [368033.032035] R10: ffffbb486bb01000 R11: 0000000000000001 R12: ffffffffc068da70 [368033.032036] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 [368033.032038] FS: 0000000000000000(0000) GS:ffff97b0ff8c0000(0000) knlGS:0000000000000000 [368033.032039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [368033.032040] CR2: ffffffffad9ae4b5 CR3: 0000000fc3a0a002 CR4: 00000000003626e0 [368033.032042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [368033.032043] bcache: bch_cached_dev_attach() Caching rbd479 as bcache462 on set 8cff3c36-4a76-4242-afaa-7630206bc70b [368033.032045] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [368033.032046] Call Trace: [368033.032054] _raw_spin_lock_irqsave+0x32/0x40 [368033.032061] __wake_up_common_lock+0x63/0xc0 [368033.032073] ? bch_ptr_invalid+0x10/0x10 [bcache] [368033.033502] bch_dirty_init_thread+0x14c/0x160 [bcache] [368033.033511] ? read_dirty_submit+0x60/0x60 [bcache] [368033.033516] kthread+0x112/0x130 [368033.033520] ? kthread_flush_work_fn+0x10/0x10 [368033.034505] ret_from_fork+0x35/0x40 ``` The crash occurred when call wake_up(&state->wait), and then we want to look at the value in the state. However, bch_sectors_dirty_init() is not found in the stack of any task. Since state is allocated on the stack, we guess that bch_sectors_dirty_init() has exited, causing bch_dirty_init_thread() to be unable to handle kernel paging request. In order to verify this idea, we added some printing information during wake_up(&state->wait). We find that "wake up" is printed twice, however we only expect the last thread to wake up once. ```dmesg [ 994.641004] alcache: bch_dirty_init_thread() wake up [ 994.641018] alcache: bch_dirty_init_thread() wake up [ 994.641523] alcache: bch_sectors_dirty_init() init exit ``` There is a race. If bch_sectors_dirty_init() exits after the first wake up, the second wake up will trigger this bug("unable to handle kernel paging request"). Proceed as follows: bch_sectors_dirty_init kthread_run ==============> bch_dirty_init_thread(bch_dirtcnt[0]) ... ... atomic_inc(&state.started) ... ... ... atomic_read(&state.enough) ... ... atomic_set(&state->enough, 1) kthread_run ======================================================> bch_dirty_init_thread(bch_dirtcnt[1]) ... atomic_dec_and_test(&state->started) ... atomic_inc(&state.started) ... ... ... wake_up(&state->wait) ... atomic_read(&state.enough) atomic_dec_and_test(&state->started) ... ... wait_event(state.wait, atomic_read(&state.started) == 0) ... return ... wake_up(&state->wait) We believe it is very common to wake up twice if there is no dirty, but crash is an extremely low probability event. It's hard for us to reproduce this issue. We attached and detached continuously for a week, with a total of more than one million attaches and only one crash. Putting atomic_inc(&state.started) before kthread_run() can avoid waking up twice. Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-8-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20bcache: fixup lock c->root errorMingzhe Zou
We had a problem with io hung because it was waiting for c->root to release the lock. crash> cache_set.root -l cache_set.list ffffa03fde4c0050 root = 0xffff802ef454c800 crash> btree -o 0xffff802ef454c800 | grep rw_semaphore [ffff802ef454c858] struct rw_semaphore lock; crash> struct rw_semaphore ffff802ef454c858 struct rw_semaphore { count = { counter = -4294967297 }, wait_list = { next = 0xffff00006786fc28, prev = 0xffff00005d0efac8 }, wait_lock = { raw_lock = { { val = { counter = 0 }, { locked = 0 '\000', pending = 0 '\000' }, { locked_pending = 0, tail = 0 } } } }, osq = { tail = { counter = 0 } }, owner = 0xffffa03fdc586603 } The "counter = -4294967297" means that lock count is -1 and a write lock is being attempted. Then, we found that there is a btree with a counter of 1 in btree_cache_freeable. crash> cache_set -l cache_set.list ffffa03fde4c0050 -o|grep btree_cache [ffffa03fde4c1140] struct list_head btree_cache; [ffffa03fde4c1150] struct list_head btree_cache_freeable; [ffffa03fde4c1160] struct list_head btree_cache_freed; [ffffa03fde4c1170] unsigned int btree_cache_used; [ffffa03fde4c1178] wait_queue_head_t btree_cache_wait; [ffffa03fde4c1190] struct task_struct *btree_cache_alloc_lock; crash> list -H ffffa03fde4c1140|wc -l 973 crash> list -H ffffa03fde4c1150|wc -l 1123 crash> cache_set.btree_cache_used -l cache_set.list ffffa03fde4c0050 btree_cache_used = 2097 crash> list -s btree -l btree.list -H ffffa03fde4c1140|grep -E -A2 "^ lock = {" > btree_cache.txt crash> list -s btree -l btree.list -H ffffa03fde4c1150|grep -E -A2 "^ lock = {" > btree_cache_freeable.txt [root@node-3 127.0.0.1-2023-08-04-16:40:28]# pwd /var/crash/127.0.0.1-2023-08-04-16:40:28 [root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache.txt|grep counter|grep -v "counter = 0" [root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache_freeable.txt|grep counter|grep -v "counter = 0" counter = 1 We found that this is a bug in bch_sectors_dirty_init() when locking c->root: (1). Thread X has locked c->root(A) write. (2). Thread Y failed to lock c->root(A), waiting for the lock(c->root A). (3). Thread X bch_btree_set_root() changes c->root from A to B. (4). Thread X releases the lock(c->root A). (5). Thread Y successfully locks c->root(A). (6). Thread Y releases the lock(c->root B). down_write locked ---(1)----------------------┐ | | | down_read waiting ---(2)----┐ | | | ┌-------------┐ ┌-------------┐ bch_btree_set_root ===(3)========>> | c->root A | | c->root B | | | └-------------┘ └-------------┘ up_write ---(4)---------------------┘ | | | | | down_read locked ---(5)-----------┘ | | | up_read ---(6)-----------------------------┘ Since c->root may change, the correct steps to lock c->root should be the same as bch_root_usage(), compare after locking. static unsigned int bch_root_usage(struct cache_set *c) { unsigned int bytes = 0; struct bkey *k; struct btree *b; struct btree_iter iter; goto lock_root; do { rw_unlock(false, b); lock_root: b = c->root; rw_lock(false, b, b->level); } while (b != c->root); for_each_key_filter(&b->keys, k, &iter, bch_ptr_bad) bytes += bkey_bytes(k); rw_unlock(false, b); return (bytes * 100) / btree_bytes(c); } Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-7-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20bcache: fixup init dirty data errorsMingzhe Zou
We found that after long run, the dirty_data of the bcache device will have errors. This error cannot be eliminated unless re-register. We also found that reattach after detach, this error can accumulate. In bch_sectors_dirty_init(), all inode <= d->id keys will be recounted again. This is wrong, we only need to count the keys of the current device. Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-6-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20bcache: prevent potential division by zero errorRand Deeb
In SHOW(), the variable 'n' is of type 'size_t.' While there is a conditional check to verify that 'n' is not equal to zero before executing the 'do_div' macro, concerns arise regarding potential division by zero error in 64-bit environments. The concern arises when 'n' is 64 bits in size, greater than zero, and the lower 32 bits of it are zeros. In such cases, the conditional check passes because 'n' is non-zero, but the 'do_div' macro casts 'n' to 'uint32_t,' effectively truncating it to its lower 32 bits. Consequently, the 'n' value becomes zero. To fix this potential division by zero error and ensure precise division handling, this commit replaces the 'do_div' macro with div64_u64(). div64_u64() is designed to work with 64-bit operands, guaranteeing that division is performed correctly. This change enhances the robustness of the code, ensuring that division operations yield accurate results in all scenarios, eliminating the possibility of division by zero, and improving compatibility across different 64-bit environments. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Rand Deeb <rand.sec96@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20bcache: remove redundant assignment to variable cur_idxColin Ian King
Variable cur_idx is being initialized with a value that is never read, it is being re-assigned later in a while-loop. Remove the redundant assignment. Cleans up clang scan build warning: drivers/md/bcache/writeback.c:916:2: warning: Value stored to 'cur_idx' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Coly Li <colyli@suse.de> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20231120052503.6122-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20bcache: check return value from btree_node_alloc_replacement()Coly Li
In btree_gc_rewrite_node(), pointer 'n' is not checked after it returns from btree_gc_rewrite_node(). There is potential possibility that 'n' is a non NULL ERR_PTR(), referencing such error code is not permitted in following code. Therefore a return value checking is necessary after 'n' is back from btree_node_alloc_replacement(). Signed-off-by: Coly Li <colyli@suse.de> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20231120052503.6122-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20bcache: avoid oversize memory allocation by small stripe_sizeColy Li
Arraies bcache->stripe_sectors_dirty and bcache->full_dirty_stripes are used for dirty data writeback, their sizes are decided by backing device capacity and stripe size. Larger backing device capacity or smaller stripe size make these two arraies occupies more dynamic memory space. Currently bcache->stripe_size is directly inherited from queue->limits.io_opt of underlying storage device. For normal hard drives, its limits.io_opt is 0, and bcache sets the corresponding stripe_size to 1TB (1<<31 sectors), it works fine 10+ years. But for devices do declare value for queue->limits.io_opt, small stripe_size (comparing to 1TB) becomes an issue for oversize memory allocations of bcache->stripe_sectors_dirty and bcache->full_dirty_stripes, while the capacity of hard drives gets much larger in recent decade. For example a raid5 array assembled by three 20TB hardrives, the raid device capacity is 40TB with typical 512KB limits.io_opt. After the math calculation in bcache code, these two arraies will occupy 400MB dynamic memory. Even worse Andrea Tomassetti reports that a 4KB limits.io_opt is declared on a new 2TB hard drive, then these two arraies request 2GB and 512MB dynamic memory from kzalloc(). The result is that bcache device always fails to initialize on his system. To avoid the oversize memory allocation, bcache->stripe_size should not directly inherited by queue->limits.io_opt from the underlying device. This patch defines BCH_MIN_STRIPE_SZ (4MB) as minimal bcache stripe size and set bcache device's stripe size against the declared limits.io_opt value from the underlying storage device, - If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, bcache device will set its stripe size directly by this limits.io_opt value. - If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, bcache device will set its stripe size by a value multiplying limits.io_opt and euqal or large than BCH_MIN_STRIPE_SZ. Then the minimal stripe size of a bcache device will always be >= 4MB. For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be 50MB in total. For a 2TB hard drive with 4KB limits.io_opt, memory occupied by these two arraies will be 2.5MB in total. Such mount of memory allocated for bcache->stripe_sectors_dirty and bcache->full_dirty_stripes is reasonable for most of storage devices. Reported-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com> Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Eric Wheeler <bcache@lists.ewheeler.net> Link: https://lore.kernel.org/r/20231120052503.6122-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20drm/rockchip: vop: Fix color for RGB888/BGR888 format on VOP fullJonas Karlman
Use of DRM_FORMAT_RGB888 and DRM_FORMAT_BGR888 on e.g. RK3288, RK3328 and RK3399 result in wrong colors being displayed. The issue can be observed using modetest: modetest -s <connector_id>@<crtc_id>:1920x1080-60@RG24 modetest -s <connector_id>@<crtc_id>:1920x1080-60@BG24 Vendor 4.4 kernel apply an inverted rb swap for these formats on VOP full framework (IP version 3.x) compared to VOP little framework (2.x). Fix colors by applying different rb swap for VOP full framework (3.x) and VOP little framework (2.x) similar to vendor 4.4 kernel. Fixes: 85a359f25388 ("drm/rockchip: Add BGR formats to VOP") Signed-off-by: Jonas Karlman <jonas@kwiboo.se> Tested-by: Diederik de Haas <didi.debian@cknow.org> Reviewed-by: Christopher Obbard <chris.obbard@collabora.com> Tested-by: Christopher Obbard <chris.obbard@collabora.com> Signed-off-by: Heiko Stuebner <heiko@sntech.de> Link: https://patchwork.freedesktop.org/patch/msgid/20231026191500.2994225-1-jonas@kwiboo.se
2023-11-20io_uring/fs: consider link->flags when getting path for LINKATCharles Mirabile
In order for `AT_EMPTY_PATH` to work as expected, the fact that the user wants that behavior needs to make it to `getname_flags` or it will return ENOENT. Fixes: cf30da90bc3a ("io_uring: add support for IORING_OP_LINKAT") Cc: <stable@vger.kernel.org> Link: https://github.com/axboe/liburing/issues/995 Signed-off-by: Charles Mirabile <cmirabil@redhat.com> Link: https://lore.kernel.org/r/20231120105545.1209530-1-cmirabil@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20libfs: getdents() should return 0 after reaching EODChuck Lever
The new directory offset helpers don't conform with the convention of getdents() returning no more entries once a directory file descriptor has reached the current end-of-directory. To address this, copy the logic from dcache_readdir() to mark the open directory file descriptor once EOD has been reached. Seeking resets the mark. Reported-by: Tavian Barnes <tavianator@tavianator.com> Closes: https://lore.kernel.org/linux-fsdevel/20231113180616.2831430-1-tavianator@tavianator.com/ Fixes: 6faddda69f62 ("libfs: Add directory operations for stable offsets") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/170043792492.4628.15646203084646716134.stgit@bazille.1015granger.net Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-20xfs: respect the stable writes flag on the RT deviceChristoph Hellwig
Update the per-folio stable writes flag dependening on which device an inode resides on. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231025141020.192413-5-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>