summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_trace.h
AgeCommit message (Collapse)Author
2022-05-04Merge tag 'reflink-speedups-5.19_2022-04-28' of ↵Dave Chinner
git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.19-for-next xfs: fix reflink inefficiencies As Dave Chinner has complained about on IRC, there are a couple of things about reflink that are very inefficient. First of all, we limited the size of all bunmapi operations to avoid flooding the log with defer ops in the worst case, but recent changes to the defer ops code have solved that problem, so get rid of the bunmapi length clamp. Second, the log reservations for reflink operations are far far larger than they need to be. Shrink them to exactly what we need to handle each deferred RUI and CUI log item, and no more. Also reduce logcount because we don't need 8 rolls per operation. Introduce a transaction reservation compatibility layer to avoid changing the minimum log size calculations. Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-04Merge tag 'rmap-speedups-5.19_2022-04-28' of ↵Dave Chinner
git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.19-for-next xfs: fix rmap inefficiencies Reduce the performance impact of the reverse mapping btree when reflink is enabled by using the much faster non-overlapped btree lookup functions when we're searching the rmap index with a fully specified key. If we find the exact record we're looking for, great! We don't have to perform the full overlapped scan. For filesystems with high sharing factors this reduces the xfs_scrub runtime by a good 15%%. This has been shown to reduce the fstests runtime for realtime rmap configurations by 30%%, since the lack of AGs severely limits scalability. Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-05-04xfs: intent item whiteoutsDave Chinner
When we log modifications based on intents, we add both intent and intent done items to the modification being made. These get written to the log to ensure that the operation is re-run if the intent done is not found in the log. However, for operations that complete wholly within a single checkpoint, the change in the checkpoint is atomic and will never need replay. In this case, we don't need to actually write the intent and intent done items to the journal because log recovery will never need to manually restart this modification. Log recovery currently handles intent/intent done matching by inserting the intent into the AIL, then removing it when a matching intent done item is found. Hence for all the intent-based operations that complete within a checkpoint, we spend all that time parsing the intent/intent done items just to cancel them and do nothing with them. Hence it follows that the only time we actually need intents in the log is when the modification crosses checkpoint boundaries in the log and so may only be partially complete in the journal. Hence if we commit and intent done item to the CIL and the intent item is in the same checkpoint, we don't actually have to write them to the journal because log recovery will always cancel the intents. We've never really worried about the overhead of logging intents unnecessarily like this because the intents we log are generally very much smaller than the change being made. e.g. freeing an extent involves modifying at lease two freespace btree blocks and the AGF, so the EFI/EFD overhead is only a small increase in space and processing time compared to the overall cost of freeing an extent. However, delayed attributes change this cost equation dramatically, especially for inline attributes. In the case of adding an inline attribute, we only log the inode core and attribute fork at present. With delayed attributes, we now log the attr intent which includes the name and value, the inode core adn attr fork, and finally the attr intent done item. We increase the number of items we log from 1 to 3, and the number of log vectors (regions) goes up from 3 to 7. Hence we tripple the number of objects that the CIL has to process, and more than double the number of log vectors that need to be written to the journal. At scale, this means delayed attributes cause a non-pipelined CIL to become CPU bound processing all the extra items, resulting in a > 40% performance degradation on 16-way file+xattr create worklaods. Pipelining the CIL (as per 5.15) reduces the performance degradation to 20%, but now the limitation is the rate at which the log items can be written to the iclogs and iclogs be dispatched for IO and completed. Even log IO completion is slowed down by these intents, because it now has to process 3x the number of items in the checkpoint. Processing completed intents is especially inefficient here, because we first insert the intent into the AIL, then remove it from the AIL when the intent done is processed. IOWs, we are also doing expensive operations in log IO completion we could completely avoid if we didn't log completed intent/intent done pairs. Enter log item whiteouts. When an intent done is committed, we can check to see if the associated intent is in the same checkpoint as we are currently committing the intent done to. If so, we can mark the intent log item with a whiteout and immediately free the intent done item rather than committing it to the CIL. We can basically skip the entire formatting and CIL insertion steps for the intent done item. However, we cannot remove the intent item from the CIL at this point because the unlocked per-cpu CIL item lists do not permit removal without holding the CIL context lock exclusively. Transaction commit only holds the context lock shared, hence the best we can do is mark the intent item with a whiteout so that the CIL push can release it rather than writing it to the log. This means we never write the intent to the log if the intent done has also been committed to the same checkpoint, but we'll always write the intent if the intent done has not been committed or has been committed to a different checkpoint. This will result in correct log recovery behaviour in all cases, without the overhead of logging unnecessary intents. This intent whiteout concept is generic - we can apply it to all intent/intent done pairs that have a direct 1:1 relationship. The way deferred ops iterate and relog intents mean that all intents currently have a 1:1 relationship with their done intent, and hence we can apply this cancellation to all existing intent/intent done implementations. For delayed attributes with a 16-way 64kB xattr create workload, whiteouts reduce the amount of journalled metadata from ~2.5GB/s down to ~600MB/s and improve the creation rate from 9000/s to 14000/s. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-28xfs: rewrite xfs_reflink_end_cow to use intentsDarrick J. Wong
Currently, the code that performs CoW remapping after a write has this odd behavior where it walks /backwards/ through the data fork to remap extents in reverse order. Earlier, we rewrote the reflink remap function to use deferred bmap log items instead of trying to cram as much into the first transaction that we could. Now do the same for the CoW remap code. There doesn't seem to be any performance impact; we're just making better use of code that we added for the benefit of reflink. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-28xfs: report "max_resp" used for min log size computationDarrick J. Wong
Move the tracepoint that computes the size of the transaction used to compute the minimum log size into xfs_log_get_max_trans_res so that we only have to compute this stuff once. Leave xfs_log_get_max_trans_res as a non-static function so that xfs_db can call it to report the results of the userspace computation of the same value to diagnose mkfs/kernel misinteractions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-28xfs: create shadow transaction reservations for computing minimum log sizeDarrick J. Wong
Every time someone changes the transaction reservation sizes, they introduce potential compatibility problems if the changes affect the minimum log size that we validate at mount time. If the minimum log size gets larger (which should be avoided because doing so presents a serious risk of log livelock), filesystems created with old mkfs will not mount on a newer kernel; if the minimum size shrinks, filesystems created with newer mkfs will not mount on older kernels. Therefore, enable the creation of a shadow log reservation structure where we can "undo" the effects of tweaks when computing minimum log sizes. These shadow reservations should never be used in practice, but they insulate us from perturbations in minimum log size. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-04-27xfs: capture buffer ops in the xfs_buf tracepointsDarrick J. Wong
Record the buffer ops in the xfs_buf tracepoints so that we can monitor the alleged type of the buffer. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-04-21Merge tag 'large-extent-counters-v9' of https://github.com/chandanr/linux ↵Dave Chinner
into xfs-5.19-for-next xfs: Large extent counters The commit xfs: fix inode fork extent count overflow (3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) mentions that 10 billion data fork extents should be possible to create. However the corresponding on-disk field has a signed 32-bit type. Hence this patchset extends the per-inode data fork extent counter to 64 bits (out of which 48 bits are used to store the extent count). Also, XFS has an attribute fork extent counter which is 16 bits wide. A workload that, 1. Creates 1 million 255-byte sized xattrs, 2. Deletes 50% of these xattrs in an alternating manner, 3. Tries to insert 400,000 new 255-byte sized xattrs causes the xattr extent counter to overflow. Dave tells me that there are instances where a single file has more than 100 million hardlinks. With parent pointers being stored in xattrs, we will overflow the signed 16-bits wide attribute extent counter when large number of hardlinks are created. Hence this patchset extends the on-disk field to 32-bits. The following changes are made to accomplish this, 1. A 64-bit inode field is carved out of existing di_pad and di_flushiter fields to hold the 64-bit data fork extent counter. 2. The existing 32-bit inode data fork extent counter will be used to hold the attribute fork extent counter. 3. A new incompat superblock flag to prevent older kernels from mounting the filesystem. Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21xfs: convert quota options flags to unsigned.Dave Chinner
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned fields to be unsigned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21xfs: convert da btree operations flags to unsigned.Dave Chinner
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned fields to be unsigned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-11xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectivelyChandan Babu R
A future commit will introduce a 64-bit on-disk data extent counter and a 32-bit on-disk attr extent counter. This commit promotes xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits in order to correctly handle in-core versions of these quantities. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2022-04-11xfs: Use xfs_extnum_t instead of basic data typesChandan Babu R
xfs_extnum_t is the type to use to declare variables which have values obtained from xfs_dinode->di_[a]nextents. This commit replaces basic types (e.g. uint32_t) with xfs_extnum_t for such variables. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2022-03-20xfs: log items should have a xlog pointer, not a mountDave Chinner
Log items belong to the log, not the xfs_mount. Convert the mount pointer in the log item to a xlog pointer in preparation for upcoming log centric changes to the log items. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-14xfs: constify the name argument to various directory functionsDarrick J. Wong
Various directory functions do not modify their @name parameter, so mark it const to make that clear. This will enable us to mark the global xfs_name_dotdot variable as const to prevent mischief. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: prepare xfs_btree_cur for dynamic cursor heightsDarrick J. Wong
Split out the btree level information into a separate struct and put it at the end of the cursor structure as a VLA. Files with huge data forks (and in the future, the realtime rmap btree) will require the ability to support many more levels than a per-AG btree cursor, which means that we're going to create per-btree type cursor caches to conserve memory for the more common case. Note that a subsequent patch actually introduces dynamic cursor heights. This one merely rearranges the structure to prepare for that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-19xfs: convert bp->b_bn references to xfs_buf_daddr()Dave Chinner
Stop directly referencing b_bn in code outside the buffer cache, as b_bn is supposed to be used only as an internal cache index. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: convert remaining mount flags to state flagsDave Chinner
The remaining mount flags kept in m_flags are actually runtime state flags. These change dynamically, so they really should be updated atomically so we don't potentially lose an update due to racing modifications. Convert these remaining flags to be stored in m_opstate and use atomic bitops to set and clear the flags. This also adds a couple of simple wrappers for common state checks - read only and shutdown. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: start documenting common units and tags used in tracepointsDarrick J. Wong
Because there are a lot of tracepoints that express numeric data with an associated unit and tag, document what they are to help everyone else keep these thigns straight. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: standardize inode generation formatting in ftrace outputDarrick J. Wong
Always print inode generation in hexadecimal and preceded with the unit "gen". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: standardize remaining xfs_buf length tracepointsDarrick J. Wong
For the remaining xfs_buf tracepoints, convert all the tags to xfs_daddr_t units and retag them 'daddrcount' to match everything else. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: resolve fork names in trace outputDarrick J. Wong
Emit whichfork values as text strings in the ftrace output. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: rename i_disk_size fields in ftrace outputDarrick J. Wong
Whenever we record i_disk_size (i.e. the ondisk file size), use the "disize" tag and hexadecimal format consistently. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: disambiguate units for ftrace fields tagged "count"Darrick J. Wong
Some of our tracepoints have a field known as "count". That name doesn't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal when we're referring to blocks, extents, or IO operations. "fsbcount" are in units of fs blocks "bytecount" are in units of bytes Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: disambiguate units for ftrace fields tagged "len"Darrick J. Wong
Some of our tracepoints have a field known as "len". That name doesn't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal. "fsbcount" are in units of fs blocks "bbcount" are in units of 512b blocks "ireccount" are in units of inodes Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: disambiguate units for ftrace fields tagged "offset"Darrick J. Wong
Some of our tracepoints describe fields as "offset". That name doesn't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal. "fileoff" means file offset, in units of fs blocks "pos" means file offset, in bytes "forkoff" means inode fork offset, in bytes The one remaining "offset" value is for iclogs, since that's the byte offset of the end of where we've written into the current iclog. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: disambiguate units for ftrace fields tagged "blkno", "block", or "bno"Darrick J. Wong
Some of our tracepoints describe fields as "blkno", "block", or "bno". That name doesn't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal. "startblock" is the startblock field from the bmap structure, which is a segmented fsblock on the data device, or an rfsblock on the realtime device. "fileoff" is a file offset, in units of filesystem blocks "daddr" is a raw device offset, in 512b blocks Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: standardize daddr formatting in ftrace outputDarrick J. Wong
Always print disk addr (i.e. 512 byte block) numbers in hexadecimal and preceded with the unit "daddr". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: standardize rmap owner number formatting in ftrace outputDarrick J. Wong
Always print rmap owner number in hexadecimal and preceded with the unit "owner". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: standardize AG block number formatting in ftrace outputDarrick J. Wong
Always print allocation group block numbers in hexadecimal and preceded with the unit "agbno". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: standardize AG number formatting in ftrace outputDarrick J. Wong
Always print allocation group numbers in hexadecimal and preceded with the unit "agno". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-19xfs: standardize inode number formatting in ftrace outputDarrick J. Wong
Always print inode numbers in hexadecimal and preceded with the unit "ino" or "agino", as apropriate. Fix one tracepoint that used "ino %u" for an inode btree block count to reduce confusion. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-08-18xfs: make the record pointer passed to query_range functions constDarrick J. Wong
The query_range functions are supposed to call a caller-supplied function on each record found in the dataset. These functions don't own the memory storing the record, so don't let them change the record. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-08-18xfs: add trace point for fs shutdownDarrick J. Wong
Add a tracepoint for fs shutdowns so we can capture that in ftrace output. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-08-16xfs: XLOG_STATE_IOERROR must dieDave Chinner
We don't need an iclog state field to tell us the log has been shut down. We can just check the xlog_is_shutdown() instead. The avoids the need to have shutdown overwrite the current iclog state while being active used by the log code and so having to ensure that every iclog state check handles XLOG_STATE_IOERROR appropriately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-09xfs: add attr state machine tracepointsAllison Henderson
This is a quick patch to add a new xfs_attr_*_return tracepoints. We use these to track when ever a new state is set or -EAGAIN is returned Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-09xfs: replace kmem_alloc_large() with kvmalloc()Dave Chinner
There is no reason for this wrapper existing anymore. All the places that use KM_NOFS allocation are within transaction contexts and hence covered by memalloc_nofs_save/restore contexts. Hence we don't need any special handling of vmalloc for large IOs anymore and so special casing this code isn't necessary. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-09xfs: remove kmem_alloc_io()Dave Chinner
Since commit 59bb47985c1d ("mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)"), the core slab code now guarantees slab alignment in all situations sufficient for IO purposes (i.e. minimum of 512 byte alignment of >= 512 byte sized heap allocations) we no longer need the workaround in the XFS code to provide this guarantee. Replace the use of kmem_alloc_io() with kmem_alloc() or kmem_alloc_large() appropriately, and remove the kmem_alloc_io() interface altogether. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-09xfs: throttle inode inactivation queuing on memory reclaimDarrick J. Wong
Now that we defer inode inactivation, we've decoupled the process of unlinking or closing an inode from the process of inactivating it. In theory this should lead to better throughput since we now inactivate the queued inodes in batches instead of one at a time. Unfortunately, one of the primary risks with this decoupling is the loss of rate control feedback between the frontend and background threads. In other words, a rm -rf /* thread can run the system out of memory if it can queue inodes for inactivation and jump to a new CPU faster than the background threads can actually clear the deferred work. The workers can get scheduled off the CPU if they have to do IO, etc. To solve this problem, we configure a shrinker so that it will activate the /second/ time the shrinkers are called. The custom shrinker will queue all percpu deferred inactivation workers immediately and set a flag to force frontend callers who are releasing a vfs inode to wait for the inactivation workers. On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve most of the OOMing problem when deleting 10 million inodes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-09xfs: use background worker pool when transactions can't get free spaceDarrick J. Wong
In xfs_trans_alloc, if the block reservation call returns ENOSPC, we call xfs_blockgc_free_space with a NULL icwalk structure to try to free space. Each frontend thread that encounters this situation starts its own walk of the inode cache to see if it can find anything, which is wasteful since we don't have any additional selection criteria. For this one common case, create a function that reschedules all pending background work immediately and flushes the workqueue so that the scan can run in parallel. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-09xfs: don't run speculative preallocation gc when fs is frozenDarrick J. Wong
Now that we have the infrastructure to switch background workers on and off at will, fix the block gc worker code so that we don't actually run the worker when the filesystem is frozen, same as we do for deferred inactivation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: per-cpu deferred inode inactivation queuesDave Chinner
Move inode inactivation to background work contexts so that it no longer runs in the context that releases the final reference to an inode. This will allow process work that ends up blocking on inactivation to continue doing work while the filesytem processes the inactivation in the background. A typical demonstration of this is unlinking an inode with lots of extents. The extents are removed during inactivation, so this blocks the process that unlinked the inode from the directory structure. By moving the inactivation to the background process, the userspace applicaiton can keep working (e.g. unlinking the next inode in the directory) while the inactivation work on the previous inode is done by a different CPU. The implementation of the queue is relatively simple. We use a per-cpu lockless linked list (llist) to queue inodes for inactivation without requiring serialisation mechanisms, and a work item to allow the queue to be processed by a CPU bound worker thread. We also keep a count of the queue depth so that we can trigger work after a number of deferred inactivations have been queued. The use of a bound workqueue with a single work depth allows the workqueue to run one work item per CPU. We queue the work item on the CPU we are currently running on, and so this essentially gives us affine per-cpu worker threads for the per-cpu queues. THis maintains the effective CPU affinity that occurs within XFS at the AG level due to all objects in a directory being local to an AG. Hence inactivation work tends to run on the same CPU that last accessed all the objects that inactivation accesses and this maintains hot CPU caches for unlink workloads. A depth of 32 inodes was chosen to match the number of inodes in an inode cluster buffer. This hopefully allows sequential allocation/unlink behaviours to defering inactivation of all the inodes in a single cluster buffer at a time, further helping maintain hot CPU and buffer cache accesses while running inactivations. A hard per-cpu queue throttle of 256 inode has been set to avoid runaway queuing when inodes that take a long to time inactivate are being processed. For example, when unlinking inodes with large numbers of extents that can take a lot of processing to free. Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: tweak comments and tracepoints, convert opflags to state bits] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-07-29xfs: need to see iclog flags in tracingDave Chinner
Because I cannot tell if the NEED_FLUSH flag is being set correctly by the log force and CIL push machinery without it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-21xfs: fix type mismatches in the inode reclaim functionsDarrick J. Wong
It's currently unlikely that we will ever end up with more than 4 billion inodes waiting for reclamation, but the fs object code uses long int for object counts and we're certainly capable of generating that many. Instead of truncating the internal counters, widen them and report the object counts correctly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-21xfs: refactor the inode recycling codeDarrick J. Wong
Hoist the code in xfs_iget_cache_hit that restores the VFS inode state to an xfs_inode that was previously vfs-destroyed. The next patch will add a new set of state flags, so we need the helper to avoid duplication. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-21xfs: add iclog state trace eventsDave Chinner
For the DEBUGS! Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-18Merge tag 'xfs-delay-ready-attrs-v20.1' of ↵Darrick J. Wong
https://github.com/allisonhenderson/xfs_work into xfs-5.14-merge4 xfs: Delay Ready Attributes Hi all, This set is a subset of a larger series for Dealyed Attributes. Which is a subset of a yet larger series for parent pointers. Delayed attributes allow attribute operations (set and remove) to be logged and committed in the same way that other delayed operations do. This allows more complex operations (like parent pointers) to be broken up into multiple smaller transactions. To do this, the existing attr operations must be modified to operate as a delayed operation. This means that they cannot roll, commit, or finish transactions. Instead, they return -EAGAIN to allow the calling function to handle the transaction. In this series, we focus on only the delayed attribute portion. We will introduce parent pointers in a later set. The set as a whole is a bit much to digest at once, so I usually send out the smaller sub series to reduce reviewer burn out. But the entire extended series is visible through the included github links. Updates since v19: Added Darricks fix for the remote block accounting as well as some minor nits about the default assert in xfs_attr_set_iter. Spent quite a bit of time testing this cycle to weed out any more unexpected bugs. No new test failures were observed with the addition of this set. xfs: Fix default ASSERT in xfs_attr_set_iter Replaced the assert with ASSERT(0); xfs: Add delay ready attr remove routines Added Darricks fix for remote block accounting This series can be viewed on github here: https://github.com/allisonhenderson/xfs_work/tree/delay_ready_attrs_v20 As well as the extended delayed attribute and parent pointer series: https://github.com/allisonhenderson/xfs_work/tree/delay_ready_attrs_v20_extended And the test cases: https://github.com/allisonhenderson/xfs_work/tree/pptr_xfstestsv3 In order to run the test cases, you will need have the corresponding xfsprogs changes as well. Which can be found here: https://github.com/allisonhenderson/xfs_work/tree/delay_ready_attrs_xfsprogs_v20 https://github.com/allisonhenderson/xfs_work/tree/delay_ready_attrs_xfsprogs_v20_extended To run the xfs attributes tests run: check -g attr To run as delayed attributes run: export MOUNT_OPTIONS="-o delattr" check -g attr To run parent pointer tests: check -g parent I've also made the corresponding updates to the user space side as well, and ported anything they need to seat correctly. Questions, comment and feedback appreciated! Thanks all! Allison * tag 'xfs-delay-ready-attrs-v20.1' of https://github.com/allisonhenderson/xfs_work: xfs: Make attr name schemes consistent xfs: Fix default ASSERT in xfs_attr_set_iter xfs: Clean up xfs_attr_node_addname_clear_incomplete xfs: Remove xfs_attr_rmtval_set xfs: Add delay ready attr set routines xfs: Add delay ready attr remove routines xfs: Hoist node transaction handling xfs: Hoist xfs_attr_leaf_addname xfs: Hoist xfs_attr_node_addname xfs: Add helper xfs_attr_node_addname_find_attr xfs: Separate xfs_attr_node_addname and xfs_attr_node_addname_clear_incomplete xfs: Refactor xfs_attr_set_shortform xfs: Add xfs_attr_node_remove_name xfs: Reverse apply 72b97ea40d
2021-06-08xfs: rename struct xfs_eofblocks to xfs_icwalkDarrick J. Wong
The xfs_eofblocks structure is no longer well-named -- nowadays it provides optional filtering criteria to any walk of the incore inode cache. Only one of the cache walk goals has anything to do with clearing of speculative post-EOF preallocations, so change the name to be more appropriate. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-08Merge tag 'inode-walk-cleanups-5.14_2021-06-03' of ↵Darrick J. Wong
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-5.14-merge2 xfs: clean up incore inode walk functions This ambitious series aims to cleans up redundant inode walk code in xfs_icache.c, hide implementation details of the quotaoff dquot release code, and eliminates indirect function calls from incore inode walks. The first thing it does is to move all the code that quotaoff calls to release dquots from all incore inodes into xfs_icache.c. Next, it separates the goal of an inode walk from the actual radix tree tags that may or may not be involved and drops the kludgy XFS_ICI_NO_TAG thing. Finally, we split the speculative preallocation (blockgc) and quotaoff dquot release code paths into separate functions so that we can keep the implementations cohesive. Christoph suggested last cycle that we 'simply' change quotaoff not to allow deactivating quota entirely, but as these cleanups are to enable one major change in behavior (deferred inode inactivation) I do not want to add a second behavior change (quotaoff) as a dependency. To be blunt: Additional cleanups are not in scope for this series. Next, I made two observations about incore inode radix tree walks -- since there's a 1:1 mapping between the walk goal and the per-inode processing function passed in, we can use the goal to make a direct call to the processing function. Furthermore, the only caller to supply a nonzero iter_flags argument is quotaoff, and there's only one INEW flag. From that observation, I concluded that it's quite possible to remove two parameters from the xfs_inode_walk* function signatures -- the iter_flags, and the execute function pointer. The middle of the series moves the INEW functionality into the one piece (quotaoff) that wants it, and removes the indirect calls. The final observation is that the inode reclaim walk loop is now almost the same as xfs_inode_walk, so it's silly to maintain two copies. Merge the reclaim loop code into xfs_inode_walk. Lastly, refactor the per-ag radix tagging functions since there's duplicated code that can be consolidated. This series is a prerequisite for the next two patchsets, since deferred inode inactivation will add another inode radix tree tag and iterator function to xfs_inode_walk. v2: walk the vfs inode list when running quotaoff instead of the radix tree, then rework the (now completely internal) inode walk function to take the tag as the main parameter. v3: merge the reclaim loop into xfs_inode_walk, then consolidate the radix tree tagging functions v4: rebase to 5.13-rc4 v5: combine with the quotaoff patchset, reorder functions to minimize forward declarations, split inode walk goals from radix tree tags to reduce conceptual confusion v6: start moving the inode cache code towards the xfs_icwalk prefix * tag 'inode-walk-cleanups-5.14_2021-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: refactor per-AG inode tagging functions xfs: merge xfs_reclaim_inodes_ag into xfs_inode_walk_ag xfs: pass struct xfs_eofblocks to the inode scan callback xfs: fix radix tree tag signs xfs: make the icwalk processing functions clean up the grab state xfs: clean up inode state flag tests in xfs_blockgc_igrab xfs: remove indirect calls from xfs_inode_walk{,_ag} xfs: remove iter_flags parameter from xfs_inode_walk_* xfs: move xfs_inew_wait call into xfs_dqrele_inode xfs: separate the dqrele_all inode grab logic from xfs_inode_walk_ag_grab xfs: pass the goal of the incore inode walk to xfs_inode_walk() xfs: rename xfs_inode_walk functions to xfs_icwalk xfs: move the inode walk functions further down xfs: detach inode dquots at the end of inactivation xfs: move the quotaoff dqrele inode walk into xfs_icache.c [djwong: added variable names to function declarations while fixing merge conflicts] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-03xfs: refactor per-AG inode tagging functionsDarrick J. Wong
In preparation for adding another incore inode tree tag, refactor the code that sets and clears tags from the per-AG inode tree and the tree of per-AG structures, and remove the open-coded versions used by the blockgc code. Note: For reclaim, we now rely on the radix tree tags instead of the reclaimable inode count more heavily than we used to. The conversion should be fine, but the logic isn't 100% identical. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: merge xfs_reclaim_inodes_ag into xfs_inode_walk_agDarrick J. Wong
Merge these two inode walk loops together, since they're pretty similar now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>