From 96355d5a1f0ee6dcc182c37db4894ec0c29f1692 Mon Sep 17 00:00:00 2001
From: Dave Chinner
Date: Mon, 29 Jun 2020 14:48:45 -0700
Subject: xfs: Don't allow logging of XFS_ISTALE inodes

In tracking down a problem in this patchset, I discovered we are
reclaiming dirty stale inodes. This wasn't discovered until inodes
were always attached to the cluster buffer, at which point the rcu
callback that freed inodes started assert-failing because the inode
still had an active pointer to the cluster buffer after it had been
reclaimed.

Debugging indicated that this was a pre-existing issue resulting
from the way the inodes are handled in xfs_inactive_ifree. When we
free a cluster buffer from xfs_ifree_cluster, all the inodes in
cache are marked XFS_ISTALE. Those that are clean have nothing else
done to them and so eventually get cleaned up by background reclaim.
i.e. it is assumed we'll never dirty/relog an inode marked
XFS_ISTALE.

On journal commit, dirty stale inodes are handled by both the buffer
and inode log items: they run through xfs_istale_done() and are
removed from the AIL (buffer log item commit), or the inode log item
simply unpins them because the buffer log item will clean them. What
happens to any specific inode is entirely dependent on which log
item wins the commit race, but the result is the same - stale inodes
are clean, not attached to the cluster buffer, and not in the AIL.
Hence inode reclaim can just free these inodes without further care.

However, if the stale inode is relogged, it gets dirtied again and
relogged into the CIL. Most of the time this isn't an issue, because
relogging simply changes the inode's location in the current
checkpoint. Problems arise, however, when the CIL checkpoints
between two transactions in the xfs_inactive_ifree() deferops
processing. This results in the XFS_ISTALE inode being redirtied
and inserted into the CIL without any of the other stale cluster
buffer infrastructure being in place. Hence on journal commit, it
simply gets unpinned, so it remains dirty in memory.

Everything in inode writeback avoids XFS_ISTALE inodes, so the inode
can't be written back, and it is not tracked in the AIL so there's
not even a trigger to attempt to clean the inode. Hence the inode
just sits dirty in memory until inode reclaim comes along, sees that
it is XFS_ISTALE, and goes to reclaim it. This reclaiming of a dirty
inode caused use-after-free, list corruptions and other nasty issues
later in this patchset.

Hence this patch addresses a violation of the "never log XFS_ISTALE
inodes" rule caused by the deferops processing rolling a transaction
and relogging a stale inode in xfs_inactive_ifree(). It also adds a
bunch of asserts to catch this problem in debug kernels so that we
don't reintroduce this problem in the future.

Reproducer for this issue was generic/558 on a v4 filesystem.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
---
 fs/xfs/libxfs/xfs_trans_inode.c | 2 ++
 1 file changed, 2 insertions(+)
(limited to 'fs/xfs/libxfs/xfs_trans_inode.c')

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index b5dfb6654842..4504d215cd59 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -36,6 +36,7 @@ xfs_trans_ijoin(

 	ASSERT(iip->ili_lock_flags == 0);
 	iip->ili_lock_flags = lock_flags;
+	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));

 	/*
 	 * Get a log_item_desc to point at the new item.
@@ -89,6 +90,7 @@ xfs_trans_log_inode(

 	ASSERT(ip->i_itemp != NULL);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));

 	/*
 	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
-- cgit v1.2.3
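To make the invariant concrete: XFS_ISTALE is a terminal state - once set,
the inode must never be dirtied or logged again, because nothing in
writeback or the AIL will ever clean it afterwards. The following
standalone C sketch models that rule with the same assert-on-log approach
the patch adds; the obj/OBJ_STALE/obj_log names are invented for
illustration and this is not kernel code.

#include <assert.h>

enum { OBJ_DIRTY = 1 << 0, OBJ_STALE = 1 << 1 };

struct obj { unsigned int flags; };

/* models the new ASSERTs in xfs_trans_ijoin()/xfs_trans_log_inode() */
static void obj_log(struct obj *o)
{
	assert(!(o->flags & OBJ_STALE));	/* never log a stale object */
	o->flags |= OBJ_DIRTY;
}

/* models xfs_ifree_cluster() marking cached inodes XFS_ISTALE */
static void obj_mark_stale(struct obj *o)
{
	o->flags |= OBJ_STALE;
}

int main(void)
{
	struct obj o = { 0 };

	obj_log(&o);		/* normal dirtying: fine */
	obj_mark_stale(&o);	/* cluster freed: terminal state */
	obj_log(&o);		/* the bug: assert aborts here, just as the
				 * new debug checks would in the kernel */
	return 0;
}

The point of asserting in the logging paths rather than in reclaim is that
the abort fires at the moment the rule is violated, not long afterwards
when the dirty stale inode is finally reclaimed.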
From 1319ebefd6ed7a9988b7b4bc9317fbcf61a28bfc Mon Sep 17 00:00:00 2001
From: Dave Chinner
Date: Mon, 29 Jun 2020 14:48:46 -0700
Subject: xfs: add an inode item lock

The inode log item is kind of special in that it can be aggregating
new changes in memory at the same time that existing changes are
being written back to disk. This means there are fields in the log
item that are accessed concurrently from contexts that don't share
any locking at all.

e.g. updating ili_last_fields occurs at flush time under the
ILOCK_EXCL and the flush lock, the field is cleared under the flush
lock at IO completion time, and it is read under the ILOCK_EXCL when
the inode is logged. Hence there is no actual serialisation between
reading the field during logging of the inode in transactions vs
clearing the field in IO completion.

We currently get away with this because we are only clearing fields
in IO completion, and nothing bad happens if we accidentally log
more of the inode than we actually modify. Worst case is we consume
a tiny bit more memory and log bandwidth.

However, if we want to do more complex state manipulations on the
log item that require updates at all three of these potential
locations, we need to have some mechanism of serialising those
operations. To do this, introduce a spinlock into the log item to
serialise internal state.

This could be done via the xfs_inode i_flags_lock, but this then
leads to potential lock inversion issues where inode flag updates
need to occur inside locks that best nest inside the inode log item
locks (e.g. marking inodes stale during inode cluster freeing).
Using a separate spinlock avoids these sorts of problems and
simplifies future code.

This does not touch the use of ili_fields in the item formatting
code - that is entirely protected by the ILOCK_EXCL at this point in
time, so it remains untouched.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
---
 fs/xfs/libxfs/xfs_trans_inode.c | 52 ++++++++++++++++++++---------------------
 fs/xfs/xfs_file.c               |  9 ++++---
 fs/xfs/xfs_inode.c              | 20 +++++++++-------
 fs/xfs/xfs_inode_item.c         |  7 ++++++
 fs/xfs/xfs_inode_item.h         | 18 +++++++++++---
 5 files changed, 66 insertions(+), 40 deletions(-)
(limited to 'fs/xfs/libxfs/xfs_trans_inode.c')

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 4504d215cd59..c66d9d1dd58b 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -82,16 +82,20 @@ xfs_trans_ichgtime(
  */
 void
 xfs_trans_log_inode(
-	xfs_trans_t	*tp,
-	xfs_inode_t	*ip,
-	uint		flags)
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	uint			flags)
 {
-	struct inode	*inode = VFS_I(ip);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+	struct inode		*inode = VFS_I(ip);
+	uint			iversion_flags = 0;

-	ASSERT(ip->i_itemp != NULL);
+	ASSERT(iip);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));

+	tp->t_flags |= XFS_TRANS_DIRTY;
+
 	/*
 	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
 	 * don't matter - we either will need an extra transaction in 24 hours
@@ -104,15 +108,6 @@ xfs_trans_log_inode(
 		spin_unlock(&inode->i_lock);
 	}

-	/*
-	 * Record the specific change for fdatasync optimisation. This
-	 * allows fdatasync to skip log forces for inodes that are only
-	 * timestamp dirty. We do this before the change count so that
-	 * the core being logged in this case does not impact on fdatasync
-	 * behaviour.
-	 */
-	ip->i_itemp->ili_fsync_fields |= flags;
-
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
 	 * counter if it is configured for this to occur. While we have the
@@ -122,23 +117,28 @@ xfs_trans_log_inode(
 	 * set however, then go ahead and bump the i_version counter
 	 * unconditionally.
 	 */
-	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
-	    IS_I_VERSION(VFS_I(ip))) {
-		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
-			flags |= XFS_ILOG_CORE;
+	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
+		if (IS_I_VERSION(inode) &&
+		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
+			iversion_flags = XFS_ILOG_CORE;
 	}

-	tp->t_flags |= XFS_TRANS_DIRTY;
+	/*
+	 * Record the specific change for fdatasync optimisation. This allows
+	 * fdatasync to skip log forces for inodes that are only timestamp
+	 * dirty.
+	 */
+	spin_lock(&iip->ili_lock);
+	iip->ili_fsync_fields |= flags;

 	/*
-	 * Always OR in the bits from the ili_last_fields field.
-	 * This is to coordinate with the xfs_iflush() and xfs_iflush_done()
-	 * routines in the eventual clearing of the ili_fields bits.
-	 * See the big comment in xfs_iflush() for an explanation of
-	 * this coordination mechanism.
+	 * Always OR in the bits from the ili_last_fields field. This is to
+	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
+	 * the eventual clearing of the ili_fields bits. See the big comment in
+	 * xfs_iflush() for an explanation of this coordination mechanism.
 	 */
-	flags |= ip->i_itemp->ili_last_fields;
-	ip->i_itemp->ili_fields |= flags;
+	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
+	spin_unlock(&iip->ili_lock);
 }

 int
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index cc6528726187..01c098834c4b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -94,6 +94,7 @@ xfs_file_fsync(
 {
 	struct inode		*inode = file->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error = 0;
 	int			log_flushed = 0;
@@ -137,13 +138,15 @@ xfs_file_fsync(
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	if (xfs_ipincount(ip)) {
 		if (!datasync ||
-		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
-			lsn = ip->i_itemp->ili_last_lsn;
+		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+			lsn = iip->ili_last_lsn;
 	}

 	if (lsn) {
 		error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
-		ip->i_itemp->ili_fsync_fields = 0;
+		spin_lock(&iip->ili_lock);
+		iip->ili_fsync_fields = 0;
+		spin_unlock(&iip->ili_lock);
 	}
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 2f65fe70d305..d6da08165a2e 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2704,9 +2704,11 @@ xfs_ifree_cluster(
 			continue;

 		iip = ip->i_itemp;
+		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = iip->ili_fields;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
+		spin_unlock(&iip->ili_lock);
 		xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 					&iip->ili_item.li_lsn);

@@ -2742,6 +2744,7 @@ xfs_ifree(
 {
 	int			error;
 	struct xfs_icluster	xic = { 0 };
+	struct xfs_inode_log_item *iip = ip->i_itemp;

 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(VFS_I(ip)->i_nlink == 0);
@@ -2779,7 +2782,9 @@ xfs_ifree(
 	ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;

 	/* Don't attempt to replay owner changes for a deleted inode */
-	ip->i_itemp->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
+	spin_lock(&iip->ili_lock);
+	iip->ili_fields &= ~(XFS_ILOG_AOWNER | XFS_ILOG_DOWNER);
+	spin_unlock(&iip->ili_lock);

 	/*
 	 * Bump the generation count so no one will be confused
@@ -3835,20 +3840,19 @@ xfs_iflush_int(
 	 * know that the information those bits represent is permanently on
 	 * disk. As long as the flush completes before the inode is logged
 	 * again, then both ili_fields and ili_last_fields will be cleared.
-	 *
-	 * We can play with the ili_fields bits here, because the inode lock
-	 * must be held exclusively in order to set bits there and the flush
-	 * lock protects the ili_last_fields bits. Store the current LSN of the
-	 * inode so that we can tell whether the item has moved in the AIL from
-	 * xfs_iflush_done(). In order to read the lsn we need the AIL lock,
-	 * because it is a 64 bit value that cannot be read atomically.
 	 */
 	error = 0;
flush_out:
+	spin_lock(&iip->ili_lock);
 	iip->ili_last_fields = iip->ili_fields;
 	iip->ili_fields = 0;
 	iip->ili_fsync_fields = 0;
+	spin_unlock(&iip->ili_lock);
+
+	/*
+	 * Store the current LSN of the inode so that we can tell whether the
+	 * item has moved in the AIL from xfs_iflush_done().
+	 */
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index b17384aa8df4..6ef9cbcfc94a 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -637,6 +637,7 @@ xfs_inode_item_init(
 	iip = ip->i_itemp = kmem_zone_zalloc(xfs_ili_zone, 0);

 	iip->ili_inode = ip;
+	spin_lock_init(&iip->ili_lock);
 	xfs_log_item_init(mp, &iip->ili_item, XFS_LI_INODE,
 						&xfs_inode_item_ops);
 }
@@ -738,7 +739,11 @@ xfs_iflush_done(
 	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
 		list_del_init(&blip->li_bio_list);
 		iip = INODE_ITEM(blip);
+
+		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
+		spin_unlock(&iip->ili_lock);
+
 		xfs_ifunlock(iip->ili_inode);
 	}
 	list_del(&tmp);
@@ -762,9 +767,11 @@ xfs_iflush_abort(
 		 * Clear the inode logging fields so no more flushes are
 		 * attempted.
 		 */
+		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
+		spin_unlock(&iip->ili_lock);
 	}
 	/*
 	 * Release the inode's flush lock since we're done with it.
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 4de5070e0765..4a10a1b92ee9 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -16,12 +16,24 @@ struct xfs_mount;
 struct xfs_inode_log_item {
 	struct xfs_log_item	ili_item;	   /* common portion */
 	struct xfs_inode	*ili_inode;	   /* inode ptr */
-	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
-	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
-	unsigned short		ili_lock_flags;	   /* lock flags */
+	unsigned short		ili_lock_flags;	   /* inode lock flags */
+	/*
+	 * The ili_lock protects the interactions between the dirty state and
+	 * the flush state of the inode log item. This allows us to do atomic
+	 * modifications of multiple state fields without having to hold a
+	 * specific inode lock to serialise them.
+	 *
+	 * We need atomic changes between inode dirtying, inode flushing and
+	 * inode completion, but these all hold different combinations of
+	 * ILOCK and iflock and hence we need some other method of serialising
+	 * updates to the flush state.
+	 */
+	spinlock_t		ili_lock;	   /* flush state lock */
 	unsigned int		ili_last_fields;   /* fields when flushed */
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
+	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
+	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
 };

 static inline int xfs_inode_clean(xfs_inode_t *ip)
-- cgit v1.2.3
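The locking rule the new field establishes can be summarised with a small
userspace model: three paths that hold different outer locks (ILOCK, the
flush lock) all take one inner lock around their dirty/flush state updates.
This is a sketch only - a pthread mutex stands in for the kernel spinlock,
and the struct and function names are invented for illustration:

#include <pthread.h>

struct log_item {
	pthread_mutex_t	lock;		/* stands in for iip->ili_lock */
	unsigned int	fields;		/* models ili_fields */
	unsigned int	last_fields;	/* models ili_last_fields */
	unsigned int	fsync_fields;	/* models ili_fsync_fields */
};

static struct log_item item = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* logging path (holds ILOCK_EXCL): accumulate dirty state */
static void item_log(struct log_item *lip, unsigned int flags)
{
	pthread_mutex_lock(&lip->lock);
	lip->fsync_fields |= flags;
	/* OR last_fields back in, as xfs_trans_log_inode() now does */
	lip->fields |= flags | lip->last_fields;
	pthread_mutex_unlock(&lip->lock);
}

/* flush path (holds ILOCK + flush lock): mark state "being written" */
static void item_flush(struct log_item *lip)
{
	pthread_mutex_lock(&lip->lock);
	lip->last_fields = lip->fields;
	lip->fields = 0;
	lip->fsync_fields = 0;
	pthread_mutex_unlock(&lip->lock);
}

/* IO completion path (holds only the flush lock): writeback finished */
static void item_flush_done(struct log_item *lip)
{
	pthread_mutex_lock(&lip->lock);
	lip->last_fields = 0;
	pthread_mutex_unlock(&lip->lock);
}

int main(void)
{
	item_log(&item, 0x1);	/* transaction dirties the item */
	item_flush(&item);	/* writeback starts */
	item_log(&item, 0x2);	/* relogged mid-flush: stays dirty */
	item_flush_done(&item);	/* completion only clears last_fields */
	return 0;
}

The inner lock is what makes the multi-field transitions in item_flush()
atomic with respect to the other two paths, which is exactly what the
ILOCK/iflock combinations alone could not provide.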
From 298f7bec503f30bd98242ec02df6abe13b31a677 Mon Sep 17 00:00:00 2001
From: Dave Chinner
Date: Mon, 29 Jun 2020 14:49:15 -0700
Subject: xfs: pin inode backing buffer to the inode log item

When we dirty an inode, we are going to have to write it to disk at
some point in the near future. This requires the inode cluster
backing buffer to be present in memory. Unfortunately, under severe
memory pressure we can reclaim the inode backing buffer while the
inode is dirty in memory, and then AIL pushing stalls because it has
to do a read-modify-write cycle on the cluster buffer.

When we have no memory available, the read of the cluster buffer
blocks the AIL pushing process, and this causes all sorts of issues
for memory reclaim as it requires inode writeback to make forwards
progress. Allocating a cluster buffer causes more memory pressure,
and results in more cluster buffers being reclaimed, which means
more RMW cycles done in the AIL context, and everything then backs
up on AIL progress. Only the synchronous inode cluster writeback in
the inode reclaim code provides some level of forwards progress
guarantee that prevents OOM-killer rampages in this situation.

Fix this by pinning the inode backing buffer to the inode log item
when the inode is first dirtied (i.e. in xfs_trans_log_inode()).
This means the first modification of an inode that has been held in
cache for a long time may block on a cluster buffer read, but we can
do that in transaction context and block safely until the buffer has
been allocated and read.

Once we have the cluster buffer, the inode log item takes a
reference to it, pinning it in memory, and attaches it to the log
item for future reference. This means we can always grab the cluster
buffer from the inode log item when we need it.

When the inode is finally cleaned and removed from the AIL, we can
drop the reference the inode log item holds on the cluster buffer.
Once all inodes on the cluster buffer are clean, the cluster buffer
will be unpinned and it will be available for memory reclaim to
reclaim again. This avoids the issues with needing to do RMW cycles
in the AIL pushing context, and hence allows complete non-blocking
inode flushing to be performed by the AIL pushing context.

Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
---
 fs/xfs/libxfs/xfs_inode_buf.c   |  3 +-
 fs/xfs/libxfs/xfs_trans_inode.c | 53 +++++++++++++++++++++++++++++----
 fs/xfs/xfs_buf_item.c           |  4 +--
 fs/xfs/xfs_inode_item.c         | 61 +++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_trans_ail.c          |  8 ++++--
 5 files changed, 105 insertions(+), 24 deletions(-)
(limited to 'fs/xfs/libxfs/xfs_trans_inode.c')

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 5c93e8e6de74..b4a6c091571e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -176,7 +176,8 @@ xfs_imap_to_bp(
 	}

 	*bpp = bp;
-	*dipp = xfs_buf_offset(bp, imap->im_boffset);
+	if (dipp)
+		*dipp = xfs_buf_offset(bp, imap->im_boffset);
 	return 0;
 }

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index c66d9d1dd58b..ad5974365c58 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -8,6 +8,8 @@
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
@@ -72,13 +74,19 @@ xfs_trans_ichgtime(
 }

 /*
- * This is called to mark the fields indicated in fieldmask as needing
- * to be logged when the transaction is committed.  The inode must
- * already be associated with the given transaction.
+ * This is called to mark the fields indicated in fieldmask as needing to be
+ * logged when the transaction is committed. The inode must already be
+ * associated with the given transaction.
  *
- * The values for fieldmask are defined in xfs_inode_item.h.  We always
- * log all of the core inode if any of it has changed, and we always log
- * all of the inline data/extents/b-tree root if any of them has changed.
+ * The values for fieldmask are defined in xfs_inode_item.h. We always log all
+ * of the core inode if any of it has changed, and we always log all of the
+ * inline data/extents/b-tree root if any of them has changed.
+ *
+ * Grab and pin the cluster buffer associated with this inode to avoid RMW
+ * cycles at inode writeback time. Avoid the need to add error handling to every
+ * xfs_trans_log_inode() call by shutting down on read error. This will cause
+ * transactions to fail and everything to error out, just like if we return a
+ * read error in a dirty transaction and cancel it.
  */
 void
 xfs_trans_log_inode(
@@ -131,6 +139,39 @@ xfs_trans_log_inode(
 	spin_lock(&iip->ili_lock);
 	iip->ili_fsync_fields |= flags;

+	if (!iip->ili_item.li_buf) {
+		struct xfs_buf	*bp;
+		int		error;
+
+		/*
+		 * We hold the ILOCK here, so this inode is not going to be
+		 * flushed while we are here. Further, because there is no
+		 * buffer attached to the item, we know that there is no IO in
+		 * progress, so nothing will clear the ili_fields while we read
+		 * in the buffer. Hence we can safely drop the spin lock and
+		 * read the buffer knowing that the state will not change from
+		 * here.
+		 */
+		spin_unlock(&iip->ili_lock);
+		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, NULL,
+					&bp, 0);
+		if (error) {
+			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
+			return;
+		}
+
+		/*
+		 * We need an explicit buffer reference for the log item but
+		 * don't want the buffer to remain attached to the transaction.
+		 * Hold the buffer but release the transaction reference.
+		 */
+		xfs_buf_hold(bp);
+		xfs_trans_brelse(tp, bp);
+
+		spin_lock(&iip->ili_lock);
+		iip->ili_item.li_buf = bp;
+	}
+
 	/*
 	 * Always OR in the bits from the ili_last_fields field. This is to
 	 * coordinate with the xfs_iflush() and xfs_iflush_done() routines in
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index d61f20b989cd..ecb3362395af 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1143,11 +1143,9 @@ xfs_buf_inode_iodone(
 		if (ret == XBF_IOERROR_DONE)
 			return;
 		ASSERT(ret == XBF_IOERROR_FAIL);
-		spin_lock(&bp->b_mount->m_ail->ail_lock);
 		list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
-			xfs_set_li_failed(lip, bp);
+			set_bit(XFS_LI_FAILED, &lip->li_flags);
 		}
-		spin_unlock(&bp->b_mount->m_ail->ail_lock);
 		xfs_buf_ioerror(bp, 0);
 		xfs_buf_relse(bp);
 		return;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 0ba75764a8dc..64bdda72f7b2 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -439,6 +439,7 @@ xfs_inode_item_pin(
 	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;

 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
+	ASSERT(lip->li_buf);

 	trace_xfs_inode_pin(ip, _RET_IP_);
 	atomic_inc(&ip->i_pincount);
@@ -450,6 +451,12 @@ xfs_inode_item_pin(
  * item which was previously pinned with a call to xfs_inode_item_pin().
  *
  * Also wake up anyone in xfs_iunpin_wait() if the count goes to 0.
+ *
+ * Note that unpin can race with inode cluster buffer freeing marking the buffer
+ * stale. In that case, flush completions are run from the buffer unpin call,
+ * which may happen before the inode is unpinned. If we lose the race, there
+ * will be no buffer attached to the log item, but the inode will be marked
+ * XFS_ISTALE.
  */
 STATIC void
 xfs_inode_item_unpin(
@@ -459,6 +466,7 @@ xfs_inode_item_unpin(
 	struct xfs_inode	*ip = INODE_ITEM(lip)->ili_inode;

 	trace_xfs_inode_unpin(ip, _RET_IP_);
+	ASSERT(lip->li_buf || xfs_iflags_test(ip, XFS_ISTALE));
 	ASSERT(atomic_read(&ip->i_pincount) > 0);
 	if (atomic_dec_and_test(&ip->i_pincount))
 		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
@@ -629,10 +637,15 @@ xfs_inode_item_init(
  */
 void
 xfs_inode_item_destroy(
-	xfs_inode_t	*ip)
+	struct xfs_inode	*ip)
 {
-	kmem_free(ip->i_itemp->ili_item.li_lv_shadow);
-	kmem_cache_free(xfs_ili_zone, ip->i_itemp);
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+
+	ASSERT(iip->ili_item.li_buf == NULL);
+
+	ip->i_itemp = NULL;
+	kmem_free(iip->ili_item.li_lv_shadow);
+	kmem_cache_free(xfs_ili_zone, iip);
 }

@@ -673,11 +686,10 @@ xfs_iflush_done(
 		list_move_tail(&lip->li_bio_list, &tmp);

 		/* Do an unlocked check for needing the AIL lock. */
-		if (lip->li_lsn == iip->ili_flush_lsn ||
+		if (iip->ili_flush_lsn == lip->li_lsn ||
 		    test_bit(XFS_LI_FAILED, &lip->li_flags))
 			need_ail++;
 	}
-	ASSERT(list_empty(&bp->b_li_list));

 	/*
 	 * We only want to pull the item from the AIL if it is actually there
@@ -690,7 +702,7 @@ xfs_iflush_done(
 		/* this is an opencoded batch version of xfs_trans_ail_delete */
 		spin_lock(&ailp->ail_lock);
 		list_for_each_entry(lip, &tmp, li_bio_list) {
-			xfs_clear_li_failed(lip);
+			clear_bit(XFS_LI_FAILED, &lip->li_flags);
 			if (lip->li_lsn == INODE_ITEM(lip)->ili_flush_lsn) {
 				xfs_lsn_t lsn = xfs_ail_delete_one(ailp, lip);
 				if (!tail_lsn && lsn)
@@ -706,14 +718,29 @@ xfs_iflush_done(
 	 * them is safely on disk.
 	 */
 	list_for_each_entry_safe(lip, n, &tmp, li_bio_list) {
+		bool	drop_buffer = false;
+
 		list_del_init(&lip->li_bio_list);
 		iip = INODE_ITEM(lip);

 		spin_lock(&iip->ili_lock);
+
+		/*
+		 * Remove the reference to the cluster buffer if the inode is
+		 * clean in memory. Drop the buffer reference once we've dropped
+		 * the locks we hold.
+		 */
+		ASSERT(iip->ili_item.li_buf == bp);
+		if (!iip->ili_fields) {
+			iip->ili_item.li_buf = NULL;
+			drop_buffer = true;
+		}
 		iip->ili_last_fields = 0;
+		iip->ili_flush_lsn = 0;
 		spin_unlock(&iip->ili_lock);
-
 		xfs_ifunlock(iip->ili_inode);
+		if (drop_buffer)
+			xfs_buf_rele(bp);
 	}
 }

@@ -725,12 +752,20 @@ xfs_iflush_done(
  */
 void
 xfs_iflush_abort(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip)
 {
-	struct xfs_inode_log_item *iip = ip->i_itemp;
+	struct xfs_inode_log_item *iip = ip->i_itemp;
+	struct xfs_buf		*bp = NULL;

 	if (iip) {
+		/*
+		 * Clear the failed bit before removing the item from the AIL so
+		 * xfs_trans_ail_delete() doesn't try to clear and release the
+		 * buffer attached to the log item before we are done with it.
+		 */
+		clear_bit(XFS_LI_FAILED, &iip->ili_item.li_flags);
 		xfs_trans_ail_delete(&iip->ili_item, 0);
+
 		/*
 		 * Clear the inode logging fields so no more flushes are
 		 * attempted.
@@ -739,12 +774,14 @@ xfs_iflush_abort(
 		iip->ili_last_fields = 0;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
+		iip->ili_flush_lsn = 0;
+		bp = iip->ili_item.li_buf;
+		iip->ili_item.li_buf = NULL;
 		spin_unlock(&iip->ili_lock);
 	}
-	/*
-	 * Release the inode's flush lock since we're done with it.
-	 */
 	xfs_ifunlock(ip);
+	if (bp)
+		xfs_buf_rele(bp);
 }

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index ac33f6393f99..c3be6e440134 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -377,8 +377,12 @@ xfsaild_resubmit_item(
 	}

 	/* protected by ail_lock */
-	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
-		xfs_clear_li_failed(lip);
+	list_for_each_entry(lip, &bp->b_li_list, li_bio_list) {
+		if (bp->b_flags & _XBF_INODES)
+			clear_bit(XFS_LI_FAILED, &lip->li_flags);
+		else
+			xfs_clear_li_failed(lip);
+	}

 	xfs_buf_unlock(bp);
 	return XFS_ITEM_SUCCESS;
-- cgit v1.2.3
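The lifetime rule introduced here - take a counted reference on the backing
buffer at first dirtying, drop it when the object is cleaned - can be shown
in a few lines of standalone C. This is an illustrative userspace model,
not kernel code; the buffer/object types and all function names are
invented for the sketch.

#include <stdlib.h>

struct buffer {
	int	refcount;
};

struct object {
	struct buffer	*buf;	/* models lip->li_buf: NULL when clean */
	int		dirty;
};

static void buf_hold(struct buffer *bp)	/* models xfs_buf_hold() */
{
	bp->refcount++;
}

static void buf_rele(struct buffer *bp)	/* models xfs_buf_rele() */
{
	if (--bp->refcount == 0)
		free(bp);	/* last reference gone: reclaimable */
}

/* first dirtying pins the backing buffer (kernel: xfs_trans_log_inode) */
static void object_dirty(struct object *obj, struct buffer *bp)
{
	if (!obj->buf) {
		buf_hold(bp);
		obj->buf = bp;
	}
	obj->dirty = 1;
}

/* cleaning drops the pin (kernel: xfs_iflush_done/xfs_iflush_abort) */
static void object_clean(struct object *obj)
{
	struct buffer *bp = obj->buf;

	obj->dirty = 0;
	obj->buf = NULL;
	if (bp)
		buf_rele(bp);
}

int main(void)
{
	struct buffer *bp = calloc(1, sizeof(*bp));
	struct object obj = { 0 };

	if (!bp)
		return 1;
	bp->refcount = 1;	/* the cache's own reference */
	object_dirty(&obj, bp);	/* refcount 2: reclaim cannot free it */
	buf_rele(bp);		/* cache lets go; the dirty pin remains */
	object_clean(&obj);	/* last reference dropped, buffer freed */
	return 0;
}

The design point is that the expensive buffer read happens once, in
transaction context where blocking is safe, instead of repeatedly in the
AIL pushing context where it stalls memory reclaim.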
From 48d55e2ae3ce837598c073995bbbac5d24a35fe1 Mon Sep 17 00:00:00 2001
From: Dave Chinner
Date: Mon, 29 Jun 2020 14:49:18 -0700
Subject: xfs: attach inodes to the cluster buffer when dirtied

Rather than attach inodes to the cluster buffer just when we are
doing IO, attach the inodes to the cluster buffer when they are
dirtied. This means the buffer always carries a list of dirty inodes
that reference it, and we can use that list to make more fundamental
changes to inode writeback that aren't otherwise possible.

Signed-off-by: Dave Chinner
Reviewed-by: Darrick J. Wong
Reviewed-by: Brian Foster
Signed-off-by: Darrick J. Wong
---
 fs/xfs/libxfs/xfs_trans_inode.c |  9 ++++++---
 fs/xfs/xfs_buf_item.c           |  1 +
 fs/xfs/xfs_icache.c             |  1 +
 fs/xfs/xfs_inode.c              | 24 +++++-------------------
 fs/xfs/xfs_inode_item.c         | 16 ++++++++++++++--
 5 files changed, 27 insertions(+), 24 deletions(-)
(limited to 'fs/xfs/libxfs/xfs_trans_inode.c')

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index ad5974365c58..e15129647e00 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -163,13 +163,16 @@ xfs_trans_log_inode(
 		/*
 		 * We need an explicit buffer reference for the log item but
 		 * don't want the buffer to remain attached to the transaction.
-		 * Hold the buffer but release the transaction reference.
+		 * Hold the buffer but release the transaction reference once
+		 * we've attached the inode log item to the buffer log item
+		 * list.
 		 */
 		xfs_buf_hold(bp);
-		xfs_trans_brelse(tp, bp);
-
 		spin_lock(&iip->ili_lock);
 		iip->ili_item.li_buf = bp;
+		bp->b_flags |= _XBF_INODES;
+		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
+		xfs_trans_brelse(tp, bp);
 	}

 	/*
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index ecb3362395af..e9428c30862a 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -465,6 +465,7 @@ xfs_buf_item_unpin(
 		if (bip->bli_flags & XFS_BLI_STALE_INODE) {
 			xfs_buf_item_done(bp);
 			xfs_iflush_done(bp);
+			ASSERT(list_empty(&bp->b_li_list));
 		} else {
 			xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
 			xfs_buf_item_relse(bp);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index dc90a81abb1a..58a750ce689c 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -115,6 +115,7 @@ __xfs_inode_free(
 {
 	/* asserts to verify all state is correct here */
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
+	ASSERT(!ip->i_itemp || list_empty(&ip->i_itemp->ili_item.li_bio_list));
 	XFS_STATS_DEC(ip->i_mount, vn_active);

 	call_rcu(&VFS_I(ip)->i_rcu, xfs_inode_free_callback);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 1c3a8bed4875..c4586ac3656a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2584,27 +2584,24 @@ retry:
 			ASSERT(iip->ili_last_fields);
 			goto out_iunlock;
 		}
-		ASSERT(!iip || list_empty(&iip->ili_item.li_bio_list));

 		/*
-		 * Clean inodes can be released immediately. Everything else has to go
-		 * through xfs_iflush_abort() on journal commit as the flock
-		 * synchronises removal of the inode from the cluster buffer against
-		 * inode reclaim.
+		 * Inodes not attached to the buffer can be released immediately.
+		 * Everything else has to go through xfs_iflush_abort() on journal
+		 * commit as the flock synchronises removal of the inode from the
+		 * cluster buffer against inode reclaim.
 		 */
-		if (xfs_inode_clean(ip)) {
+		if (!iip || list_empty(&iip->ili_item.li_bio_list)) {
 			xfs_ifunlock(ip);
 			goto out_iunlock;
 		}

 		/* we have a dirty inode in memory that has not yet been flushed. */
-		ASSERT(iip->ili_fields);
 		spin_lock(&iip->ili_lock);
 		iip->ili_last_fields = iip->ili_fields;
 		iip->ili_fields = 0;
 		iip->ili_fsync_fields = 0;
 		spin_unlock(&iip->ili_lock);
-		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
 		ASSERT(iip->ili_last_fields);

 	out_iunlock:
@@ -3818,19 +3815,8 @@ flush_out:
 	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
 				&iip->ili_item.li_lsn);

-	/*
-	 * Attach the inode item callback to the buffer whether the flush
-	 * succeeded or not. If not, the caller will shut down and fail I/O
-	 * completion on the buffer to remove the inode from the AIL and release
-	 * the flush lock.
-	 */
-	bp->b_flags |= _XBF_INODES;
-	list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
-
 	/* generate the checksum. */
 	xfs_dinode_calc_crc(mp, dip);
-
-	ASSERT(!list_empty(&bp->b_li_list));
 	return error;
 }

diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 64bdda72f7b2..697248b7eb2b 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -660,6 +660,10 @@ xfs_inode_item_destroy(
  * list for other inodes that will run this function. We remove them from the
  * buffer list so we can process all the inode IO completions in one AIL lock
  * traversal.
+ *
+ * Note: Now that we attach the log item to the buffer when we first log the
+ * inode in memory, we can have unflushed inodes on the buffer list here. These
+ * inodes will have a zero ili_last_fields, so skip over them here.
  */
 void
 xfs_iflush_done(
@@ -677,12 +681,15 @@ xfs_iflush_done(
 	 */
 	list_for_each_entry_safe(lip, n, &bp->b_li_list, li_bio_list) {
 		iip = INODE_ITEM(lip);
+
 		if (xfs_iflags_test(iip->ili_inode, XFS_ISTALE)) {
-			list_del_init(&lip->li_bio_list);
 			xfs_iflush_abort(iip->ili_inode);
 			continue;
 		}

+		if (!iip->ili_last_fields)
+			continue;
+
 		list_move_tail(&lip->li_bio_list, &tmp);

 		/* Do an unlocked check for needing the AIL lock. */
@@ -728,12 +735,16 @@ xfs_iflush_done(
 		/*
 		 * Remove the reference to the cluster buffer if the inode is
 		 * clean in memory. Drop the buffer reference once we've dropped
-		 * the locks we hold.
+		 * the locks we hold. If the inode is dirty in memory, we need
+		 * to put the inode item back on the buffer list for another
+		 * pass through the flush machinery.
 		 */
 		ASSERT(iip->ili_item.li_buf == bp);
 		if (!iip->ili_fields) {
 			iip->ili_item.li_buf = NULL;
 			drop_buffer = true;
+		} else {
+			list_add(&lip->li_bio_list, &bp->b_li_list);
 		}
 		iip->ili_last_fields = 0;
 		iip->ili_flush_lsn = 0;
@@ -777,6 +788,7 @@ xfs_iflush_abort(
 		iip->ili_flush_lsn = 0;
 		bp = iip->ili_item.li_buf;
 		iip->ili_item.li_buf = NULL;
+		list_del_init(&iip->ili_item.li_bio_list);
 		spin_unlock(&iip->ili_lock);
 	}
 	xfs_ifunlock(ip);
-- cgit v1.2.3
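The list discipline this final patch establishes - attach on first
dirtying, then have IO completion skip never-flushed items, re-queue
redirtied items, and detach only fully clean ones - is easy to model in
isolation. The sketch below is an illustrative userspace analogue with an
invented singly-linked list; it is not the kernel's list_head machinery.

#include <stddef.h>

struct item {
	struct item	*next;		/* models li_bio_list linkage */
	unsigned int	fields;		/* nonzero: dirty (again) in memory */
	unsigned int	last_fields;	/* nonzero: a flush is in progress */
};

struct buffer {
	struct item	*dirty_list;	/* models bp->b_li_list */
};

/* attach at first dirtying (kernel: xfs_trans_log_inode) */
static void item_attach(struct buffer *bp, struct item *ip)
{
	ip->next = bp->dirty_list;
	bp->dirty_list = ip;
}

/*
 * IO completion (kernel: xfs_iflush_done): skip items that were dirtied
 * but never flushed, keep redirtied items on the list for another flush
 * pass, and detach items that are now fully clean.
 */
static void buffer_io_done(struct buffer *bp)
{
	struct item **pp = &bp->dirty_list;

	while (*pp) {
		struct item *ip = *pp;

		if (!ip->last_fields) {
			pp = &ip->next;		/* never flushed: skip */
			continue;
		}
		ip->last_fields = 0;		/* this flush is complete */
		if (ip->fields) {
			pp = &ip->next;		/* redirtied: keep attached */
		} else {
			*pp = ip->next;		/* clean: drop off the list */
			ip->next = NULL;
		}
	}
}

int main(void)
{
	struct buffer bp = { NULL };
	struct item a = { NULL, 0, 0 }, b = { NULL, 0, 0 };

	item_attach(&bp, &a);
	a.last_fields = 1;	/* a's flush is in flight */
	item_attach(&bp, &b);	/* b is dirty but not yet flushed */
	buffer_io_done(&bp);	/* a comes off the list, b stays on it */
	return 0;
}

Because the buffer now always knows every dirty inode that references it,
later work can flush or invalidate all of them in one buffer-centric pass
instead of discovering them one inode at a time.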