diff options
author | Filipe Manana <fdmanana@suse.com> | 2025-08-06 12:11:31 +0100 |
---|---|---|
committer | Greg Kroah-Hartman <gregkh@linuxfoundation.org> | 2025-09-09 18:58:02 +0200 |
commit | 3d9c5e1512422962c24d4e23d5aae36b354af7fa (patch) | |
tree | 89e8d74e472aa35580e3c85d8bd72852c1f0a568 | |
parent | 37c491006e5396e33c1a3c73290e6b4a4b04f2d8 (diff) |
btrfs: fix race between setting last_dir_index_offset and inode logging
[ Upstream commit 59a0dd4ab98970086fd096281b1606c506ff2698 ]
At inode_logged() if we find that the inode was not logged before we
update its ->last_dir_index_offset to (u64)-1 with the goal that the
next directory log operation will see the (u64)-1 and then figure out
it must check what was the index of the last logged dir index key and
update ->last_dir_index_offset to that key's offset (this is done in
update_last_dir_index_offset()).
This however has a possibility for a time window where a race can happen
and lead to directory logging skipping dir index keys that should be
logged. The race happens like this:
1) Task A calls inode_logged(), sees ->logged_trans as 0 and then checks
that the inode item was logged before, but before it sets the inode's
->last_dir_index_offset to (u64)-1...
2) Task B is at btrfs_log_inode() which calls inode_logged() early, and
that has set ->last_dir_index_offset to (u64)-1;
3) Task B then enters log_directory_changes() which calls
update_last_dir_index_offset(). There it sees ->last_dir_index_offset
is (u64)-1 and that the inode was logged before (ctx->logged_before is
true), and so it searches for the last logged dir index key in the log
tree and it finds that it has an offset (index) value of N, so it sets
->last_dir_index_offset to N, so that we can skip index keys that are
less than or equal to N (later at process_dir_items_leaf());
4) Task A now sets ->last_dir_index_offset to (u64)-1, undoing the update
that task B just did;
5) Task B will now skip every index key when it enters
process_dir_items_leaf(), since ->last_dir_index_offset is (u64)-1.
Fix this by making inode_logged() not touch ->last_dir_index_offset and
initializing it to 0 when an inode is loaded (at btrfs_alloc_inode()) and
then having update_last_dir_index_offset() treat a value of 0 as meaning
we must check the log tree and update with the index of the last logged
index key. This is fine since the minimum possible value for
->last_dir_index_offset is 1 (BTRFS_DIR_START_INDEX - 1 = 2 - 1 = 1).
This also simplifies the management of ->last_dir_index_offset and now
all accesses to it are done under the inode's log_mutex.
Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-rw-r--r-- | fs/btrfs/btrfs_inode.h | 2 | ||||
-rw-r--r-- | fs/btrfs/inode.c | 1 | ||||
-rw-r--r-- | fs/btrfs/tree-log.c | 17 |
3 files changed, 4 insertions, 16 deletions
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index db53a3263fbd..89f582612fb9 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -247,7 +247,7 @@ struct btrfs_inode { u64 new_delalloc_bytes; /* * The offset of the last dir index key that was logged. - * This is used only for directories. + * This is used only for directories. Protected by 'log_mutex'. */ u64 last_dir_index_offset; }; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index f84e3f9fad84..98d087a14be5 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7767,6 +7767,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb) ei->last_sub_trans = 0; ei->logged_trans = 0; ei->delalloc_bytes = 0; + /* new_delalloc_bytes and last_dir_index_offset are in a union. */ ei->new_delalloc_bytes = 0; ei->defrag_bytes = 0; ei->disk_i_size = 0; diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index bf5d7e52467a..dc4c9fb0c011 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3448,19 +3448,6 @@ static int inode_logged(const struct btrfs_trans_handle *trans, inode->logged_trans = trans->transid; spin_unlock(&inode->lock); - /* - * If it's a directory, then we must set last_dir_index_offset to the - * maximum possible value, so that the next attempt to log the inode does - * not skip checking if dir index keys found in modified subvolume tree - * leaves have been logged before, otherwise it would result in attempts - * to insert duplicate dir index keys in the log tree. This must be done - * because last_dir_index_offset is an in-memory only field, not persisted - * in the inode item or any other on-disk structure, so its value is lost - * once the inode is evicted. - */ - if (S_ISDIR(inode->vfs_inode.i_mode)) - inode->last_dir_index_offset = (u64)-1; - return 1; } @@ -4052,7 +4039,7 @@ done: /* * If the inode was logged before and it was evicted, then its - * last_dir_index_offset is (u64)-1, so we don't the value of the last index + * last_dir_index_offset is 0, so we don't know the value of the last index * key offset. If that's the case, search for it and update the inode. This * is to avoid lookups in the log tree every time we try to insert a dir index * key from a leaf changed in the current transaction, and to allow us to always @@ -4068,7 +4055,7 @@ static int update_last_dir_index_offset(struct btrfs_inode *inode, lockdep_assert_held(&inode->log_mutex); - if (inode->last_dir_index_offset != (u64)-1) + if (inode->last_dir_index_offset != 0) return 0; if (!ctx->logged_before) { |