summaryrefslogtreecommitdiff
path: root/fs/btrfs/send.c
AgeCommit message (Collapse)Author
2022-12-05btrfs: use a structure to pass arguments to backref walking functionsFilipe Manana
The public backref walking functions have quite a lot of arguments that are passed down the call stack to find_parent_nodes(), the core function of the backref walking code. The next patches in series will need to add even arguments to these functions that should be passed not only to find_parent_nodes(), but also to other functions used by the later (directly or even lower in the call stack). So create a structure to hold all these arguments and state used by the main backref walking function, find_parent_nodes(), and use it as the argument for the public backref walking functions iterate_extent_inodes(), btrfs_find_all_leafs() and btrfs_find_all_roots(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use a single argument for extent offset in backref walking functionsFilipe Manana
The interface for find_parent_nodes() has two extent offset related arguments: 1) One u64 pointer argument for the extent offset; 2) One boolean argument to tell if the extent offset should be ignored or not. These are confusing, becase the extent offset pointer can be NULL and in some cases callers pass a NULL value as a way to tell the backref walking code to ignore offsets in file extent items (and simply consider all file extent items that point to the target data extent). The boolean argument was added in commit c995ab3cda3f ("btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents"), but it was never really necessary, it was enough if it could find a way to get a NULL value passed to the "extent_item_pos" argument of find_parent_nodes(). The arguments are also passed to functions called by find_parent_nodes() and respective helper functions, which further makes everything more complicated than needed. Then we have several backref walking related functions that end up calling find_parent_nodes(), either directly or through some other function that they call, and for many we have to use an "extent_item_pos" (u64) argument and a boolean "ignore_offset" argument too. This is confusing and not really necessary. So use a single argument to specify the extent offset, as a simple u64 and not as a pointer, but using a special value of (u64)-1, defined as a documented constant, to indicate when the extent offset should be ignored. This is also preparation work for the upcoming patches in the series that add other arguments to find_parent_nodes() and other related functions that use it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: send: optimize clone detection to increase extent sharingFilipe Manana
Currently send does not do the best decisions when it comes to decide between multiple clone sources, which results in clone operations for partial extent ranges, which has the following disadvantages: 1) We get less shared extents at the destination; 2) We have to read more data during the send operation and emit more write commands. Besides not being optimal behaviour, it also breaks user expectations and is often reported by users, with a recent example in the Link tag at the bottom of this change log. Part of the reason for this non-optimal behaviour is that the backref walking code does not provide information about the length of the file extent items that were found for each backref, so send is blind about which backref is the best to chose as a cloning source. The other existing reasons are just silliness, namely always prefering the inode with the lowest number when multiple are found for the same root and when we can clone from multiple roots, always prefer the send root over any of the other clone roots. This does not make any sense since any inode or root is fine and as good as any other inode/root. Fix this by making backref walking pass information about the number of bytes referenced by each file extent item and then have send's backref callback pick the inode with the highest number of bytes for each root. Finally select the root from which we can clone more bytes from. Example reproducer: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo cp --reflink=always $MNT/foo $MNT/bar cp --reflink=always $MNT/foo $MNT/baz sync # Overwrite the second half of file foo. xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo sync echo echo "*** fiemap in the original filesystem ***" echo xfs_io -c "fiemap -v" $MNT/foo xfs_io -c "fiemap -v" $MNT/bar xfs_io -c "fiemap -v" $MNT/baz echo btrfs filesystem du $MNT btrfs subvolume snapshot -r $MNT $MNT/snap btrfs send -f /tmp/send_stream $MNT/snap umount $MNT mkfs.btrfs -f $DEV &> /dev/null mount $DEV $MNT btrfs receive -f /tmp/send_stream $MNT echo echo "*** fiemap in the new filesystem ***" echo xfs_io -r -c "fiemap -v" $MNT/snap/foo xfs_io -r -c "fiemap -v" $MNT/snap/bar xfs_io -r -c "fiemap -v" $MNT/snap/baz echo btrfs filesystem du $MNT rm -f /tmp/send_stream rm -f /tmp/snap.fssum umount $MNT Before this change: $ ./test.sh (...) *** fiemap in the original filesystem *** /mnt/sdi/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 Total Exclusive Set shared Filename 2.00MiB 1.00MiB - /mnt/sdi/foo 2.00MiB 0.00B - /mnt/sdi/bar 2.00MiB 0.00B - /mnt/sdi/baz 6.00MiB 1.00MiB 2.00MiB /mnt/sdi Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap' At subvol /mnt/sdi/snap At subvol snap *** fiemap in the new filesystem *** /mnt/sdi/snap/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/snap/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 32768..34815 2048 0x1 Total Exclusive Set shared Filename 2.00MiB 0.00B - /mnt/sdi/snap/foo 2.00MiB 1.00MiB - /mnt/sdi/snap/bar 2.00MiB 1.00MiB - /mnt/sdi/snap/baz 6.00MiB 2.00MiB - /mnt/sdi/snap 6.00MiB 2.00MiB 2.00MiB /mnt/sdi We end up with two 1M extents that are not shared for files bar and baz. After this change: $ ./test.sh (...) *** fiemap in the original filesystem *** /mnt/sdi/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 Total Exclusive Set shared Filename 2.00MiB 1.00MiB - /mnt/sdi/foo 2.00MiB 0.00B - /mnt/sdi/bar 2.00MiB 0.00B - /mnt/sdi/baz 6.00MiB 1.00MiB 2.00MiB /mnt/sdi Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap' At subvol /mnt/sdi/snap At subvol snap *** fiemap in the new filesystem *** /mnt/sdi/snap/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x2001 /mnt/sdi/snap/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x2001 Total Exclusive Set shared Filename 2.00MiB 0.00B - /mnt/sdi/snap/foo 2.00MiB 0.00B - /mnt/sdi/snap/bar 2.00MiB 0.00B - /mnt/sdi/snap/baz 6.00MiB 0.00B - /mnt/sdi/snap 6.00MiB 0.00B 3.00MiB /mnt/sdi Now there's a much better sharing, files bar and baz share 1M of the extent of file foo and the second extent of files bar and baz is shared between themselves. This will later be turned into a test case for fstests. Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: send: avoid unnecessary backref lookups when finding clone sourceFilipe Manana
At find_extent_clone(), unless we are given an inline extent, a file extent item that represents hole or an extent that starts beyond the i_size, we always do backref walking to look for clone sources, unless if we have more than SEND_MAX_EXTENT_REFS (64) known references on the extent. However if we know we only have one reference in the extent item and only one clone source (the send root), then it's pointless to do the backref walking to search for clone sources, as we can't clone from any other root. So skip the backref walking in that case. The following test was run on a non-debug kernel (Debian's default kernel config): $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT # Create an extent tree that's not too small and none of the # extents is shared. for ((i = 1; i <= 50000; i++)); do xfs_io -f -c "pwrite 0 4K" $MNT/file_$i > /dev/null echo -ne "\r$i files created..." done echo btrfs subvolume snapshot -r $MNT $MNT/snap start=$(date +%s%N) btrfs send $MNT/snap > /dev/null end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo -e "\nsend took $dur milliseconds" umount $MNT Before this change: send took 5389 milliseconds After this change: send took 4519 milliseconds (-16.1%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: send: drop unnecessary backref context field initializationsFilipe Manana
At find_extent_clone() we are initializing to zero the 'found_itself' and 'found' fields of the backref context before we use it but we have already initialized the structure to zeroes when we declared it on stack, so it's pointless to initialize those fields and they are unnecessarily increasing the object text size with two "mov" instructions (x86_64). Similarly make the 'extent_len' initialization more clear by using an if- -then-else instead of a double assignment to it in case the extent's end crosses the i_size boundary. Before this change: $ size fs/btrfs/send.o text data bss dec hex filename 68694 4252 16 72962 11d02 fs/btrfs/send.o After this change: $ size fs/btrfs/send.o text data bss dec hex filename 68678 4252 16 72946 11cf2 fs/btrfs/send.o Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: send: update comment at find_extent_clone()Filipe Manana
We have this unclear comment at find_extent_clone() about extents starting at a file offset greater than or equals to the i_size of the inode. It's not really informative and it's misleading, since it mentions the author found such extents with snapshots and large files. Such extents are a result of fallocate with FALLOC_FL_KEEP_SIZE and there is no relation to snapshots or large files (all write paths update the i_size before inserting a new file extent item). So update the comment to be precise about it and why we don't bother looking for clone sources in that case. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: send: avoid unnecessary path allocations when finding extent cloneFilipe Manana
When looking for an extent clone, at find_extent_clone(), we start by allocating a path and then check for cases where we can't have clones and exit immediately in those cases. It's a waste of time to allocate the path before those cases, so reorder the logic so that we check for those cases before allocating the path. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move verity prototypes into verity.hJosef Bacik
Move these out of ctree.h into verity.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move ioctl prototypes into ioctl.hJosef Bacik
Move these out of ctree.h into ioctl.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move file-item prototypes into their own headerJosef Bacik
Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move dir-item prototypes into dir-item.hJosef Bacik
Move these prototypes out of ctree.h and into their own header file. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: extend btrfs_dir_item type to store encryption statusOmar Sandoval
For directories with encrypted files/filenames, we need to store a flag indicating this fact. There's no room in other fields, so we'll need to borrow a bit from dir_type. Since it's now a combination of type and flags, we rename it to dir_flags to reflect its new usage. The new flag, FT_ENCRYPTED, indicates a directory containing encrypted data, which is orthogonal to file type; therefore, add the new flag, and make conversion from directory type to file type strip the flag. As the file types almost never change we can afford to use the bits. Actual usage will be guarded behind an incompat bit, this patch only adds the support for later use by fscrypt. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use struct fscrypt_str instead of struct qstrSweet Tea Dorminy
While struct qstr is more natural without fscrypt, since it's provided by dentries, struct fscrypt_str is provided by the fscrypt handlers processing dentries, and is thus more natural in the fscrypt world. Replace all of the struct qstr uses with struct fscrypt_str. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use struct qstr instead of name and namelen pairsSweet Tea Dorminy
Many functions throughout btrfs take name buffer and name length arguments. Most of these functions at the highest level are usually called with these arguments extracted from a supplied dentry's name. But the entire name can be passed instead, making each function a little more elegant. Each function whose arguments are currently the name and length extracted from a dentry is herein converted to instead take a pointer to the name in the dentry. The couple of calls to these calls without a struct dentry are converted to create an appropriate qstr to pass in. Additionally, every function which is only called with a name/len extracted directly from a qstr is also converted. This change has positive effect on stack consumption, frame of many functions is reduced but this will be used in the future for fscrypt related structures. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move accessor helpers into accessors.hJosef Bacik
This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: send add define for v2 buffer sizeWang Yugui
Add a define for the data buffer size (though the maximum size is not limited by it) BTRFS_SEND_BUF_SIZE_V2 so it's more visible. Signed-off-by: Wang Yugui <wangyugui@e16-tech.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-21btrfs: send: avoid unaligned encoded writes when attempting to clone rangeFilipe Manana
When trying to see if we can clone a file range, there are cases where we end up sending two write operations in case the inode from the source root has an i_size that is not sector size aligned and the length from the current offset to its i_size is less than the remaining length we are trying to clone. Issuing two write operations when we could instead issue a single write operation is not incorrect. However it is not optimal, specially if the extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed to the send ioctl. In that case we can end up sending an encoded write with an offset that is not sector size aligned, which makes the receiver fallback to decompressing the data and writing it using regular buffered IO (so re-compressing the data in case the fs is mounted with compression enabled), because encoded writes fail with -EINVAL when an offset is not sector size aligned. The following example, which triggered a bug in the receiver code for the fallback logic of decompressing + regular buffer IO and is fixed by the patchset referred in a Link at the bottom of this changelog, is an example where we have the non-optimal behaviour due to an unaligned encoded write: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV > /dev/null mount -o compress $DEV $MNT # File foo has a size of 33K, not aligned to the sector size. xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar # Now clone the first 32K of file bar into foo at offset 0. xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo # Snapshot the default subvolume and create a full send stream (v2). btrfs subvolume snapshot -r $MNT $MNT/snap btrfs send --compressed-data -f /tmp/test.send $MNT/snap echo -e "\nFile bar in the original filesystem:" od -A d -t x1 $MNT/snap/bar umount $MNT mkfs.btrfs -f $DEV > /dev/null mount $DEV $MNT echo -e "\nReceiving stream in a new filesystem..." btrfs receive -f /tmp/test.send $MNT echo -e "\nFile bar in the new filesystem:" od -A d -t x1 $MNT/snap/bar umount $MNT Before this patch, the send stream included one regular write and one encoded write for file 'bar', with the later being not sector size aligned and causing the receiver to fallback to decompression + buffered writes. The output of the btrfs receive command in verbose mode (-vvv): (...) mkfile o258-7-0 rename o258-7-0 -> bar utimes clone bar - source=foo source offset=0 offset=0 length=32768 write bar - offset=32768 length=1024 encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0 encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument") (...) This patch avoids the regular write followed by an unaligned encoded write so that we end up sending a single encoded write that is aligned. So after this patch the stream content is (output of btrfs receive -vvv): (...) mkfile o258-7-0 rename o258-7-0 -> bar utimes clone bar - source=foo source offset=0 offset=0 length=32768 encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0 (...) So we get more optimal behaviour and avoid the silent data loss bug in versions of btrfs-progs affected by the bug referred by the Link tag below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1). Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-01btrfs: send: Proactively round up to kmalloc bucket sizeKees Cook
Instead of discovering the kmalloc bucket size _after_ allocation, round up proactively so the allocation is explicitly made for the full size, allowing the compiler to correctly reason about the resulting size of the buffer through the existing __alloc_size() hint. Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: linux-btrfs@vger.kernel.org Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/lkml/20220922133014.GI32411@suse.cz Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220923202822.2667581-8-keescook@chromium.org
2022-10-24btrfs: send: fix send failure of a subcase of orphan inodesBingJing Chang
Commit 9ed0a72e5b35 ("btrfs: send: fix failures when processing inodes with no links") tries to fix all incremental send cases of orphan inodes the send operation will meet. However, there's still a bug causing the corner subcase fails with a ENOENT error. Here's shortened steps of that subcase: $ btrfs subvolume create vol $ touch vol/foo $ btrfs subvolume snapshot -r vol snap1 $ btrfs subvolume snapshot -r vol snap2 # Turn the second snapshot to RW mode and delete the file while # holding an open file descriptor on it $ btrfs property set snap2 ro false $ exec 73<snap2/foo $ rm snap2/foo # Set the second snapshot back to RO mode and do an incremental send # with an unusal reverse order $ btrfs property set snap2 ro true $ btrfs send -p snap2 snap1 > /dev/null At subvol snap1 ERROR: send ioctl failed with -2: No such file or directory It's subcase 3 of BTRFS_COMPARE_TREE_CHANGED in the commit 9ed0a72e5b35 ("btrfs: send: fix failures when processing inodes with no links"). And it's not a common case. We still have not met it in the real world. Theoretically, this case can happen in a batch cascading snapshot backup. In cascading backups, the receive operation in the middle may cause orphan inodes to appear because of the open file descriptors on the snapshot files during receiving. And if we don't do the batch snapshot backups in their creation order, then we can have an inode, which is an orphan in the parent snapshot but refers to a file in the send snapshot. Since an orphan inode has no paths, the send operation will fail with a ENOENT error if it tries to generate a path for it. In that patch, this subcase will be treated as an inode with a new generation. However, when the routine tries to delete the old paths in the parent snapshot, the function process_all_refs() doesn't check whether there are paths recorded or not before it calls the function process_recorded_refs(). And the function process_recorded_refs() try to get the first path in the parent snapshot in the beginning. Since it has no paths in the parent snapshot, the send operation fails. To fix this, we can easily put a link count check to avoid entering the deletion routine like what we do a link count check to avoid creating a new one. Moreover, we can assume that the function process_all_refs() can always collect references to process because we know it has a positive link count. Fixes: 9ed0a72e5b35 ("btrfs: send: fix failures when processing inodes with no links") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-10-11btrfs: send: update command for protocol version checkDavid Sterba
For a protocol and command compatibility we have a helper that hasn't been updated for v3 yet. We use it for verity so update where necessary. Fixes: 38622010a6de ("btrfs: send: add support for fs-verity") Signed-off-by: David Sterba <dsterba@suse.com>
2022-10-11btrfs: send: allow protocol version 3 with CONFIG_BTRFS_DEBUGBoris Burkov
We haven't finalized send stream v3 yet, so gate the send stream version behind CONFIG_BTRFS_DEBUG as we want some way to test it. The original verity send did not check the protocol version, so add that actual protection as well. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: send: fix failures when processing inodes with no linksBingJing Chang
There is a bug causing send failures when processing an orphan directory with no links. In commit 46b2f4590aab ("Btrfs: fix send failure when root has deleted files still open")', the orphan inode issue was addressed. The send operation fails with a ENOENT error because of any attempts to generate a path for the inode with a link count of zero. Therefore, in that patch, sctx->ignore_cur_inode was introduced to be set if the current inode has a link count of zero for bypassing some unnecessary steps. And a helper function btrfs_unlink_all_paths() was introduced and called to clean up old paths found in the parent snapshot. However, not only regular files but also directories can be orphan inodes. So if the send operation meets an orphan directory, it will issue a wrong unlink command for that directory now. Soon the receive operation fails with a EISDIR error. Besides, the send operation also fails with a ENOENT error later when it tries to generate a path of it. Similar example but making an orphan dir for an incremental send: $ btrfs subvolume create vol $ mkdir vol/dir $ touch vol/dir/foo $ btrfs subvolume snapshot -r vol snap1 $ btrfs subvolume snapshot -r vol snap2 # Turn the second snapshot to RW mode and delete the whole dir while # holding an open file descriptor on it. $ btrfs property set snap2 ro false $ exec 73<snap2/dir $ rm -rf snap2/dir # Set the second snapshot back to RO mode and do an incremental send. $ btrfs property set snap2 ro true $ mkdir receive_dir $ btrfs send snap2 -p snap1 | btrfs receive receive_dir/ At subvol snap2 At snapshot snap2 ERROR: send ioctl failed with -2: No such file or directory ERROR: unlink dir failed. Is a directory Actually, orphan inodes are more common use cases in cascading backups. (Please see the illustration below.) In a cascading backup, a user wants to replicate a couple of snapshots from Machine A to Machine B and from Machine B to Machine C. Machine B doesn't take any RO snapshots for sending. All a receiver does is create an RW snapshot of its parent snapshot, apply the send stream and turn it into RO mode at the end. Even if all paths of some inodes are deleted in applying the send stream, these inodes would not be deleted and become orphans after changing the subvolume from RW to RO. Moreover, orphan inodes can occur not only in send snapshots but also in parent snapshots because Machine B may do a batch replication of a couple of snapshots. An illustration for cascading backups: Machine A (snapshot {1..n}) --> Machine B --> Machine C The idea to solve the problem is to delete all the items of orphan inodes before using these snapshots for sending. I used to think that the reasonable timing for doing that is during the ioctl of changing the subvolume from RW to RO because it sounds good that we will not modify the fs tree of a RO snapshot anymore. However, attempting to do the orphan cleanup in the ioctl would be pointless. Because if someone is holding an open file descriptor on the inode, the reference count of the inode will never drop to 0. Then iput() cannot trigger eviction, which finally deletes all the items of it. So we try to extend the original patch to handle orphans in send/parent snapshots. Here are several cases that need to be considered: Case 1: BTRFS_COMPARE_TREE_NEW | send snapshot | action -------------------------------- nlink | 0 | ignore In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result, it means that a new inode is found in the send snapshot and it doesn't appear in the parent snapshot. Since this inode has a link count of zero (It's an orphan and there're no paths for it.), we can leverage sctx->ignore_cur_inode in the original patch to prevent it from being created. Case 2: BTRFS_COMPARE_TREE_DELETED | parent snapshot | action ---------------------------------- nlink | 0 | as usual In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison result, it means that the inode only appears in the parent snapshot. As usual, the send operation will try to delete all its paths. However, this inode has a link count of zero, so no paths of it will be found. No deletion operations will be issued. We don't need to change any logic. Case 3: BTRFS_COMPARE_TREE_CHANGED | | parent snapshot | send snapshot | action ----------------------------------------------------------------------- subcase 1 | nlink | 0 | 0 | ignore subcase 2 | nlink | >0 | 0 | new_gen(deletion) subcase 3 | nlink | 0 | >0 | new_gen(creation) In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result, it means that the inode appears in both snapshots. Here are 3 subcases. First, when the inode has link counts of zero in both snapshots. Since there are no paths for this inode in (source/destination) parent snapshots and we don't care about whether there is also an orphan inode in destination or not, we can set sctx->ignore_cur_inode on to prevent it from being created. For the second and the third subcases, if there are paths in one snapshot and there're no paths in the other snapshot for this inode. We can treat this inode as a new generation. We can also leverage the logic handling a new generation of an inode with small adjustments. Then it will delete all old paths and create a new inode with new attributes and paths only when there's a positive link count in the send snapshot. In subcase 2, the send operation only needs to delete all old paths as in the parent snapshot. But it may require more operations for a directory to remove its old paths. If a not-empty directory is going to be deleted (because it has a link count of zero in the send snapshot) but there are files/directories with bigger inode numbers under it, the send operation will need to rename it to its orphan name first. After processing and deleting the last item under this directory, the send operation will check this directory, aka the parent directory of the last item, again and issue a rmdir operation to remove it finally. Therefore, we also need to treat inodes with a link count of zero as if they didn't exist in get_cur_inode_state(), which is used in process_recorded_refs(). By doing this, when checking a directory with orphan names after the last item under it has been deleted, the send operation now can properly issue a rmdir operation. Otherwise, without doing this, the orphan directory with an orphan name would be kept here at the end due to the existing inode with a link count of zero being found. In subcase 3, as in case 2, no old paths would be found, so no deletion operations will be issued. The send operation will only create a new one for that inode. Note that subcase 3 is not common. That's because it's easy to reduce the hard links of an inode, but once all valid paths are removed, there are no valid paths for creating other hard links. The only way to do that is trying to send an older snapshot after a newer snapshot has been sent. Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: send: refactor arguments of get_inode_info()BingJing Chang
Refactor get_inode_info() to populate all wanted fields on an output structure. Besides, also introduce a helper function called get_inode_gen(), which is commonly used. Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: send: add support for fs-verityBoris Burkov
Preserve the fs-verity status of a btrfs file across send/recv. There is no facility for installing the Merkle tree contents directly on the receiving filesystem, so we package up the parameters used to enable verity found in the verity descriptor. This gives the receive side enough information to properly enable verity again. Note that this means that receive will have to re-compute the whole Merkle tree, similar to how compression worked before encoded_write. Since the file becomes read-only after verity is enabled, it is important that verity is added to the send stream after any file writes. Therefore, when we process a verity item, merely note that it happened, then actually create the command in the send stream during 'finish_inode_if_needed'. This also creates V3 of the send stream format, without any format changes besides adding the new commands and attributes. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: always use the rbtree based inode ref management infrastructureFilipe Manana
After the patch "btrfs: send: fix sending link commands for existing file paths", we now have two infrastructures to detect and eliminate duplicated inode references (due to names that got removed and re-added between the send and parent snapshots): 1) One that works on a single inode ref/extref item; 2) A new one that works acrosss all ref/extref items for an inode, and it's also more efficient because even in the single ref/extref item case, it does not do a linear search for all the names encoded in the ref/extref item, it uses red black trees to speedup up the search. There's no good reason to keep both infrastructures, we can use the new one everywhere, and it's always more efficient. So remove the old infrastructure and change all sites that are using it to use the new one. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: fix sending link commands for existing file pathsBingJing Chang
There is a bug sending link commands for existing file paths. When we're processing an inode, we go over all references. All the new file paths are added to the "new_refs" list. And all the deleted file paths are added to the "deleted_refs" list. In the end, when we finish processing the inode, we iterate over all the items in the "new_refs" list and send link commands for those file paths. After that, we go over all the items in the "deleted_refs" list and send unlink commands for them. If there are duplicated file paths in both lists, we will try to create them before we remove them. Then the receiver gets an -EEXIST error when trying the link operations. Example for having duplicated file paths in both list: $ btrfs subvolume create vol # create a file and 2000 hard links to the same inode $ touch vol/foo $ for i in {1..2000}; do link vol/foo vol/$i ; done # take a snapshot for a parent snapshot $ btrfs subvolume snapshot -r vol snap1 # remove 2000 hard links and re-create the last 1000 links $ for i in {1..2000}; do rm vol/$i; done; $ for i in {1001..2000}; do link vol/foo vol/$i; done # take another one for a send snapshot $ btrfs subvolume snapshot -r vol snap2 $ mkdir receive_dir $ btrfs send snap2 -p snap1 | btrfs receive receive_dir/ At subvol snap2 link 1238 -> foo ERROR: link 1238 -> foo failed: File exists In this case, we will have the same file paths added to both lists. In the parent snapshot, reference paths {1..1237} are stored in inode references, but reference paths {1238..2000} are stored in inode extended references. In the send snapshot, all reference paths {1001..2000} are stored in inode references. During the incremental send, we process their inode references first. In record_changed_ref(), we iterate all its inode references in the send/parent snapshot. For every inode reference, we also use find_iref() to check whether the same file path also appears in the parent/send snapshot or not. Inode references {1238..2000} which appear in the send snapshot but not in the parent snapshot are added to the "new_refs" list. On the other hand, Inode references {1..1000} which appear in the parent snapshot but not in the send snapshot are added to the "deleted_refs" list. Next, when we process their inode extended references, reference paths {1238..2000} are added to the "deleted_refs" list because all of them only appear in the parent snapshot. Now two lists contain items as below: "new_refs" list: {1238..2000} "deleted_refs" list: {1..1000}, {1238..2000} Reference paths {1238..2000} appear in both lists. And as the processing order mentioned about before, the receiver gets an -EEXIST error when trying the link operations. To fix the bug, the idea is to process the "deleted_refs" list before the "new_refs" list. However, it's not easy to reshuffle the processing order. For one reason, if we do so, we may unlink all the existing paths first, there's no valid path anymore for links. And it's inefficient because we do a bunch of unlinks followed by links for the same paths. Moreover, it makes less sense to have duplications in both lists. A reference path cannot not only be regarded as new but also has been seen in the past, or we won't call it a new path. However, it's also not a good idea to make find_iref() check a reference against all inode references and all inode extended references because it may result in large disk reads. So we introduce two rbtrees to make the references easier for lookups. And we also introduce record_new_ref_if_needed() and record_deleted_ref_if_needed() for changed_ref() to check and remove duplicated references early. Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: introduce recorded_ref_alloc and recorded_ref_freeBingJing Chang
Introduce wrappers to allocate and free recorded_ref structures. Reviewed-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: add new command FILEATTR for file attributesDavid Sterba
There are file attributes inherited from previous ext2 SETFLAGS/GETFLAGS and later from XFLAGS interfaces, now commonly found under the 'fileattr' API. This corresponds to the individual inode bits and that's part of the on-disk format, so this is suitable for the protocol. The other interfaces contain a lot of cruft or bits that btrfs does not support yet. Currently the value is u64 and matches btrfs_inode_item. Not all the bits can be set by ioctls (like NODATASUM or READONLY), but we can send them over the protocol and leave it up to the receiving side what and how to apply. As some of the flags, eg. IMMUTABLE, can prevent any further changes, the receiving side needs to understand that and apply the changes in the right order, or possibly with some intermediate steps. This should be easier, future proof and simpler on the protocol layer than implementing in kernel. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: add OTIME as utimes attribute for proto 2+ by defaultDavid Sterba
When send v1 was introduced the otime (inode creation time) was not available, however the attribute in btrfs send protocol exists. Though it would be possible to add it for v1 too as the attribute would be ignored by v1 receive, let's not change the layout of v1 and only add that to v2+. The otime cannot be changed and is only informative. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: use boolean types for current inode statusDavid Sterba
The new, new_gen and deleted indicate a status, use boolean type instead of int. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: remove old TODO regarding ERESTARTSYSDavid Sterba
The whole send operation is restartable and handling properly a buffer write may not be easy. We can't know what caused that and if a short delay and retry will fix it or how many retries should be performed in case it's a temporary condition. The error value is returned to the ioctl caller so in case it's transient problem, the user would be notified about the reason. Remove the TODO note as there's no plan to handle ERESTARTSYS. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: simplify includesDavid Sterba
We don't need the whole ctree.h in send.h, none of the data types defined there are used. Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: enable support for stream v2 and compressed writesOmar Sandoval
Now that the new support is implemented, allow the ioctl to accept v2 and the compressed flag, and update the version in sysfs. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: send compressed extents with encoded writesOmar Sandoval
Now that all of the pieces are in place, we can use the ENCODED_WRITE command to send compressed extents when appropriate. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: get send buffer pages for protocol v2Omar Sandoval
For encoded writes in send v2, we will get the encoded data with btrfs_encoded_read_regular_fill_pages(), which expects a list of raw pages. To avoid extra buffers and copies, we should read directly into the send buffer. Therefore, we need the raw pages for the send buffer. We currently allocate the send buffer with kvmalloc(), which may return a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page(). However, the buffer size we use (144K) is not a power of two, which in theory is not guaranteed to return a page-aligned buffer, and in practice would waste a lot of memory due to rounding up to the next power of two. 144K is large enough that it usually gets allocated with vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc() and save the pages in an array. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: write larger chunks when using stream v2Omar Sandoval
The length field of the send stream TLV header is 16 bits. This means that the maximum amount of data that can be sent for one write is 64K minus one. However, encoded writes must be able to send the maximum compressed extent (128K) in one command, or more. To support this, send stream version 2 encodes the DATA attribute differently: it has no length field, and the length is implicitly up to the end of containing command (which has a 32bit length field). Although this is necessary for encoded writes, normal writes can benefit from it, too. Also add a check to enforce that the DATA attribute is last. It is only strictly necessary for v2, but we might as well make v1 consistent with it. For v2, let's bump up the send buffer to the maximum compressed extent size plus 16K for the other metadata (144K total). Since this will most likely be vmalloc'd (and always will be after the next commit), we round it up to the next page since we might as well use the rest of the page on systems with >16K pages. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: add stream v2 definitionsOmar Sandoval
This adds the definitions of the new commands for send stream version 2 and their respective attributes: fallocate, FS_IOC_SETFLAGS (a.k.a. chattr), and encoded writes. It also documents two changes to the send stream format in v2: the receiver shouldn't assume a maximum command size, and the DATA attribute is encoded differently to allow for writes larger than 64k. These will be implemented in subsequent changes, and then the ioctl will accept the new version and flag. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: explicitly number commands and attributesOmar Sandoval
Commit e77fbf990316 ("btrfs: send: prepare for v2 protocol") added _BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the version plus 1, but as written this creates gaps in the number space. The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2 has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that 23 and 24 are valid commands. Instead, let's explicitly number all of the commands, attributes, and sentinel MAX constants. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: send: remove unused send_ctx::{total,cmd}_send_sizeOmar Sandoval
We collect these statistics but have never exposed them in any way. I also didn't find any patches that ever attempted to make use of them. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-16Merge tag 'for-5.19-rc7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs reverts from David Sterba: "Due to a recent report [1] we need to revert the radix tree to xarray conversion patches. There's a problem with sleeping under spinlock, when xa_insert could allocate memory under pressure. We use GFP_NOFS so this is a real problem that we unfortunately did not discover during review. I'm sorry to do such change at rc6 time but the revert is IMO the safer option, there are patches to use mutex instead of the spin locks but that would need more testing. The revert branch has been tested on a few setups, all seem ok. The conversion to xarray will be revisited in the future" Link: https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ [1] * tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: turn delayed_nodes_tree into an XArray" Revert "btrfs: turn name_cache radix tree into XArray in send_ctx" Revert "btrfs: turn fs_info member buffer_radix into XArray" Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
2022-07-15Revert "btrfs: turn name_cache radix tree into XArray in send_ctx"David Sterba
This reverts commit 4076942021fe14efecae33bf98566df6dd5ae6f7. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-24Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull page cache updates from Matthew Wilcox: - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio * tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits) nilfs2: Fix some kernel-doc comments Appoint myself page cache maintainer fs: Remove aops->freepage secretmem: Convert to free_folio nfs: Convert to free_folio orangefs: Convert to free_folio fs: Add free_folio address space operation fs: Convert drop_buffers() to use a folio fs: Change try_to_free_buffers() to take a folio jbd2: Convert release_buffer_page() to use a folio jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio reiserfs: Convert release_buffer_page() to use a folio fs: Remove last vestiges of releasepage ubifs: Convert to release_folio reiserfs: Convert to release_folio orangefs: Convert to release_folio ocfs2: Convert to release_folio nilfs2: Remove comment about releasepage nfs: Convert to release_folio jfs: Convert to release_folio ...
2022-05-17btrfs: send: avoid trashing the page cacheFilipe Manana
A send operation reads extent data using the buffered IO path for getting extent data to send in write commands and this is both because it's simple and to make use of the generic readahead infrastructure, which results in a massive speedup. However this fills the page cache with data that, most of the time, is really only used by the send operation - once the write commands are sent, it's not useful to have the data in the page cache anymore. For large snapshots, bringing all data into the page cache eventually leads to the need to evict other data from the page cache that may be more useful for applications (and kernel subsystems). Even if extents are shared with the subvolume on which a snapshot is based on and the data is currently on the page cache due to being read through the subvolume, attempting to read the data through the snapshot will always result in bringing a new copy of the data into another location in the page cache (there's currently no shared memory for shared extents). So make send evict the data it has read before if when it first opened the inode, its mapping had no pages currently loaded: when inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding based on the return value of filemap_range_has_page() before reading an extent because the generic readahead mechanism may read pages beyond the range we request (and it very often does it), which means a call to filemap_range_has_page() will return true due to the readahead that was triggered when processing a previous extent - we don't have a simple way to distinguish this case from the case where the data was brought into the page cache through someone else. So checking for the mapping number of pages being 0 when we first open the inode is simple, cheap and it generally accomplishes the goal of not trashing the page cache - the only exception is if part of data was previously loaded into the page cache through the snapshot by some other process, in that case we end up not evicting any data send brings into the page cache, just like before this change - but that however is not the common case. Example scenario, on a box with 32G of RAM: $ btrfs subvolume create /mnt/sv1 $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1 $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1 $ free -m total used free shared buff/cache available Mem: 31937 186 26866 0 4883 31297 Swap: 8188 0 8188 # After this we get less 4G of free memory. $ btrfs send /mnt/snap1 >/dev/null $ free -m total used free shared buff/cache available Mem: 31937 186 22814 0 8935 31297 Swap: 8188 0 8188 The same, obviously, applies to an incremental send. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: send: keep the current inode open while processing itFilipe Manana
Every time we send a write command, we open the inode, read some data to a buffer and then close the inode. The amount of data we read for each write command is at most 48K, returned by max_send_read_size(), and that corresponds to: BTRFS_SEND_BUF_SIZE - 16K = 48K. In practice this does not add any significant overhead, because the time elapsed between every close (iput()) and open (btrfs_iget()) is very short, so the inode is kept in the VFS's cache after the iput() and it's still there by the time we do the next btrfs_iget(). As between processing extents of the current inode we don't do anything else, it makes sense to keep the inode open after we process its first extent that needs to be sent and keep it open until we start processing the next inode. This serves to facilitate the next change, which aims to avoid having send operations trash the page cache with data extents. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: turn name_cache radix tree into XArray in send_ctxGabriel Niebler
… and adjust all usages of this object to use the XArray API for the sake of consistency. XArray API provides array semantics, so it is notionally easier to use and understand, and it also takes care of locking for us. None of this makes a real difference in this particular patch, but it does in other places where similar replacements are or have been made and we want to be consistent in our usage of data structures in btrfs. Signed-off-by: Gabriel Niebler <gniebler@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: use btrfs_for_each_slot in btrfs_unlink_all_pathsGabriel Niebler
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: use btrfs_for_each_slot in process_all_extentsGabriel Niebler
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: use btrfs_for_each_slot in process_all_new_xattrsGabriel Niebler
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: use btrfs_for_each_slot in process_all_refsGabriel Niebler
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: use btrfs_for_each_slot in is_ancestorGabriel Niebler
This function can be simplified by refactoring to use the new iterator macro. No functional changes. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>