summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2016-06-25Merge branch 'for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I have a two part pull this time because one of the patches Dave Sterba collected needed to be against v4.7-rc2 or higher (we used rc4). I try to make my for-linus-xx branch testable on top of the last major so we can hand fixes to people on the list more easily, so I've split this pull in two. This first part has some fixes and two performance improvements that we've been testing for some time. Josef's two performance fixes are most notable. The transid tracking patch makes a big improvement on pretty much every workload" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: Force stripesize to the value of sectorsize btrfs: fix disk_i_size update bug when fallocate() fails Btrfs: fix error handling in map_private_extent_buffer Btrfs: fix error return code in btrfs_init_test_fs() Btrfs: don't do nocow check unless we have to btrfs: fix deadlock in delayed_ref_async_start Btrfs: track transid for delayed ref flushing
2016-06-25Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodesOmar Sandoval
Commit fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"") backed out the conversion to ->iterate_shared() for Btrfs because the delayed inode handling in btrfs_real_readdir() is racy. However, we can still do readdir in parallel if there are no delayed nodes. This is a temporary fix which upgrades the shared inode lock to an exclusive lock only when we have delayed items until we come up with a more complete solution. While we're here, rename the btrfs_{get,put}_delayed_items functions to make it very clear that they're just for readdir. Tested with xfstests and by doing a parallel kernel build: while make tinyconfig && make -j4 && git clean dqfx; do : done along with a bunch of parallel finds in another shell: while true; do for ((i=0; i<4; i++)); do find . >/dev/null & done wait done Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23Btrfs: Force stripesize to the value of sectorsizeChandan Rajendra
Btrfs code currently assumes stripesize to be same as sectorsize. However Btrfs-progs (until commit df05c7ed455f519e6e15e46196392e4757257305) has been setting btrfs_super_block->stripesize to a value of 4096. This commit makes sure that the value of btrfs_super_block->stripesize is a power of 2. Later, it unconditionally sets btrfs_root->stripesize to sectorsize. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23btrfs: fix disk_i_size update bug when fallocate() failsWang Xiaoguang
When doing truncate operation, btrfs_setsize() will first call truncate_setsize() to set new inode->i_size, but if later btrfs_truncate() fails, btrfs_setsize() will call "i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the inmemory inode size, now bug occurs. It's because for truncate case btrfs_ordered_update_i_size() directly uses inode->i_size to update BTRFS_I(inode)->disk_i_size, indeed we should use the "offset" argument to update disk_i_size. Here is the call graph: ==>btrfs_truncate() ====>btrfs_truncate_inode_items() ======>btrfs_ordered_update_i_size(inode, last_size, NULL); Here btrfs_ordered_update_i_size()'s offset argument is last_size. And below test case can reveal this bug: dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100 dev=$(losetup --show -f fs.img) mkdir -p /mnt/mntpoint mkfs.btrfs -f $dev mount $dev /mnt/mntpoint cd /mnt/mntpoint echo "workdir is: /mnt/mntpoint" blocksize=$((128 * 1024)) dd if=/dev/zero of=testfile bs=$blocksize count=1 sync count=$((17*1024*1024*1024/blocksize)) echo "file size is:" $((count*blocksize)) for ((i = 1; i <= $count; i++)); do i=$((i + 1)) dst_offset=$((blocksize * i)) xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\ testfile > /dev/null done sync truncate --size 0 testfile ls -l testfile du -sh testfile exit In this case, truncate operation will fail for enospc reason and "du -sh testfile" returns value greater than 0, but testfile's size is 0, we need to reflect correct inode->i_size. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23Btrfs: fix error handling in map_private_extent_bufferLiu Bo
map_private_extent_buffer() can return -EINVAL in two different cases, 1. when the requested contents span two pages if nodesize is larger than pagesize, 2. when it detects something insane. The 2nd one used to be only a WARN_ON(1), and we decided to return a error to callers, but we didn't fix up all its callers, which will be addressed by this patch. Without this, btrfs may end up with 'general protection', ie. reading invalid memory. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-23Btrfs: fix error return code in btrfs_init_test_fs()Wei Yongjun
Fix to return a negative error code from the kern_mount() error handling case instead of 0(ret is set to 0 by register_filesystem), as done elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22Btrfs: don't do nocow check unless we have toJosef Bacik
Before we write into prealloc/nocow space we have to make sure that there are no references to the extents we are writing into, which means checking the extent tree and csum tree in the case of nocow. So we don't want to do the nocow dance unless we can't reserve data space, since it's a serious drag on performance. With the following sequence fallocate -l10737418240 /mnt/btrfs-test/file cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \ --end_fsync=1 we get the worst case scenario where we have to fall back on to doing the check anyway. Without this patch lat (usec): min=5, max=111598, avg=27.65, stdev=124.51 write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec With this patch lat (usec): min=3, max=91210, avg=14.09, stdev=110.62 write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec We get twice the throughput, half of the runtime, and half of the average latency. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> [ PAGE_CACHE_ removal related fixups ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22btrfs: fix deadlock in delayed_ref_async_startChris Mason
"Btrfs: track transid for delayed ref flushing" was deadlocking on btrfs_attach_transaction because its not safe to call from the async delayed ref start code. This commit brings back btrfs_join_transaction instead and checks for a blocked commit. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-22Btrfs: track transid for delayed ref flushingJosef Bacik
Using the offwakecputime bpf script I noticed most of our time was spent waiting on the delayed ref throttling. This is what is supposed to happen, but sometimes the transaction can commit and then we're waiting for throttling that doesn't matter anymore. So change this stuff to be a little smarter by tracking the transid we were in when we initiated the throttling. If the transaction we get is different then we can just bail out. This resulted in a 50% speedup in my fs_mark test, and reduced the amount of time spent throttling by 60 seconds over the entire run (which is about 30 minutes). Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-06-18Merge branch 'for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "The most user visible change here is a fix for our recent superblock validation checks that were causing problems on non-4k pagesized systems" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize btrfs: remove build fixup for qgroup_account_snapshot btrfs: use new error message helper in qgroup_account_snapshot btrfs: avoid blocking open_ctree from cleaner_kthread Btrfs: don't BUG_ON() in btrfs_orphan_add btrfs: account for non-CoW'd blocks in btrfs_abort_transaction Btrfs: check if extent buffer is aligned to sectorsize btrfs: Use correct format specifier
2016-06-17Btrfs: btrfs_check_super_valid: Allow 4096 as stripesizeChandan Rajendra
Older btrfs-progs/mkfs.btrfs sets 4096 as the stripesize. Hence restricting stripesize to be equal to sectorsize would cause super block validation to return an error on architectures where PAGE_SIZE is not equal to 4096. Hence as a workaround, this commit allows stripesize to be set to 4096 bytes. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17btrfs: remove build fixup for qgroup_account_snapshotDavid Sterba
Introduced in 2c1984f244838477aab ("btrfs: build fixup for qgroup_account_snapshot") as temporary bisectability build fixup. Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17btrfs: use new error message helper in qgroup_account_snapshotDavid Sterba
We've renamed btrfs_std_error, this one is left from last merge. Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17btrfs: avoid blocking open_ctree from cleaner_kthreadZygo Blaxell
This fixes a problem introduced in commit 2f3165ecf103599f82bf0ea254039db335fb5005 "btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes". open_ctree eventually calls btrfs_replay_log which in turn calls btrfs_commit_super which tries to lock the cleaner_mutex, causing a recursive mutex deadlock during mount. Instead of playing whack-a-mole trying to keep up with all the functions that may want to lock cleaner_mutex, put all the cleaner_mutex lockers back where they were, and attack the problem more directly: keep cleaner_kthread asleep until the filesystem is mounted. When filesystems are mounted read-only and later remounted read-write, open_ctree did not set fs_info->open and neither does anything else. Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs nor cleaner_kthread get confused by the common case of "/" filesystem read-only mount followed by read-write remount. Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17Btrfs: don't BUG_ON() in btrfs_orphan_addJosef Bacik
This is just a screwup for developers, so change it to an ASSERT() so developers notice when things go wrong and deal with the error appropriately if ASSERT() isn't enabled. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17btrfs: account for non-CoW'd blocks in btrfs_abort_transactionJeff Mahoney
The test for !trans->blocks_used in btrfs_abort_transaction is insufficient to determine whether it's safe to drop the transaction handle on the floor. btrfs_cow_block, informed by should_cow_block, can return blocks that have already been CoW'd in the current transaction. trans->blocks_used is only incremented for new block allocations. If an operation overlaps the blocks in the current transaction entirely and must abort the transaction, we'll happily let it clean up the trans handle even though it may have modified the blocks and will commit an incomplete operation. In the long-term, I'd like to do closer tracking of when the fs is actually modified so we can still recover as gracefully as possible, but that approach will need some discussion. In the short term, since this is the only code using trans->blocks_used, let's just switch it to a bool indicating whether any blocks were used and set it when should_cow_block returns false. Cc: stable@vger.kernel.org # 3.4+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17Btrfs: check if extent buffer is aligned to sectorsizeLiu Bo
Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer via alloc_extent_buffer(). An unaligned eb can have more pages than it should have, which ends up extent buffer's leak or some corrupted content in extent buffer. This adds a warning to let us quickly know what was happening. Now that alloc_extent_buffer() no more returns NULL, this changes its caller and callers of its caller to match with the new error handling. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-17btrfs: Use correct format specifierHeinrich Schuchardt
Component mirror_num of struct btrfsic_block is defined as unsigned int. Use %u as format specifier. Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-10Merge branch 'for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Has some fixes and some new self tests for btrfs. The self tests are usually disabled in the .config file (unless you're doing btrfs dev work), and this bunch is meant to find problems with the 64K page size patches. Jeff has a patch to help people see if they are using the hardware assist crc32c module, which really helps us nail down problems when people ask why crcs are using so much CPU. Otherwise, it's small fixes" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize Btrfs: self-tests: Use macros instead of constants and add missing newline Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE btrfs: advertise which crc32c implementation is being used at module load Btrfs: add validadtion checks for chunk loading Btrfs: add more validation checks for superblock Btrfs: clear uptodate flags of pages in sys_array eb Btrfs: self-tests: Support non-4k page size Btrfs: Fix integer overflow when calculating bytes_per_bitmap Btrfs: test_check_exists: Fix infinite loop when searching for free space entries Btrfs: end transaction if we abort when creating uuid root btrfs: Use __u64 in exported linux/btrfs.h.
2016-06-08Merge branch 'misc-fixes-4.7' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.7
2016-06-08Merge branch 'for-chris' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.7
2016-06-07block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSHMike Christie
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block, drivers, fs: shrink bi_rw from long to intMike Christie
We don't need bi_rw to be so large on 64 bit archs, so reduce it to unsigned int. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07btrfs: use bio fields for op and flagsMike Christie
The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is no need to pass around the rq_flag_bits bits too. btrfs users should should access the bio insead. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07btrfs: update __btrfs_map_block for REQ_OP transitionMike Christie
We no longer pass in a bitmap of rq_flag_bits bits to __btrfs_map_block. It will always be a REQ_OP, or the btrfs specific REQ_GET_READ_MIRRORS, so this drops the bit tests. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07btrfs: use bio op accessorsMike Christie
This should be the easier cases to convert btrfs to bio_set_op_attrs/bio_op. They are mostly just cut and replace type of changes. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07btrfs: have submit_one_bio users use bio op accessorsMike Christie
This patch has btrfs's submit_one_bio users set the bio op using bio_set_op_attrs and get the op using bio_op. The next patches will continue to convert btrfs, so submit_bio_hook and merge_bio_hook related code will be modified to take only the bio. I did not do it in this patch to try and keep it smaller. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07direct-io: use bio set/get op accessorsMike Christie
This patch has the dio code use a REQ_OP for the op and rq_flag_bits for bi_rw flags. To set/get the op it uses the bio_set_op_attrs/bio_op accssors. It also begins to convert btrfs's dio_submit_t because of the dio submit_io callout use. The next patches will completely convert this code and the reset of the btrfs code paths. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07fs: have submit_bh users pass in op and flags separatelyMike Christie
This has submit_bh users pass in the operation and flags separately, so submit_bh_wbc can setup the bio op and bi_rw flags on the bio that is submitted. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07block/fs/drivers: remove rw argument from submit_bioMike Christie
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-06Btrfs: self-tests: Fix extent buffer bitmap test fail on BE systemFeifei Xu
In __test_eb_bitmaps(), we write random data to a bitmap. Then copy the bitmap to another bitmap that resides inside an extent buffer. Later we verify the values of corresponding bits in the bitmap and the bitmap inside the extent buffer. However, extent_buffer_test_bit() reads in byte granularity while test_bit() reads in unsigned long granularity. Hence we end up comparing wrong bits on big-endian systems such as ppc64. This commit fixes the issue by reading the bitmap in byte granularity. Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsizeFeifei Xu
With 64K sectorsize, 1G sized block group cannot span across bitmaps. To execute test_bitmaps() function, this commit allocates "BITS_PER_BITMAP * sectorsize + PAGE_SIZE" sized block group. Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06Btrfs: self-tests: Use macros instead of constants and add missing newlineFeifei Xu
This commit replaces numerical constants with appropriate preprocessor macros. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06Btrfs: self-tests: Support testing all possible sectorsizes and nodesizesFeifei Xu
To test all possible sectorsizes, this commit adds a sectorsize array. This commit executes the tests for all possible sectorsizes and nodesizes. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZEFeifei Xu
On ppc64, PAGE_SIZE is 64k which is same as BTRFS_MAX_METADATA_BLOCKSIZE. In such a scenario, we will never be able to have an extent buffer containing more than one page. Hence in such cases this commit does not execute the page straddling tests. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06btrfs: advertise which crc32c implementation is being used at module loadJeff Mahoney
Since several architectures support hardware-accelerated crc32c calculation, it would be nice to confirm that btrfs is actually using it. We can see an elevated use count for the module, but it doesn't actually show who the users are. This patch simply prints the name of the driver after successfully initializing the shash. Signed-off-by: Jeff Mahoney <jeffm@suse.com> [ added a helper and used in module load-time message ] Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06Btrfs: add validadtion checks for chunk loadingLiu Bo
To prevent fuzzed filesystem images from panic the whole system, we need various validation checks to refuse to mount such an image if btrfs finds any invalid value during loading chunks, including both sys_array and regular chunks. Note that these checks may not be sufficient to cover all corner cases, feel free to add more checks. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06Btrfs: add more validation checks for superblockLiu Bo
This adds validation checks for super_total_bytes, super_bytes_used and super_stripesize, super_num_devices. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-06Btrfs: clear uptodate flags of pages in sys_array ebLiu Bo
We set uptodate flag to pages in the temporary sys_array eb, but do not clear the flag after free eb. As the special btree inode may still hold a reference on those pages, the uptodate flag can remain alive in them. If btrfs_super_chunk_root has been intentionally changed to the offset of this sys_array eb, reading chunk_root will read content of sys_array and it will skip our beautiful checks in btree_readpage_end_io_hook() because of "pages of eb are uptodate => eb is uptodate" This adds the 'clear uptodate' part to force it to read from disk. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-04Merge branch 'for-linus-4.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "The important part of this pull is Filipe's set of fixes for btrfs device replacement. Filipe fixed a few issues seen on the list and a number he found on his own" * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent Btrfs: fix race between device replace and read repair Btrfs: fix race between device replace and discard Btrfs: fix race between device replace and chunk allocation Btrfs: fix race setting block group back to RW mode during device replace Btrfs: fix unprotected assignment of the left cursor for device replace Btrfs: fix race setting block group readonly during device replace Btrfs: fix race between device replace and block group removal Btrfs: fix race between readahead and device replace/removal
2016-06-03Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extentChris Mason
When dealing with inline extents, btrfs_get_extent will incorrectly try to insert a duplicate extent_map. The dup hits -EEXIST from add_extent_map, but then we try to merge with the existing one and end up trying to insert a zero length extent_map. This actually works most of the time, except when there are extent maps past the end of the inline extent. rocksdb will trigger this sometimes because it preallocates an extent and then truncates down. Josef made a script to trigger with xfs_io: #!/bin/bash xfs_io -f -c "pwrite 0 1000" inline xfs_io -c "falloc -k 4k 1M" inline xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline xfs_io -c "fadvise -d 0 1000" inline cat inline You'll get EIOs trying to read inline after this because add_extent_map is returning EEXIST Signed-off-by: Chris Mason <clm@fb.com>
2016-06-02Btrfs: self-tests: Support non-4k page sizeFeifei Xu
self-tests code assumes 4k as the sectorsize and nodesize. This commit fix hardcoded 4K. Enables the self-tests code to be executed on non-4k page sized systems (e.g. ppc64). Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-02Btrfs: Fix integer overflow when calculating bytes_per_bitmapFeifei Xu
On ppc64, bytes_per_bitmap will be (65536*8*65536). Hence append UL to fix integer overflow. Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-02Btrfs: test_check_exists: Fix infinite loop when searching for free space ↵Feifei Xu
entries On a ppc64 machine using 64K as the block size, assume that the RB tree at btrfs_free_space_ctl->free_space_offset contains following two entries: 1. A bitmap entry having an offset value of 0 and having the bits corresponding to the address range [128M+512K, 128M+768K] set. 2. An extent entry corresponding to the address range [128M-256K, 128M-128K] In such a scenario, test_check_exists() invoked for checking the existence of address range [128M+768K, 256M] can lead to an infinite loop as explained below: - Checking for the extent entry fails. - Checking for a bitmap entry results in the free space info in range [128M+512K, 128M+768K] beng returned. - rb_prev(info) returns NULL because the bitmap entry starting from offset 0 comes first in the RB tree. - current_node = bitmap node. - while (current_node) tmp = rb_next(bitmap_node);/*tmp is extent based free space entry*/ Since extent based free space entry's last address is smaller than the address being searched for (i.e. 128M+768K) we incorrectly again obtain the extent node as the "next right node" of the RB tree and thus end up looping infinitely. This patch fixes the issue by checking the "tmp" variable which point to the most recently searched free space node. Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Feifei Xu <xufeifei@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-06-01Btrfs: end transaction if we abort when creating uuid rootJosef Bacik
We still need to call btrfs_end_transaction if we call btrfs_abort_transaction, otherwise we hang and make me super grumpy. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-31Btrfs: fix race between device replace and read repairFilipe Manana
While we are finishing a device replace operation we can have a concurrent task trying to do a read repair operation, in which case it will call btrfs_map_block() to get a struct btrfs_bio which can have a stripe that points to the source device of the device replace operation. This allows for the read repair task to dereference the stripe's device pointer after the device replace operation has freed the source device, resulting in an invalid memory access. This is similar to the problem solved by my previous patch in the same series and named "Btrfs: fix race between device replace and discard". So fix this by surrounding the call to btrfs_map_block() and the code that uses the returned struct btrfs_bio with calls to btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the proper serialization with the finishing phase of the device replace operation. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-31Btrfs: fix race between device replace and discardFilipe Manana
While we are finishing a device replace operation, we can make a discard operation (fs mounted with -o discard) do an invalid memory access like the one reported by the following trace: [ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP [ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs] [ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1 [ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000 [ 3206.388595] RIP: 0010:[<ffffffff8124d233>] [<ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7 [ 3206.388595] RSP: 0018:ffff880171b9bb80 EFLAGS: 00010246 [ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000 [ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48 [ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001 [ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8 [ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b [ 3206.388595] FS: 00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000 [ 3206.388595] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0 [ 3206.388595] Stack: [ 3206.388595] 0000000000004878 0000000000000000 0000000002400040 0000000000000000 [ 3206.388595] 0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0 [ 3206.388595] ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0 [ 3206.388595] Call Trace: [ 3206.388595] [<ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs] [ 3206.388595] [<ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs] [ 3206.388595] [<ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs] [ 3206.388595] [<ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs] [ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b [ 3206.388595] [<ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs] [ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b [ 3206.388595] [<ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs] [ 3206.388595] [<ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e [ 3206.388595] [<ffffffff811a82c0>] vfs_fsync+0x1c/0x1e [ 3206.388595] [<ffffffff811a8417>] do_fsync+0x31/0x4a [ 3206.388595] [<ffffffff811a8637>] SyS_fsync+0x10/0x14 [ 3206.388595] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 3206.388595] [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14 [ 3206.388595] [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa This happens because when we call btrfs_map_block() from btrfs_discard_extent() to get a btrfs_bio structure, the device replace operation has not finished yet, but before we use the device of one of the stripes from the returned btrfs_bio structure, the device object is freed. This is illustrated by the following diagram. CPU 1 CPU 2 btrfs_dev_replace_start() (...) btrfs_dev_replace_finishing() btrfs_start_transaction() btrfs_commit_transaction() (...) btrfs_sync_file() btrfs_start_transaction() (...) btrfs_commit_transaction() btrfs_finish_extent_commit() btrfs_discard_extent() btrfs_map_block() --> returns a struct btrfs_bio with a stripe that has a device field pointing to source device of the replace operation (the device that is being replaced) mutex_lock(&uuid_mutex) mutex_lock(&fs_info->fs_devices->device_list_mutex) mutex_lock(&fs_info->chunk_mutex) btrfs_dev_replace_update_device_in_mapping_tree() --> iterates the mapping tree and for each extent map that has a stripe pointing to the source device, it updates the stripe to point to the target device instead btrfs_rm_dev_replace_blocked() --> waits for fs_info->bio_counter to go down to 0 btrfs_rm_dev_replace_remove_srcdev() --> removes source device from the list of devices mutex_unlock(&fs_info->chunk_mutex) mutex_unlock(&fs_info->fs_devices->device_list_mutex) mutex_unlock(&uuid_mutex) btrfs_rm_dev_replace_free_srcdev() --> frees the source device --> iterates over all stripes of the returned struct btrfs_bio --> for each stripe it dereferences its device pointer --> it ends up finding a pointer to the device used as the source device for the replace operation and that was already freed So fix this by surrounding the call to btrfs_map_block(), and the code that uses the returned struct btrfs_bio, with calls to btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that the finishing phase of the device replace operation blocks until the the bio counter decreases to zero before it frees the source device. This is the same approach we do at btrfs_map_bio() for example. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30Btrfs: fix race between device replace and chunk allocationFilipe Manana
While iterating and copying extents from the source device, the device replace code keeps adjusting a left cursor that is used to make sure that once we finish processing a device extent, any future writes to extents from the corresponding block group will get into both the source and target devices. This left cursor is also used for resuming the device replace operation at mount time. However using this left cursor to decide whether writes go into both devices or only the source device is not enough to guarantee we don't miss copying extents into the target device. There are two cases where the current approach fails. The first one is related to when there are holes in the device and they get allocated for new block groups while the device replace operation is iterating the device extents (more on this explained below). The second one is that when that loop over the device extents finishes, we start dellaloc, wait for all ordered extents and then commit the current transaction, we might have got new block groups allocated that are now using a device extent that has an offset greater then or equals to the value of the left cursor, in which case writes to extents belonging to these new block groups will get issued only to the source device. For the first case where the current approach of using a left cursor fails, consider the source device currently has the following layout: [ extent bg A ] [ hole, unallocated space ] [extent bg B ] 3Gb 4Gb 5Gb While we are iterating the device extents from the source device using the commit root of the device tree, the following happens: CPU 1 CPU 2 <we are at transaction N> scrub_enumerate_chunks() --> searches the device tree for extents belonging to the source device using the device tree's commit root --> 1st iteration finds extent belonging to block group A --> sets block group A to RO mode (btrfs_inc_block_group_ro) --> sets cursor left to found_key.offset which is 3Gb --> scrub_chunk() starts copies all allocated extents from block group's A stripe at source device into target device btrfs_alloc_chunk() --> allocates device extent in the range [4Gb, 5Gb[ from the source device for a new block group C extent allocated from block group C for a direct IO, buffered write or btree node/leaf extent is written to, perhaps in response to a writepages() call from the VM or directly through direct IO the write is made only against the source device and not against the target device because the extent's offset is in the interval [4Gb, 5Gb[ which is larger then the value of cursor_left (3Gb) --> scrub_chunks() finishes --> updates left cursor from 3Gb to 4Gb --> btrfs_dec_block_group_ro() sets block group A back to RW mode <we are still at transaction N> --> 2nd iteration finds extent belonging to block group B - it did not find the new extent in the range [4Gb, 5Gb[ for block group C because we are using the device tree's commit root or even because the block group's items are not all yet inserted in the respective btrees, that is, the block group is still attached to some transaction handle's new_bgs list and btrfs_create_pending_block_groups() was not called yet against that transaction handle, so the device extent items were not yet inserted into the devices tree <we are still at transaction N> --> so we end not copying anything from the newly allocated device extent from the source device to the target device So fix this by making __btrfs_map_block() always redirect writes to the target device as well, independently of the left cursor's value. With this change the left cursor is now used only for the purpose of tracking progress and allow a mount operation to resume a device replace. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30Btrfs: fix race setting block group back to RW mode during device replaceFilipe Manana
After it finishes processing a device extent, the device replace code sets back the block group to RW mode and then after that it sets the left cursor to match the logical end address of the block group, so that future writes into extents belonging to the block group go both the source (old) and target (new) devices. However from the moment we turn the block group back to RW mode we have a short time window, that lasts until we update the left cursor's value, where extents can be allocated from the block group and written to, in which case they will not be copied/written to the target (new) device. Fix this by updating the left cursor's value before turning the block group back to RW mode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30Btrfs: fix unprotected assignment of the left cursor for device replaceFilipe Manana
We were assigning new values to fields of the device replace object without holding the respective lock after processing each device extent. This is important for the left cursor field which can be accessed by a concurrent task running __btrfs_map_block (which, correctly, takes the device replace lock). So change these fields while holding the device replace lock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>