summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent-tree.c
AgeCommit message (Collapse)Author
2015-04-13btrfs: add WARN_ON() to check is space_info op currentZhao Lei
space_info's value calculation is some complex and easy to cause bug, add WARN_ON() to help debug. Changelog v1->v2: Put WARN_ON()s under the ENOSPC_DEBUG mount option. Suggested by: David Sterba <dsterba@suse.cz> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13btrfs: Set relative data on clear btrfs_block_group_cache->pinnedZhao Lei
Bug1: space_info->bytes_readonly was set to very large(negative) value in btrfs_remove_block_group(). Reason: Current code set block_group_cache->pinned = 0 in btrfs_delete_unused_bgs(), but above space was not counted to space_info->bytes_readonly. Then in btrfs_remove_block_group(): block_group->space_info->bytes_readonly -= block_group->key.offset; We can see following value in trace: btrfs_remove_block_group: pid=2677 comm=btrfs-cleaner WARNING: bytes_readonly=12582912, key.offset=134217728 Bug2: space_info->total_bytes_pinned grow to value larger than fs size. In a 1.2G fs, we can get following trace log: at first: ZL_DEBUG: add_pinned_bytes: pid=2710 comm=sync change total_bytes_pinned flags=1 869793792 + 95944704 = 965738496 after some op: ZL_DEBUG: add_pinned_bytes: pid=2770 comm=sync change total_bytes_pinned flags=1 1780178944 + 95944704 = 1876123648 after some op: ZL_DEBUG: add_pinned_bytes: pid=3193 comm=sync change total_bytes_pinned flags=1 2924568576 + 95551488 = 3020120064 ... Reason: Similar to bug1, we also need to adjust space_info->total_bytes_pinned in above code block. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13btrfs: Adjust commit-transaction condition to avoid NO_SPACE moreZhao Lei
If we have any chance to make a successful write, we should not give up. This patch adjust commit-transaction condition from: pinned >= wanted to left + pinned >= wanted Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13btrfs: fix condition of commit transactionZhao Lei
Old code bypass commit transaction when we don't have enough pinned space, but another case is there exist freed bgs in current transction, it have possibility to make alloc_chunk success. This patch modify the condition to: if (have_free_bg || have_pinned_space) commit_transaction() Confirmed above action by printk before and after patch. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10Btrfs: fix use after free when close_ctree frees the orphan_rsvChris Mason
Near the end of close_ctree, we're calling btrfs_free_block_rsv to free up the orphan rsv. The problem is this call updates the space_info, which has already been freed. This adds a new __ function that directly calls kfree instead of trying to update the space infos. Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10Btrfs: allow block group cache writeout outside critical section in commitChris Mason
We loop through all of the dirty block groups during commit and write the free space cache. In order to make sure the cache is currect, we do this while no other writers are allowed in the commit. If a large number of block groups are dirty, this can introduce long stalls during the final stages of the commit, which can block new procs trying to change the filesystem. This commit changes the block group cache writeout to take appropriate locks and allow it to run earlier in the commit. We'll still have to redo some of the block groups, but it means we can get most of the work out of the way without blocking the entire FS. Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10Btrfs: two stage dirty block group writeoutChris Mason
Block group cache writeout is currently waiting on the pages for each block group cache before moving on to writing the next one. This commit switches things around to send down all the caches and then wait on them in batches. The end result is much faster, since we're keeping the disk pipeline full. Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10Btrfs: don't commit the transaction in the async space flushingJosef Bacik
We're triggering a huge number of commits from btrfs_async_reclaim_metadata_space. These aren't really requried, because everyone calling the async reclaim code is going to end up triggering a commit on their own. Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10Btrfs: reserve space for block groupsJosef Bacik
This changes our delayed refs calculations to include the space needed to write back dirty block groups. Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10Btrfs: refill block reserves during truncateChris Mason
When truncate starts, it allocates some space in the block reserves so that we'll have enough to update metadata along the way. For very large files, we can easily go through all of that space as we loop through the extents. This changes truncate to refill the space reservation as it progresses through the file. Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10Btrfs: account for crcs in delayed ref processingJosef Bacik
As we delete large extents, we end up doing huge amounts of COW in order to delete the corresponding crcs. This adds accounting so that we keep track of that space and flushing of delayed refs so that we don't build up too much delayed crc work. This helps limit the delayed work that must be done at commit time and tries to avoid ENOSPC aborts because the crcs eat all the global reserves. Signed-off-by: Chris Mason <clm@fb.com>
2015-04-01Btrfs: free and unlock our path before btrfs_free_and_pin_reserved_extent()Chris Mason
The error handling path for alloc_reserved_tree_block is calling btrfs_free_and_pin_reserved_extent with a spinning tree lock held. This might sleep as we allocate extent_state objects: BUG: sleeping function called from invalid context at mm/slub.c:1268 in_atomic(): 1, irqs_disabled(): 0, pid: 11093, name: kworker/u4:7 5 locks held by kworker/u4:7/11093: #0: ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520 #1: ((&work->normal_work)){+.+.+.}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520 #2: (sb_internal){++++.+}, at: [<ffffffffa003a70e>] start_transaction+0x43e/0x590 [btrfs] #3: (&head_ref->mutex){+.+...}, at: [<ffffffffa0089f8c>] btrfs_delayed_ref_lock+0x4c/0x240 [btrfs] #4: (btrfs-extent-00){++++..}, at: [<ffffffffa007697b>] btrfs_clear_lock_blocking_rw+0x9b/0x150 [btrfs] CPU: 0 PID: 11093 Comm: kworker/u4:7 Tainted: G W 4.0.0-rc6-default+ #246 Hardware name: Intel Corporation Santa Rosa platform/Matanzas, BIOS TSRSCRB1.86C.0047.B00.0610170821 10/17/06 Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] 00000000000004f4 ffff88006dd17848 ffffffff81ab0e3b ffff88006dd17848 ffff88007a944760 ffff88006dd17868 ffffffff8109d516 ffff88006dd17898 0000000000000000 ffff88006dd17898 ffffffff8109d5b2 ffffffff81aba2bb Call Trace: [<ffffffff81ab0e3b>] dump_stack+0x4f/0x6c [<ffffffff8109d516>] ___might_sleep+0xf6/0x140 [<ffffffff8109d5b2>] __might_sleep+0x52/0x90 [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34 [<ffffffff81196363>] kmem_cache_alloc+0x163/0x1b0 [<ffffffffa0056f31>] ? alloc_extent_state+0x31/0x150 [btrfs] [<ffffffffa0056f20>] ? alloc_extent_state+0x20/0x150 [btrfs] [<ffffffffa0056f31>] alloc_extent_state+0x31/0x150 [btrfs] [<ffffffffa005805b>] __set_extent_bit+0x37b/0x5d0 [btrfs] [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34 [<ffffffffa005888d>] ? set_extent_bit+0xd/0x30 [btrfs] [<ffffffffa00588a3>] set_extent_bit+0x23/0x30 [btrfs] [<ffffffffa0058e80>] set_extent_dirty+0x20/0x30 [btrfs] [<ffffffffa00195ba>] pin_down_extent+0xaa/0x170 [btrfs] [<ffffffffa001d8ef>] __btrfs_free_reserved_extent+0xcf/0x160 [btrfs] [<ffffffffa0023856>] btrfs_free_and_pin_reserved_extent+0x16/0x20 [btrfs] [<ffffffffa002482a>] __btrfs_run_delayed_refs+0xfca/0x1290 [btrfs] [<ffffffffa0026eae>] btrfs_run_delayed_refs+0x6e/0x2e0 [btrfs] [<ffffffffa0027378>] delayed_ref_async_start+0x48/0xb0 [btrfs] [<ffffffffa006c883>] normal_work_helper+0x83/0x350 [btrfs] [<ffffffffa006cd79>] ? btrfs_extent_refs_helper+0x9/0x20 [btrfs] [<ffffffffa006cd82>] btrfs_extent_refs_helper+0x12/0x20 [btrfs] [<ffffffff81091dcb>] process_one_work+0x1cb/0x520 [<ffffffff81091d51>] ? process_one_work+0x151/0x520 [<ffffffff811c7abf>] ? seq_read+0x3f/0x400 [<ffffffff8109260b>] worker_thread+0x5b/0x4e0 [<ffffffff81097be2>] ? __kthread_parkme+0x12/0xa0 [<ffffffff810925b0>] ? rescuer_thread+0x450/0x450 [<ffffffff81098686>] kthread+0xf6/0x120 [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0 [<ffffffff81ab8088>] ret_from_fork+0x58/0x90 [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0 ------------[ cut here ]------------ This changes things to free the path first, which will also unlock the extent buffer. Signed-off-by: Chris Mason <clm@fb.com> Reported-by: Dave Sterba <dsterba@suse.cz> Tested-by: Dave Sterba <dsterba@suse.cz>
2015-03-26Btrfs: fix log tree corruption when fs mounted with -o discardFilipe Manana
While committing a transaction we free the log roots before we write the new super block. Freeing the log roots implies marking the disk location of every node/leaf (metadata extent) as pinned before the new super block is written. This is to prevent the disk location of log metadata extents from being reused before the new super block is written, otherwise we would have a corrupted log tree if before the new super block is written a crash/reboot happens and the location of any log tree metadata extent ended up being reused and rewritten. Even though we pinned the log tree's metadata extents, we were issuing a discard against them if the fs was mounted with the -o discard option, resulting in corruption of the log tree if a crash/reboot happened before writing the new super block - the next time the fs was mounted, during the log replay process we would find nodes/leafs of the log btree with a content full of zeroes, causing the process to fail and require the use of the tool btrfs-zero-log to wipeout the log tree (and all data previously fsynced becoming lost forever). Fix this by not doing a discard when pinning an extent. The discard will be done later when it's safe (after the new super block is committed) at extent-tree.c:btrfs_finish_extent_commit(). Fixes: e688b7252f78 (Btrfs: fix extent pinning bugs in the tree log) CC: <stable@vger.kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-25Merge branch 'cleanups-post-3.19' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1 Signed-off-by: Chris Mason <clm@fb.com> Conflicts: fs/btrfs/disk-io.c
2015-03-25Merge branch 'cleanups-for-4.1-v2' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1
2015-03-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Most of these are fixing extent reservation accounting, or corners with tree writeback during commit. Josef's set does add a test, which isn't strictly a fix, but it'll keep us from making this same mistake again" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix outstanding_extents accounting in DIO Btrfs: add sanity test for outstanding_extents accounting Btrfs: just free dummy extent buffers Btrfs: account merges/splits properly Btrfs: prepare block group cache before writing Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list) Btrfs: account for the correct number of extents for delalloc reservations Btrfs: fix merge delalloc logic Btrfs: fix comp_oper to get right order Btrfs: catch transaction abortion after waiting for it btrfs: fix sizeof format specifier in btrfs_check_super_valid()
2015-03-17Btrfs: add sanity test for outstanding_extents accountingJosef Bacik
I introduced a regression wrt outstanding_extents accounting. These are tricky areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE at any time. So add sanity tests to cover the various conditions that are tricky in order to make sure we don't introduce regressions in the future. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17Btrfs: prepare block group cache before writingJosef Bacik
Writing the block group cache will modify the extent tree quite a bit because it truncates the old space cache and pre-allocates new stuff. To try and cut down on the churn lets do the setup dance first, then later on hopefully we can avoid looping with newly dirtied roots. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-13Btrfs: account for the correct number of extents for delalloc reservationsJosef Bacik
Direct IO can easily pass in an buffer that is greater than BTRFS_MAX_EXTENT_SIZE, so take this into account when reserving extents in the delalloc reservation code. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Outside of misc fixes, Filipe has a few fsync corners and we're pulling in one more of Josef's fixes from production use here" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref. Btrfs: fix data loss in the fast fsync path Btrfs: remove extra run_delayed_refs in update_cowonly_root Btrfs: incremental send, don't rename a directory too soon btrfs: fix lost return value due to variable shadowing Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr Btrfs: fix off-by-one logic error in btrfs_realloc_node Btrfs: add missing inode update when punching hole Btrfs: abort the transaction if we fail to update the free space cache inode Btrfs: fix fsync race leading to ordered extent memory leaks
2015-03-03btrfs: replace remaining do_div calls with div_u64 variantsDavid Sterba
Switch to div_u64_rem that does type checking and has more obvious semantics than do_div. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: cleanup 64bit/32bit divs, provably bounded valuesDavid Sterba
The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type. Get rid of a few more do_div instances. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03btrfs: cleanup 64bit/32bit divs, compile time constantsDavid Sterba
Switch to div_u64 if the divisor is a numeric constant or sum of sizeof()s. We can remove a few instances of do_div that has the hidden semtantics of changing the 1st argument. Small power-of-two divisors are converted to bitshifts, large values are kept intact for clarity. Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-02Btrfs: abort the transaction if we fail to update the free space cache inodeJosef Bacik
Our gluster boxes were hitting a problem where they'd run out of space when updating the block group cache and therefore wouldn't be able to update the free space inode. This is a problem because this is how we invalidate the cache and protect ourselves from errors further down the stack, so if this fails we have to abort the transaction so we make sure we don't end up with stale free space cache. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-19Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This pull is mostly cleanups and fixes: - The raid5/6 cleanups from Zhao Lei fixup some long standing warts in the code and add improvements on top of the scrubbing support from 3.19. - Josef has round one of our ENOSPC fixes coming from large btrfs clusters here at FB. - Dave Sterba continues a long series of cleanups (thanks Dave), and Filipe continues hammering on corner cases in fsync and others This all was held up a little trying to track down a use-after-free in btrfs raid5/6. It's not clear yet if this is just made easier to trigger with this pull or if its a new bug from the raid5/6 cleanups. Dave Sterba is the only one to trigger it so far, but he has a consistent way to reproduce, so we'll get it nailed shortly" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits) Btrfs: don't remove extents and xattrs when logging new names Btrfs: fix fsync data loss after adding hard link to inode Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group Btrfs: account for large extents with enospc Btrfs: don't set and clear delalloc for O_DIRECT writes Btrfs: only adjust outstanding_extents when we do a short write btrfs: Fix out-of-space bug Btrfs: scrub, fix sleep in atomic context Btrfs: fix scheduler warning when syncing log Btrfs: Remove unnecessary placeholder in btrfs_err_code btrfs: cleanup init for list in free-space-cache btrfs: delete chunk allocation attemp when setting block group ro btrfs: clear bio reference after submit_one_bio() Btrfs: fix scrub race leading to use-after-free Btrfs: add missing cleanup on sysfs init failure Btrfs: fix race between transaction commit and empty block group removal btrfs: add more checks to btrfs_read_sys_array btrfs: cleanup, rename a few variables in btrfs_read_sys_array btrfs: add checks for sys_chunk_array sizes btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize ...
2015-02-16btrfs: cleanup: remove no-used alloc_chunk in btrfs_check_data_free_space()Zhao Lei
int alloc_chunk is never used in this function, remove it. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16Btrfs: disk-io: replace root args iff only fs_info usedDaniel Dressler
This is the 3rd independent patch of a larger project to cleanup btrfs's internal usage of btrfs_root. Many functions take btrfs_root only to grab the fs_info struct. By requiring a root these functions cause programmer overhead. That these functions can accept any valid root is not obvious until inspection. This patch reduces the specificity of such functions to accept the fs_info directly. These patches can be applied independently and thus are not being submitted as a patch series. There should be about 26 patches by the project's completion. Each patch will cleanup between 1 and 34 functions apiece. Each patch covers a single file's functions. This patch affects the following function(s): 1) csum_tree_block 2) csum_dirty_buffer 3) check_tree_block_fsid 4) btrfs_find_tree_block 5) clean_tree_block Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-14Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block groupForrest Liu
Removing large amount of block group in a transaction may encounters BUG_ON() in btrfs_orphan_add(). That is because btrfs_orphan_reserve_metadata() will grab metadata reservation from transaction handle, and btrfs_delete_unused_bgs() didn't reserve metadata for trnasaction handle when delete unused block group. The problem can be reproduce by following script mntpath=/btrfs loopdev=/dev/loop0 filepath=/home/forrest/image umount $mntpath losetup -d $loopdev truncate --size 1000g $filepath losetup $loopdev $filepath mkfs.btrfs -f $loopdev mount $loopdev $mntpath for j in `seq 1 1 1000`; do fallocate -l 1g $mntpath/$j done # wait cleaner thread remove unused block group sleep 300 The call trace that results from the BUG_ON() is: [ 613.093084] ------------[ cut here ]------------ [ 613.097928] kernel BUG at fs/btrfs/inode.c:3142! [ 613.105855] invalid opcode: 0000 [#1] SMP [ 613.112702] Modules linked in: coretemp(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) snd_ens1371(E) snd_ac97_codec(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ppdev(E) ac97_bus(E) ablk_helper(E) gameport(E) cryptd(E) snd_rawmidi(E) snd_seq_device(E) snd_pcm(E) vmw_balloon(E) snd_timer(E) snd(E) soundcore(E) serio_raw(E) vmwgfx(E) ttm(E) drm_kms_helper(E) drm(E) vmw_vmci(E) parport_pc(E) shpchp(E) i2c_piix4(E) mac_hid(E) lp(E) parport(E) btrfs(E) xor(E) raid6_pq(E) hid_generic(E) usbhid(E) hid(E) psmouse(E) ahci(E) libahci(E) e1000(E) mptspi(E) mptscsih(E) mptbase(E) floppy(E) vmw_pvscsi(E) vmxnet3(E) [ 613.144196] CPU: 0 PID: 1480 Comm: btrfs-cleaner Tainted: G E 3.19.0-rc7-custom #2 [ 613.148501] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 613.152694] task: ffff880035cdb1a0 ti: ffff880039cf4000 task.ti: ffff880039cf4000 [ 613.154969] RIP: 0010:[<ffffffffa01441c2>] [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs] [ 613.157780] RSP: 0018:ffff880039cf7c48 EFLAGS: 00010286 [ 613.159560] RAX: 00000000ffffffe4 RBX: ffff88003bd981a0 RCX: ffff88003c9e4000 [ 613.161904] RDX: 0000000000002244 RSI: 0000000000040000 RDI: ffff88003c9e4138 [ 613.164264] RBP: ffff880039cf7c88 R08: 000060ffc0000850 R09: 0000000000000000 [ 613.166507] R10: ffff88003bc4b7a0 R11: ffffea0000eb6740 R12: ffff88003c9c0000 [ 613.168681] R13: ffff88003c102160 R14: ffff88003c9c0458 R15: 0000000000000001 [ 613.170932] FS: 0000000000000000(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000 [ 613.173316] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 613.175227] CR2: 00007f6343537000 CR3: 0000000036329000 CR4: 00000000000407f0 [ 613.177554] Stack: [ 613.178712] ffff880039cf7c88 ffffffffa0182a54 ffff88003c9e4b04 ffff88003c9c7800 [ 613.181297] ffff88003bc4b7a0 ffff88003bd981a0 ffff88003c8db200 ffff88003c2fcc60 [ 613.183782] ffff880039cf7d18 ffffffffa012da97 ffff88003bc4b7a4 ffff88003bc4b7a0 [ 613.186171] Call Trace: [ 613.187493] [<ffffffffa0182a54>] ? lookup_free_space_inode+0x44/0x100 [btrfs] [ 613.189801] [<ffffffffa012da97>] btrfs_remove_block_group+0x137/0x740 [btrfs] [ 613.192126] [<ffffffffa0166912>] btrfs_remove_chunk+0x672/0x780 [btrfs] [ 613.194267] [<ffffffffa012e2ff>] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs] [ 613.196567] [<ffffffffa0135e4c>] cleaner_kthread+0x12c/0x190 [btrfs] [ 613.198687] [<ffffffffa0135d20>] ? check_leaf+0x350/0x350 [btrfs] [ 613.200758] [<ffffffff8108f232>] kthread+0xd2/0xf0 [ 613.202616] [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180 [ 613.204738] [<ffffffff8175dabc>] ret_from_fork+0x7c/0xb0 [ 613.206652] [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180 [ 613.208741] Code: ff ff 0f 1f 80 00 00 00 00 89 45 c8 3e 80 63 80 fd 48 89 df e8 d0 23 fe ff 8b 45 c8 e9 14 ff ff ff b8 f4 ff ff ff e9 12 ff ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 [ 613.216562] RIP [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs] [ 613.218828] RSP <ffff880039cf7c48> [ 613.220382] ---[ end trace 71073106deb8a457 ]--- This patch replace btrfs_join_transaction() with btrfs_start_transaction() in btrfs_delete_unused_bgs() to revent BUG_ON() in btrfs_orphan_add() Signed-off-by: Forrest Liu <forrestl@synology.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14Btrfs: account for large extents with enospcJosef Bacik
On our gluster boxes we stream large tar balls of backups onto our fses. With 160gb of ram this means we get really large contiguous ranges of dirty data, but the way our ENOSPC stuff works is that as long as it's contiguous we only hold metadata reservation for one extent. The problem is we limit our extents to 128mb, so we'll end up with at least 800 extents so our enospc accounting is quite a bit lower than what we need. To keep track of this make sure we increase outstanding_extents for every multiple of the max extent size so we can be sure to have enough reserved metadata space. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02btrfs: delete chunk allocation attemp when setting block group roShaohua Li
Below test will fail currently: mkfs.ext4 -F /dev/sda btrfs-convert /dev/sda mount /dev/sda /mnt btrfs device add -f /dev/sdb /mnt btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt The reason is there are some block groups with usage 0, but the whole disk hasn't free space to allocate new chunk, so we even can't set such block group readonly. This patch deletes the chunk allocation when setting block group ro. For META, we already have reserve. But for SYSTEM, we don't have, so the check_system_chunk is still required. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02Btrfs: fix race between transaction commit and empty block group removalFilipe Manana
Committing a transaction can race with automatic removal of empty block groups (cleaner kthread), leading to a BUG_ON() in the transaction commit code while running btrfs_finish_extent_commit(). The following sequence diagram shows how it can happen: CPU 1 CPU 2 btrfs_commit_transaction() fs_info->running_transaction = NULL btrfs_finish_extent_commit() find_first_extent_bit() -> found range for block group X in fs_info->freed_extents[] btrfs_delete_unused_bgs() -> found block group X Removed block group X's range from fs_info->freed_extents[] btrfs_remove_chunk() btrfs_remove_block_group(bg X) unpin_extent_range(bg X range) btrfs_lookup_block_group(bg X) -> returns NULL -> BUG_ON() The trace that results from the BUG_ON() is: [48665.187808] ------------[ cut here ]------------ [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675! [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G W 3.19.0-rc5-btrfs-next-4+ #1 [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000 [48665.197388] RIP: 0010:[<ffffffffa0350d05>] [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP: 0018:ffff8801b56a7b88 EFLAGS: 00010246 [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8 [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0 [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000 [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000 [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000 [48665.197388] FS: 0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000 [48665.197388] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0 [48665.197388] Stack: [48665.197388] ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff [48665.197388] ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0 [48665.197388] ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227 [48665.197388] Call Trace: [48665.197388] [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs] [48665.197388] [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs] [48665.197388] [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs] [48665.197388] [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33 [48665.197388] [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs] [48665.197388] [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab [48665.197388] [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab [48665.197388] [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf [48665.197388] [<ffffffff8105a55b>] worker_thread+0x210/0x2d0 [48665.197388] [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3 [48665.197388] [<ffffffff8105e5c0>] kthread+0xef/0xf7 [48665.197388] [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39 [48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad [48665.197388] [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0 [48665.197388] [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b [48665.197388] RIP [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs] [48665.197388] RSP <ffff8801b56a7b88> [48665.272246] ---[ end trace b9c6ab9957521376 ]--- Fix this by ensuring that unpining the block group's range in btrfs_finish_extent_commit() is done in a synchronized fashion with removing the block group's range from freed_extents[] in btrfs_delete_unused_bgs() This race got introduced with the change: Btrfs: remove empty block groups automatically commit 47ab2a6c689913db23ccae38349714edf8365e0a Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Btrfs: cleanup unused run_mostLiu Bo
"run_most" is not used anymore. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Btrfs: add ref_count and free function for btrfs_bioZhao Lei
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag to keep btrfs_bio's memory in raid56 recovery implement. 2: free function for bbio will make code clean and flexible, plus forced data type checking in compile. Changelog v1->v2: Rename following by David Sterba's suggestion: put_btrfs_bio() -> btrfs_put_bio() get_btrfs_bio() -> btrfs_get_bio() bbio->ref_count -> bbio->refs Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Btrfs: lookup for block group only if needed when freeing a tree blockFilipe Manana
Very often our extent buffer's header generation doesn't match the current transaction's id or it is also referenced by other trees (snapshots), so we don't need the corresponding block group cache object. Therefore only search for it if we are going to use it, so we avoid an unnecessary search in the block groups rbtree (and acquiring and releasing its spinlock). Freeing a tree block is performed when COWing or deleting a node/leaf, which implies we are holding the node/leaf's parent node lock, therefore reducing the amount of time spent when freeing a tree block helps reducing the amount of time we are holding the parent node's lock. For example, for a run of xfstests/generic/083, the block group cache object was needed only 682 times for a total of 226691 calls to free a tree block. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Merge branch 'cleanup/blocksize-diet-part2' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
2015-01-21Btrfs: track dirty block groups on their own listJosef Bacik
Currently any time we try to update the block groups on disk we will walk _all_ block groups and check for the ->dirty flag to see if it is set. This function can get called several times during a commit. So if you have several terabytes of data you will be a very sad panda as we will loop through _all_ of the block groups several times, which makes the commit take a while which slows down the rest of the file system operations. This patch introduces a dirty list for the block groups that we get added to when we dirty the block group for the first time. Then we simply update any block groups that have been dirtied since the last time we called btrfs_write_dirty_block_groups. This allows us to clean up how we write the free space cache out so it is much cleaner. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19Btrfs: fix race deleting block group from space_info->ro_bgs listFilipe Manana
When removing a block group we were deleting it from its space_info's ro_bgs list without the correct protection - the space info's spinlock. Fix this by doing the list delete while holding the spinlock of the corresponding space info, which is the correct lock for any operation on that list. This issue was introduced in the 3.19 kernel by the following change: Btrfs: move read only block groups onto their own list V2 commit 633c0aad4c0243a506a3e8590551085ad78af82d I ran into a kernel crash while a task was running statfs, which iterates the space_info->ro_bgs list while holding the space info's spinlock, and another task was deleting it from the same list, without holding that spinlock, as part of the block group remove operation (while running the function btrfs_remove_block_group). This happened often when running the stress test xfstests/generic/038 I recently made. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-02Btrfs: abort transaction if we don't find the block groupJosef Bacik
We shouldn't BUG_ON() if there is corruption. I hit this while testing my block group patch and the abort worked properly. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-12-12btrfs: sink blocksize parameter to btrfs_find_create_tree_blockDavid Sterba
Finally it's clear that the requested blocksize is always equal to nodesize, with one exception, the superblock. Superblock has fixed size regardless of the metadata block size, but uses the same helpers to initialize sys array/chunk tree and to work with the chunk items. So it pretends to be an extent_buffer for a moment, btrfs_read_sys_array is full of special cases, we're adding one more. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12btrfs: sink blocksize parameter to btrfs_init_new_bufferDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12btrfs: sink blocksize parameter to readahead_tree_blockDavid Sterba
All callers pass nodesize. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-10Btrfs: remove non-sense btrfs_error_discard_extent() functionFilipe Manana
It doesn't do anything special, it just calls btrfs_discard_extent(), so just remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10Btrfs: fix fs corruption on transaction abort if device supports discardFilipe Manana
When we abort a transaction we iterate over all the ranges marked as dirty in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them from those trees, add them back (unpin) to the free space caches and, if the fs was mounted with "-o discard", perform a discard on those regions. Also, after adding the regions to the free space caches, a fitrim ioctl call can see those ranges in a block group's free space cache and perform a discard on the ranges, so the same issue can happen without "-o discard" as well. This causes corruption, affecting one or multiple btree nodes (in the worst case leaving the fs unmountable) because some of those ranges (the ones in the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are referred by the last committed super block - breaking the rule that anything that was committed by a transaction is untouched until the next transaction commits successfully. I ran into this while running in a loop (for several hours) the fstest that I recently submitted: [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim The corruption always happened when a transaction aborted and then fsck complained like this: _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent *** fsck.btrfs output *** Check tree block failed, want=94945280, have=0 Check tree block failed, want=94945280, have=0 Check tree block failed, want=94945280, have=0 Check tree block failed, want=94945280, have=0 Check tree block failed, want=94945280, have=0 read block failed check_tree_block Couldn't open file system In this case 94945280 corresponded to the root of a tree. Using frace what I observed was the following sequence of steps happened: 1) transaction N started, fs_info->pinned_extents pointed to fs_info->freed_extents[0]; 2) node/eb 94945280 is created; 3) eb is persisted to disk; 4) transaction N commit starts, fs_info->pinned_extents now points to fs_info->freed_extents[1], and transaction N completes; 5) transaction N + 1 starts; 6) eb is COWed, and btrfs_free_tree_block() called for this eb; 7) eb range (94945280 to 94945280 + 16Kb) is added to fs_info->pinned_extents (fs_info->freed_extents[1]); 8) Something goes wrong in transaction N + 1, like hitting ENOSPC for example, and the transaction is aborted, turning the fs into readonly mode. The stack trace I got for example: [112065.253935] [<ffffffff8140c7b6>] dump_stack+0x4d/0x66 [112065.254271] [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98 [112065.254567] [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs] [112065.261674] [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50 [112065.261922] [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs] [112065.262211] [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs] [112065.262545] [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs] [112065.262771] [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs] [112065.263105] [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs] (...) [112065.264493] ---[ end trace dd7903a975a31a08 ]--- [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left [112065.264997] BTRFS info (device sdc): forced readonly 9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in fs_info->fs_state and calls btrfs_cleanup_transaction(), which in turn calls btrfs_destroy_pinned_extent(); 10) Then btrfs_destroy_pinned_extent() iterates over all the ranges marked as dirty in fs_info->freed_extents[], and for each one it calls discard, if the fs was mounted with "-o discard", and adds the range to the free space cache of the respective block group; 11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path, sees the free space entries and performs a discard; 12) After an umount and mount (or fsck), our eb's location on disk was full of zeroes, and it should have been untouched, because it was marked as dirty in the fs_info->pinned_extents tree, and therefore used by the trees that the last committed superblock points to. Fix this by not performing a discard and not adding the ranges to the free space caches - it's useless from this point since the fs is now in readonly mode and we won't write free space caches to disk anymore (otherwise we would leak space) nor any new superblock. By not adding the ranges to the free space caches, it prevents other code paths from allocating that space and write to it as well, therefore being safer and simpler. This isn't a new problem, as it's been present since 2011 (git commit acce952b0263825da32cf10489413dec78053347). Cc: stable@vger.kernel.org # any kernel released after 2011-01-06 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10Btrfs: always clear a block group node when removing it from the treeFilipe Manana
Always clear a block group's rbnode after removing it from the rbtree to ensure that any tasks that might be holding a reference on the block group don't end up accessing stale rbnode left and right child pointers through next_block_group(). This is a leftover from the change titled: "Btrfs: fix invalid block group rbtree access after bg is removed" Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02Btrfs: make get_caching_control unconditionally return the ctlJosef Bacik
This was written when we didn't do a caching control for the fast free space cache loading. However we started doing that a long time ago, and there is still a small window of time that we could be caching the block group the fast way, so if there is a caching_ctl at all on the block group just return it, the callers all wait properly for what they want. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02Btrfs: fix unprotected deletion from pending_chunks listFilipe Manana
On block group remove if the corresponding extent map was on the transaction->pending_chunks list, we were deleting the extent map from that list, through remove_extent_mapping(), without any synchronization with chunk allocation (which iterates that list and adds new elements to it). Fix this by ensure that this is done while the chunk mutex is held, since that's the mutex that protects the list in the chunk allocation code path. This applies on top (depends on) of my previous patch titled: "Btrfs: fix race between fs trimming and block group remove/allocation" But the issue in fact was already present before that change, it only became easier to hit after Josef's 3.18 patch that added automatic removal of empty block groups. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02Btrfs: fix fs mapping extent map leakFilipe Manana
On chunk allocation error (label "error_del_extent"), after adding the extent map to the tree and to the pending chunks list, we would leave decrementing the extent map's refcount by 2 instead of 3 (our allocation + tree reference + list reference). Also, on chunk/block group removal, if the block group was on the list pending_chunks we weren't decrementing the respective list reference. Detected by 'rmmod btrfs': [20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects [20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G W L 3.17.0-rc5-btrfs-next-1+ #1 [20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [20770.106130] 0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040 [20770.106132] ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0 [20770.106134] ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78 [20770.106136] Call Trace: [20770.106142] [<ffffffff813e7a13>] dump_stack+0x45/0x56 [20770.106145] [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90 [20770.106164] [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs] [20770.106176] [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs] [20770.106179] [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4 [20770.106182] [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c [20770.106184] [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b This applies on top (depends on) of my previous patch titled: "Btrfs: fix race between fs trimming and block group remove/allocation" But the issue in fact was already present before that change, it only became easier to hit after Josef's 3.18 patch that added automatic removal of empty block groups. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02Btrfs: make btrfs_abort_transaction consider existence of new block groupsFilipe Manana
If the transaction handle doesn't have used blocks but has created new block groups make sure we turn the fs into readonly mode too. This is because the new block groups didn't get all their metadata persisted into the chunk and device trees, and therefore if a subsequent transaction starts, allocates space from the new block groups, writes data or metadata into that space, commits successfully and then after we unmount and mount the filesystem again, the same space can be allocated again for a new block group, resulting in file data or metadata corruption. Example where we don't abort the transaction when we fail to finish the chunk allocation (add items to the chunk and device trees) and later a future transaction where the block group is removed fails because it can't find the chunk item in the chunk tree: [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]() [25230.404301] BTRFS: Transaction aborted (error -28) [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs] [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1 [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [25230.404328] 0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50 [25230.404330] ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4 [25230.404332] ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8 [25230.404334] Call Trace: [25230.404338] [<ffffffff813e7a13>] dump_stack+0x45/0x56 [25230.404342] [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98 [25230.404351] [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404353] [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50 [25230.404362] [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs] [25230.404374] [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs] [25230.404387] [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs] [25230.404398] [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs] [25230.404408] [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs] [25230.404421] [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs] [25230.404425] [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37 [25230.404429] [<ffffffff810f6501>] ? get_page+0x1a/0x2b [25230.404441] [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs] [25230.404443] [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846 [25230.404446] [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18 [25230.404449] [<ffffffff81138d68>] new_sync_write+0x7c/0xa0 [25230.404450] [<ffffffff81139401>] vfs_write+0xb0/0x112 [25230.404452] [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84 [25230.404454] [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]--- [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left). [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.) Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02Btrfs: fix race between fs trimming and block group remove/allocationFilipe Manana
Our fs trim operation, which is completely transactionless (doesn't start or joins an existing transaction) consists of visiting all block groups and then for each one to iterate its free space entries and perform a discard operation against the space range represented by the free space entries. However before performing a discard, the corresponding free space entry is removed from the free space rbtree, and when the discard completes it is added back to the free space rbtree. If a block group remove operation happens while the discard is ongoing (or before it starts and after a free space entry is hidden), we end up not waiting for the discard to complete, remove the extent map that maps logical address to physical addresses and the corresponding chunk metadata from the the chunk and device trees. After that and before the discard completes, the current running transaction can finish and a new one start, allowing for new block groups that map to the same physical addresses to be allocated and written to. So fix this by keeping the extent map in memory until the discard completes so that the same physical addresses aren't reused before it completes. If the physical locations that are under a discard operation end up being used for a new metadata block group for example, and dirty metadata extents are written before the discard finishes (the VM might call writepages() of our btree inode's i_mapping for example, or an fsync log commit happens) we end up overwriting metadata with zeroes, which leads to errors from fsck like the following: checking extents Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block owner ref check failed [833912832 16384] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block root 5 root dir 256 error root 5 inode 260 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref root 5 inode 262 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref root 5 inode 263 errors 2001, no inode item, link count wrong (...) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02Btrfs: fix freeing used extents after removing empty block groupFilipe Manana
There's a race between adding a block group to the list of the unused block groups and removing an unused block group (cleaner kthread) that leads to freeing extents that are in use or a crash during transaction commmit. Basically the cleaner kthread, when executing btrfs_delete_unused_bgs(), might catch the newly added block group to the list fs_info->unused_bgs and clear the range representing the whole group from fs_info->freed_extents[] before the task that added the block group to the list (running update_block_group()) marked the last freed extent as dirty in fs_info->freed_extents (pinned_extents). That is: CPU 1 CPU 2 btrfs_delete_unused_bgs() update_block_group() add block group to fs_info->unused_bgs got block group from the list clear_extent_bits for the whole block group range in freed_extents[] set_extent_dirty for the range covering the freed extent in freed_extents[] (fs_info->pinned_extents) block group deleted, and a new block group with the same logical address is created reserve space from the new block group for new data or metadata - the reserved space overlaps the range specified by CPU 1 for set_extent_dirty() commit transaction find all ranges marked as dirty in fs_info->pinned_extents, clear them and add them to the free space cache Alternatively, if CPU 2 doesn't create a new block group with the same logical address, we get a crash/BUG_ON at transaction commit when unpining extent ranges because we can't find a block group for the range marked as dirty by CPU 1. Sample trace: [ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC [ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio [ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1 [ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000 [ 2163.429157] RIP: 0010:[<ffffffffa037728e>] [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs] [ 2163.429562] RSP: 0018:ffff8801356bfda8 EFLAGS: 00010246 [ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080 [ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118 [ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0 [ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000 [ 2163.430042] FS: 0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000 [ 2163.430042] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0 [ 2163.430042] Stack: [ 2163.430042] ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800 [ 2163.430042] ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000 [ 2163.430042] ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff [ 2163.430042] Call Trace: [ 2163.430042] [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs] [ 2163.430042] [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs] [ 2163.430042] [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs] [ 2163.430042] [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs] [ 2163.430042] [<ffffffff8105966b>] kthread+0xb7/0xbf [ 2163.430042] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67 [ 2163.430042] [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0 [ 2163.430042] [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67 So fix this by making update_block_group() first set the range as dirty in pinned_extents before adding the block group to the unused_bgs list. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>