summaryrefslogtreecommitdiff
path: root/fs/btrfs/disk-io.c
AgeCommit message (Collapse)Author
2016-01-19Btrfs: fix deadlock running delayed iputs at transaction commit timeFilipe Manana
While running a stress test I ran into a deadlock when running the delayed iputs at transaction time, which produced the following report and trace: [ 886.399989] ============================================= [ 886.400871] [ INFO: possible recursive locking detected ] [ 886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted [ 886.402384] --------------------------------------------- [ 886.403182] fio/8277 is trying to acquire lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] but task is already holding lock: [ 886.403568] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] other info that might help us debug this: [ 886.403568] Possible unsafe locking scenario: [ 886.403568] [ 886.403568] CPU0 [ 886.403568] ---- [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] lock(&fs_info->delayed_iput_sem); [ 886.403568] [ 886.403568] *** DEADLOCK *** [ 886.403568] [ 886.403568] May be due to missing lock nesting notation [ 886.403568] [ 886.403568] 3 locks held by fio/8277: [ 886.403568] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0 [ 886.403568] #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs] [ 886.403568] #2: (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.403568] [ 886.403568] stack backtrace: [ 886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 886.403568] 0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0 [ 886.403568] ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290 [ 886.403568] 0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804 [ 886.403568] Call Trace: [ 886.403568] [<ffffffff8125d4fd>] dump_stack+0x4e/0x79 [ 886.403568] [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b [ 886.403568] [<ffffffff810c22db>] ? __module_address+0xdf/0x108 [ 886.403568] [<ffffffff8108eb77>] lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194 [ 886.403568] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffff8148556b>] down_read+0x3e/0x4d [ 886.489542] [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs] [ 886.489542] [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs] [ 886.489542] [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs] [ 886.489542] [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs] [ 886.489542] [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs] [ 886.489542] [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs] [ 886.489542] [<ffffffff81188e31>] evict+0xa7/0x15c [ 886.489542] [<ffffffff81189878>] iput+0x1d3/0x266 [ 886.489542] [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs] [ 886.489542] [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs] [ 886.489542] [<ffffffff81085096>] ? signal_pending_state+0x31/0x31 [ 886.489542] [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs] [ 886.489542] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 886.489542] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 886.489542] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 886.489542] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 886.489542] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 886.489542] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 886.489542] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 886.489542] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 886.489542] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 886.489542] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds. [ 1081.854348] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.863227] fio D ffff880213f9bb28 0 8244 8240 0x00000000 [ 1081.868719] ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240 [ 1081.872499] ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400 [ 1081.876834] ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4 [ 1081.880782] Call Trace: [ 1081.881793] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.883340] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.895525] [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab [ 1081.897419] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1081.899251] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1081.901063] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1081.902365] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1081.903846] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.906078] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1081.908846] [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c [ 1081.910409] [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs] [ 1081.912482] [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs] [ 1081.914597] [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs] [ 1081.919037] [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128 [ 1081.920754] [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs] [ 1081.922496] [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50 [ 1081.923922] [<ffffffff8117279e>] __vfs_write+0x7c/0xa5 [ 1081.925275] [<ffffffff81172cda>] vfs_write+0xa0/0xe4 [ 1081.926584] [<ffffffff811734cc>] SyS_write+0x50/0x7e [ 1081.927968] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 1081.985293] INFO: lockdep is turned off. [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds. [ 1081.987434] Not tainted 4.4.0-rc6-btrfs-next-18+ #1 [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1081.990147] fio D ffff880218febbb8 0 8249 8240 0x00000000 [ 1081.991626] ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240 [ 1081.993258] ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00 [ 1081.994850] ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4 [ 1081.996485] Call Trace: [ 1081.997037] [<ffffffff81482ba4>] schedule+0x7f/0x97 [ 1081.998017] [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325 [ 1081.999241] [<ffffffff810852a5>] ? finish_wait+0x6d/0x76 [ 1082.000306] [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20 [ 1082.001533] [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20 [ 1082.002776] [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21 [ 1082.003995] [<ffffffff814855bd>] down_write+0x43/0x57 [ 1082.005000] [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.007403] [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs] [ 1082.008988] [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs] [ 1082.010193] [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77 [ 1082.011280] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.012265] [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0 [ 1082.013021] [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff [ 1082.013738] [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b [ 1082.014778] [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea [ 1082.015778] [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e [ 1082.016806] [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71 [ 1082.017789] [<ffffffff8118240e>] SyS_ioctl+0x57/0x79 [ 1082.018706] [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f This happens because we can recursively acquire the semaphore fs_info->delayed_iput_sem when attempting to allocate space to satisfy a file write request as shown in the first trace above - when committing a transaction we acquire (down_read) the semaphore before running the delayed iputs, and when running a delayed iput() we can end up calling an inode's eviction handler, which in turn commits another transaction and attempts to acquire (down_read) again the semaphore to run more delayed iput operations. This results in a deadlock because if a task acquires multiple times a semaphore it should invoke down_read_nested() with a different lockdep class for each level of recursion. Fix this by simplifying the implementation and use a mutex instead that is acquired by the cleaner kthread before it runs the delayed iputs instead of always acquiring a semaphore before delayed references are run from anywhere. Fixes: d7c151717a1e (btrfs: Fix NO_SPACE bug caused by delayed-iput) Cc: stable@vger.kernel.org # 4.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-01-19Merge branch 'misc-for-4.5' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2016-01-18Merge branch 'for-linus-4.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has our usual assortment of fixes and cleanups, but the biggest change included is Omar Sandoval's free space tree. It's not the default yet, mounting -o space_cache=v2 enables it and sets a readonly compat bit. The tree can actually be deleted and regenerated if there are any problems, but it has held up really well in testing so far. For very large filesystems (30T+) our existing free space caching code can end up taking a huge amount of time during commits. The new tree based code is faster and less work overall to update as the commit progresses. Omar worked on this during the summer and we'll hammer on it in production here at FB over the next few months" * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (73 commits) Btrfs: fix fitrim discarding device area reserved for boot loader's use Btrfs: Check metadata redundancy on balance btrfs: statfs: report zero available if metadata are exhausted btrfs: preallocate path for snapshot creation at ioctl time btrfs: allocate root item at snapshot ioctl time btrfs: do an allocation earlier during snapshot creation btrfs: use smaller type for btrfs_path locks btrfs: use smaller type for btrfs_path lowest_level btrfs: use smaller type for btrfs_path reada btrfs: cleanup, use enum values for btrfs_path reada btrfs: constify static arrays btrfs: constify remaining structs with function pointers btrfs tests: replace whole ops structure for free space tests btrfs: use list_for_each_entry* in backref.c btrfs: use list_for_each_entry_safe in free-space-cache.c btrfs: use list_for_each_entry* in check-integrity.c Btrfs: use linux/sizes.h to represent constants btrfs: cleanup, remove stray return statements btrfs: zero out delayed node upon allocation btrfs: pass proper enum type to start_transaction() ...
2016-01-15Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and ↵Chandan Rajendra
subvolume roots The following call trace is seen when btrfs/031 test is executed in a loop, [ 158.661848] ------------[ cut here ]------------ [ 158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea() [ 158.664102] BTRFS: Transaction aborted (error -2) [ 158.664774] Modules linked in: [ 158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted 4.4.0-rc6-g511711a #2 [ 158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 158.667392] ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930 [ 158.668515] ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000 [ 158.669647] ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980 [ 158.670769] Call Trace: [ 158.671153] [<ffffffff81431fc8>] dump_stack+0x44/0x5c [ 158.671884] [<ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0 [ 158.672769] [<ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50 [ 158.673620] [<ffffffff813bc98d>] create_subvol+0x3d1/0x6ea [ 158.674440] [<ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520 [ 158.675376] [<ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50 [ 158.676235] [<ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180 [ 158.677268] [<ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70 [ 158.678183] [<ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90 [ 158.678975] [<ffffffff81144b8e>] ? vma_merge+0xee/0x300 [ 158.679751] [<ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170 [ 158.680599] [<ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70 [ 158.681686] [<ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0 [ 158.682581] [<ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490 [ 158.683399] [<ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60 [ 158.684297] [<ffffffff8117b9d4>] SyS_ioctl+0x74/0x80 [ 158.685051] [<ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a [ 158.685958] ---[ end trace 4b63312de5a2cb76 ]--- [ 158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry [ 158.709508] BTRFS info (device loop0): forced readonly [ 158.737113] BTRFS info (device loop0): disk space caching is enabled [ 158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed [ 158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30 This occurs because, Mount filesystem Create subvol with ID 257 Unmount filesystem Mount filesystem Delete subvol with ID 257 btrfs_drop_snapshot() Add root corresponding to subvol 257 into btrfs_transaction->dropped_roots list Create new subvol (i.e. create_subvol()) 257 is returned as the next free objectid btrfs_read_fs_root_no_name() Finds the btrfs_root instance corresponding to the old subvol with ID 257 in btrfs_fs_info->fs_roots_radix. Returns error since btrfs_root_item->refs has the value of 0. To fix the issue the commit initializes tree root's and subvolume root's highest_objectid when loading the roots from disk. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-11Merge branch 'misc-cleanups-4.5' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>
2016-01-11Merge branch 'misc-for-4.5' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2016-01-07Btrfs: use linux/sizes.h to represent constantsByongho Lee
We use many constants to represent size and offset value. And to make code readable we use '256 * 1024 * 1024' instead of '268435456' to represent '256MB'. However we can make far more readable with 'SZ_256MB' which is defined in the 'linux/sizes.h'. So this patch replaces 'xxx * 1024 * 1024' kind of expression with single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is not a power of 2. And I haven't touched to '4096' & '8192' because it's more intuitive than 'SZ_4KB' & 'SZ_8KB'. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: cleanup, remove stray return statementsDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07Btrfs: add missing brelse when superblock checksum failsAnand Jain
Looks like oversight, call brelse() when checksum fails. Further down the code, in the non error path, we do call brelse() and so we don't see brelse() in the goto error paths. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-30btrfs: don't run delayed references while we are creating the free space treeChris Mason
This is a short term solution to make sure btrfs_run_delayed_refs() doesn't change the extent tree while we are scanning it to create the free space tree. Longer term we need to synchronize scanning the block groups one by one, similar to what happens during a balance. Signed-off-by: Chris Mason <clm@fb.com>
2015-12-23Merge branch 'freespace-4.5' into for-linus-4.5Chris Mason
2015-12-23Merge branch 'dev/simplify-set-bit' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>
2015-12-19x86/cpufeature: Remove unused and seldomly used cpu_has_xx macrosBorislav Petkov
Those are stupid and code should use static_cpu_has_safe() or boot_cpu_has() instead. Kill the least used and unused ones. The remaining ones need more careful inspection before a conversion can happen. On the TODO. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1449481182-27541-4-git-send-email-bp@alien8.de Cc: David Sterba <dsterba@suse.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Matt Mackall <mpm@selenic.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-18Merge branch 'freespace-tree' into for-linus-4.5Chris Mason
Signed-off-by: Chris Mason <clm@fb.com>
2015-12-17Btrfs: add free space tree mount optionOmar Sandoval
Now we can finally hook up everything so we can actually use free space tree. The free space tree is enabled by passing the space_cache=v2 mount option. On the first mount with the this option set, the free space tree will be created and the FREE_SPACE_TREE read-only compat bit will be set. Any time the filesystem is mounted from then on, we must use the free space tree. The clear_cache option will also clear the free space tree. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-12-07btrfs: remove a trivial helper btrfs_set_buffer_uptodateDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03btrfs: drop unused parameter from lock_extent_bitsDavid Sterba
We've always passed 0. Stack usage will slightly decrease. Signed-off-by: David Sterba <dsterba@suse.com>
2015-11-13Merge branch 'for-linus-4.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes and cleanups from Chris Mason: "Some of this got cherry-picked from a github repo this week, but I verified the patches. We have three small scrub cleanups and a collection of fixes" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: Use fs_info directly in btrfs_delete_unused_bgs btrfs: Fix lost-data-profile caused by balance bg btrfs: Fix lost-data-profile caused by auto removing bg btrfs: Remove len argument from scrub_find_csum btrfs: Reduce unnecessary arguments in scrub_recheck_block btrfs: Use scrub_checksum_data and scrub_checksum_tree_block for scrub_recheck_block_checksum btrfs: Reset sblock->xxx_error stats before calling scrub_recheck_block_checksum btrfs: scrub: setup all fields for sblock_to_check btrfs: scrub: set error stats when tree block spanning stripes Btrfs: fix race when listing an inode's xattrs Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow Btrfs: fix race leading to incorrect item deletion when dropping extents Btrfs: fix sleeping inside atomic context in qgroup rescan worker Btrfs: fix race waiting for qgroup rescan worker btrfs: qgroup: exit the rescan worker during umount Btrfs: fix extent accounting for partial direct IO writes
2015-11-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge second patch-bomb from Andrew Morton: - most of the rest of MM - procfs - lib/ updates - printk updates - bitops infrastructure tweaks - checkpatch updates - nilfs2 update - signals - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc, dma-debug, dma-mapping, ... * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits) ipc,msg: drop dst nil validation in copy_msg include/linux/zutil.h: fix usage example of zlib_adler32() panic: release stale console lock to always get the logbuf printed out dma-debug: check nents in dma_sync_sg* dma-mapping: tidy up dma_parms default handling pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode kexec: use file name as the output message prefix fs, seqfile: always allow oom killer seq_file: reuse string_escape_str() fs/seq_file: use seq_* helpers in seq_hex_dump() coredump: change zap_threads() and zap_process() to use for_each_thread() coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT) signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread() signal: turn dequeue_signal_lock() into kernel_dequeue_signal() signals: kill block_all_signals() and unblock_all_signals() nilfs2: fix gcc uninitialized-variable warnings in powerpc build nilfs2: fix gcc unused-but-set-variable warnings MAINTAINERS: nilfs2: add header file for tracing nilfs2: add tracepoints for analyzing reading and writing metadata files ...
2015-11-06mm, page_alloc: distinguish between being unable to sleep, unwilling to ↵Mel Gorman
sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05btrfs: qgroup: exit the rescan worker during umountJustin Maggard
I was hitting a consistent NULL pointer dereference during shutdown that showed the trace running through end_workqueue_bio(). I traced it back to the endio_meta_workers workqueue being poked after it had already been destroyed. Eventually I found that the root cause was a qgroup rescan that was still in progress while we were stopping all the btrfs workers. Currently we explicitly pause balance and scrub operations in close_ctree(), but we do nothing to stop the qgroup rescan. We should probably be doing the same for qgroup rescan, but that's a much larger change. This small change is good enough to allow me to unmount without crashing. Signed-off-by: Justin Maggard <jmaggard@netgear.com> Reviewed-by: Filipe Manana <fdmanana@suse.com>
2015-10-26btrfs: clear PF_NOFREEZE in cleaner_kthread()Jiri Kosina
cleaner_kthread() kthread calls try_to_freeze() at the beginning of every cleanup attempt. This operation can't ever succeed though, as the kthread hasn't marked itself as freezable. Before (hopefully eventually) kthread freezing gets converted to fileystem freezing, we'd rather mark cleaner_kthread() freezable (as my understanding is that it can generate filesystem I/O during suspend). Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21Merge branch 'allocator-fixes' into for-linus-4.4Chris Mason
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21Btrfs: change how we wait for pending ordered extentsJosef Bacik
We have a mechanism to make sure we don't lose updates for ordered extents that were logged in the transaction that is currently running. We add the ordered extent to a transaction list and then the transaction waits on all the ordered extents in that list. However are substantially large file systems this list can be extremely large, and can give us soft lockups, since the ordered extents don't remove themselves from the list when they do complete. To fix this we simply add a counter to the transaction that is incremented any time we have a logged extent that needs to be completed in the current transaction. Then when the ordered extent finally completes it decrements the per transaction counter and wakes up the transaction if we are the last ones. This will eliminate the softlockup. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21btrfs: qgroup: Introduce new functions to reserve/free metadataQu Wenruo
Introduce new functions btrfs_qgroup_reserve/free_meta() to reserve/free metadata reserved space. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21Merge branch 'cleanups/for-4.4' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
2015-10-21btrfs: use btrfs_raid_array for btrfs_get_num_tolerated_disk_barrier_failures()Zhao Lei
btrfs_raid_array[] is used to define all raid attributes, use it to get tolerated_failures in btrfs_get_num_tolerated_disk_barrier_failures(), instead of complex condition in function. It can make code simple and auto-support other possible raid-type in future. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-12Merge branch 'fix/waitqueue-barriers' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
2015-10-12Merge branch 'anand/sysfs-updates-v4.3-rc3' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 Signed-off-by: Chris Mason <clm@fb.com>
2015-10-12Merge branch 'cleanup/messages' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
2015-10-10btrfs: comment the rest of implicit barriers before waitqueue_activeDavid Sterba
There are atomic operations that imply the barrier for waitqueue_active mixed in an if-condition. Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-09Merge branch 'for-linus-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are small and assorted. Neil's is the oldest, I dropped the ball thinking he was going to send it in" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: support NFSv2 export Btrfs: open_ctree: Fix possible memory leak Btrfs: fix deadlock when finalizing block group creation Btrfs: update fix for read corruption of compressed and shared extents Btrfs: send, fix corner case for reference overwrite detection
2015-10-08btrfs: switch more printks to our helpersDavid Sterba
Convert the simple cases, not all functions provide a way to reach the fs_info. Also skipped debugging messages (print-tree, integrity checker and pr_debug) and messages that are printed from possibly unfinished mount. Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08btrfs: switch message printers to ratelimited variantsDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08btrfs: switch message printers to ratelimited _in_rcu variantsDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-06Btrfs: open_ctree: Fix possible memory leakchandan
After reading one of chunk or tree root tree's root node from disk, if the root node does not have EXTENT_BUFFER_UPTODATE flag set, we fail to release the memory used by the root node. Fix this. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
2015-10-01Btrfs: add btrfs_read_dev_one_super() to read one specific SBAnand Jain
This uses a chunk of code from btrfs_read_dev_super() and creates a function called btrfs_read_dev_one_super() so that next patch can use it for scratch superblock. Signed-off-by: Anand Jain <anand.jain@oracle.com> [renamed bufhead to bh] Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29Btrfs: consolidate btrfs_error() to btrfs_std_error()Anand Jain
btrfs_error() and btrfs_std_error() does the same thing and calls _btrfs_std_error(), so consolidate them together. And the main motivation is that btrfs_error() is closely named with btrfs_err(), one handles error action the other is to log the error, so don't closely name them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Suggested-by: David Sterba <dsterba@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29Btrfs: SB read failure should return EIO for __bread failureAnand Jain
This will return EIO when __bread() fails to read SB, instead of EINVAL. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29Btrfs: rename btrfs_sysfs_remove_one to btrfs_sysfs_remove_mountedAnand Jain
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29Btrfs: rename btrfs_sysfs_add_one to btrfs_sysfs_add_mountedAnand Jain
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-25Merge branch 'for-linus-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is an assorted set I've been queuing up: Jeff Mahoney tracked down a tricky one where we ended up starting IO on the wrong mapping for special files in btrfs_evict_inode. A few people reported this one on the list. Filipe found (and provided a test for) a difficult bug in reading compressed extents, and Josef fixed up some quota record keeping with snapshot deletion. Chandan killed off an accounting bug during DIO that lead to WARN_ONs as we freed inodes" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: keep dropped roots in cache until transaction commit Btrfs: Direct I/O: Fix space accounting btrfs: skip waiting on ordered range for special files Btrfs: fix read corruption of compressed and shared extents Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock Btrfs: don't initialize a space info as full to prevent ENOSPC
2015-09-11Merge branch 'for-linus-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs cleanups and fixes from Chris Mason: "These are small cleanups, and also some fixes for our async worker thread initialization. I was having some trouble testing these, but it ended up being a combination of changing around my test servers and a shiny new schedule while atomic from the new start/finish_plug in writeback_sb_inodes(). That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't changing a bunch of things in my test setup at once it would have been really clear. Fix for writeback_sb_inodes() on the way as well" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called btrfs: async_thread: Fix workqueue 'max_active' value when initializing btrfs: Add raid56 support for updating num_tolerated_disk_barrier_failures in btrfs_balance btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk btrfs: Update out-of-date "skip parity stripe" comment
2015-09-10Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlockFilipe Manana
After commmit e44163e17796 ("btrfs: explictly delete unused block groups in close_ctree and ro-remount"), added in the 4.3 merge window, we have calls to btrfs_delete_unused_bgs() while holding the cleaner_mutex. This can cause a deadlock with a concurrent block group relocation (when a filesystem balance or shrink operation is in progress for example) because btrfs_delete_unused_bgs() locks delete_unused_bgs_mutex and the relocation path locks first delete_unused_bgs_mutex and then it locks cleaner_mutex, resulting in a classic ABBA deadlock: CPU 0 CPU 1 lock fs_info->cleaner_mutex __btrfs_balance() || btrfs_shrink_device() lock fs_info->delete_unused_bgs_mutex btrfs_relocate_chunk() btrfs_relocate_block_group() lock fs_info->cleaner_mutex btrfs_delete_unused_bgs() lock fs_info->delete_unused_bgs_mutex Fix this by not taking the cleaner_mutex before calling btrfs_delete_unused_bgs() because it's no longer needed after commit 67c5e7d464bc ("Btrfs: fix race between balance and unused block group deletion"). The mutex fs_info->delete_unused_bgs_mutex, the spinlock fs_info->unused_bgs_lock and a block group's spinlock are enough to get correct serialization between tasks running relocation and unused block group deletion (as well as between multiple tasks concurrently calling btrfs_delete_unused_bgs()). This issue was discussed (in the mailing list) during the review of the patch titled "btrfs: explictly delete unused block groups in close_ctree and ro-remount" and it was agreed that acquiring the cleaner mutex had to be dropped after the patch titled "Btrfs: fix race between balance and unused block group deletion" got merged (both patches were submitted at about the same time, but one landed in kernel 4.2 and the other in the 4.3 merge window). Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-09-05Merge branch 'for-linus-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has Jeff Mahoney's long standing trim patch that fixes corners where trims were missing. Omar has some raid5/6 fixes, especially for using scrub and device replace when devices are missing. Zhao Lie continues cleaning and fixing things, this series fixes some really hard to hit corners in xfstests. I had to pull it last merge window due to some deadlocks, but those are now resolved. I added support for Tejun's new blkio controllers. It seems to work well for single devices, we'll expand to multi-device as well" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits) btrfs: fix compile when block cgroups are not enabled Btrfs: fix file read corruption after extent cloning and fsync Btrfs: check if previous transaction aborted to avoid fs corruption btrfs: use __GFP_NOFAIL in alloc_btrfs_bio btrfs: Prevent from early transaction abort btrfs: Remove unused arguments in tree-log.c btrfs: Remove useless condition in start_log_trans() Btrfs: add support for blkio controllers Btrfs: remove unused mutex from struct 'btrfs_fs_info' Btrfs: fix parity scrub of RAID 5/6 with missing device Btrfs: fix device replace of a missing RAID 5/6 device Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation Btrfs: count devices correctly in readahead during RAID 5/6 replace Btrfs: remove misleading handling of missing device scrub btrfs: fix clone / extent-same deadlocks Btrfs: fix defrag to merge tail file extent Btrfs: fix warning in backref walking btrfs: Add WARN_ON() for double lock in btrfs_tree_lock() btrfs: Remove root argument in extent_data_ref_count() btrfs: Fix wrong comment of btrfs_alloc_tree_block() ...
2015-09-02Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...
2015-08-31btrfs: Add raid56 support for updatingZhao Lei
num_tolerated_disk_barrier_failures in btrfs_balance Code for updating fs_info->num_tolerated_disk_barrier_failures in btrfs_balance() lacks raid56 support. Reason: Above code was wroten in 2012-08-01, together with btrfs_calc_num_tolerated_disk_barrier_failures()'s first version. Then, btrfs_calc_num_tolerated_disk_barrier_failures() got updated later to support raid56, but code in btrfs_balance() was not updated together. Fix: Merge above similar code to a common function: btrfs_get_num_tolerated_disk_barrier_failures() and make it support both case. It can fix this bug with a bonus of cleanup, and make these code never in above no-sync state from now on. Suggested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failuresZhao Lei
1: Use ARRAY_SIZE(types) to replace a static-value variant: int num_types = 4; 2: Use 'continue' on condition to reduce one level tab if (!XXX) { code; ... } -> if (XXX) continue; code; ... 3: Put setting 'num_tolerated_disk_barrier_failures = 2' to (num_tolerated_disk_barrier_failures > 2) condition to make make logic neat. if (num_tolerated_disk_barrier_failures > 0 && XXX) num_tolerated_disk_barrier_failures = 0; else if (num_tolerated_disk_barrier_failures > 1) { if (XXX) num_tolerated_disk_barrier_failures = 1; else if (XXX) num_tolerated_disk_barrier_failures = 2; -> if (num_tolerated_disk_barrier_failures > 0 && XXX) num_tolerated_disk_barrier_failures = 0; if (num_tolerated_disk_barrier_failures > 1 && XXX) num_tolerated_disk_barrier_failures = ; if (num_tolerated_disk_barrier_failures > 2 && XXX) num_tolerated_disk_barrier_failures = 2; 4: Remove comment of: num_mirrors - 1: if RAID1 or RAID10 is configured and more than 2 mirrors are used. which is not fit with code. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Merge branch 'jeffm-discard-4.3' into for-linus-4.3Chris Mason
2015-08-09Btrfs: add support for blkio controllersChris Mason
This attaches accounting information to bios as we submit them so the new blkio controllers can throttle on btrfs filesystems. Not much is required, we're just associating bios with blkcgs during clone, calling wbc_init_bio()/wbc_account_io() during writepages submission, and attaching the bios to the current context during direct IO. Finally if we are splitting bios during btrfs_map_bio, this attaches accounting information to the split. The end result is able to throttle nicely on single disk filesystems. A little more work is required for multi-device filesystems. Signed-off-by: Chris Mason <clm@fb.com>