summaryrefslogtreecommitdiff
path: root/fs/btrfs/inode.c
AgeCommit message (Collapse)Author
2017-10-30btrfs: pass root to various extent ref mod functionsJosef Bacik
We need the actual root for the ref verifier tool to work, so change these functions to pass the root around instead. This will be used in a subsequent patch. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30Btrfs: remove nr_async_submits and async_submit_drainingLiu Bo
Now that we have the combo of flushing twice, which can make sure IO have started since the second flush will wait for page lock which won't be unlocked unless setting page writeback and queuing ordered extents, we don't need %async_submit_draining, %async_delalloc_pages and %nr_async_submits to tell whether the IO has actually started. Moreover, all the flushers in use are followed by functions that wait for ordered extents to complete, so %nr_async_submits, which tracks whether bio's async submit has made progress, doesn't really make sense. However, %async_delalloc_pages is still required by shrink_delalloc() as that function doesn't flush twice in the normal case (just issues a writeback with WB_REASON_FS_FREE_SPACE). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30Btrfs: compress_file_range remove dead variable num_bytesTimofey Titovets
Remove dead assigment of num_bytes. Also as num_bytes only used in the will_compress block as copy of total_in just replace that with total_in and drop num_bytes entirely. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30btrfs: Fix bool initialization/comparisonThomas Meyer
Bool initializations should use true and false. Bool tests don't need comparisons. Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30Btrfs: cleanup 'start' subtraction from try uncompressed inline extentTimofey Titovets
Was added in: c8b978188c9a0fd3d535c13debd19d522b726f1f "Btrfs: Add zlib compression support" Survive to near time (from 08.10.2008). Because 'start' checked for zero before branch, so it's safe to remove that subtraction. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30btrfs: Remove unused parameter from check_direct_IONikolay Borisov
Introduced by 5a5f79b57069 ("Btrfs: allow unaligned DIO") and never used. The buffered fallback from unaligned DIO works as expected. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30Btrfs: do not async submit for nodatasum inodesLiu Bo
While we submit direct writes, if the inode is flagged with nodatasum, there's no benefit to submit asynchronously, because a) we don't have to calculate checksum across processors, b) and direct IO has started a plug, but async submit makes us queue IO on each device's scheduled IO list instead of DIO's plug list, so that IOs get much less merges in general. Lets use sync submit for nodatasum inodes. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-29Merge branch 'for-4.14-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We've collected a bunch of isolated fixes, for crashes, user-visible behaviour or missing bits from other subsystem cleanups from the past. The overall number is not small but I was not able to make it significantly smaller. Most of the patches are supposed to go to stable" * 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: log csums for all modified extents Btrfs: fix unexpected result when dio reading corrupted blocks btrfs: Report error on removing qgroup if del_qgroup_item fails Btrfs: skip checksum when reading compressed data if some IO have failed Btrfs: fix kernel oops while reading compressed data Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block Btrfs: do not backup tree roots when fsync btrfs: remove BTRFS_FS_QUOTA_DISABLING flag btrfs: propagate error to btrfs_cmp_data_prepare caller btrfs: prevent to set invalid default subvolid Btrfs: send: fix error number for unknown inode types btrfs: fix NULL pointer dereference from free_reloc_roots() btrfs: finish ordered extent cleaning if no progress is found btrfs: clear ordered flag on cleaning up ordered extents Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO Btrfs: do not reset bio->bi_ops while writing bio Btrfs: use the new helper wbc_to_write_flags
2017-09-26Btrfs: fix unexpected result when dio reading corrupted blocksLiu Bo
commit 4246a0b63bd8 ("block: add a bi_error field to struct bio") changed the logic of how dio read endio reports errors. For single stripe dio read, %bio->bi_status reflects the error before verifying checksum, and now we're updating it when data block matches with its checksum, while in the mismatching case, %bio->bi_status is not updated to relfect that. When some blocks in a file have been corrupted on disk, reading such a file ends up with 1) checksum errors are reported in kernel log 2) read(2) returns successfully with some content being 0x01. In order to fix it, we need to report its checksum mismatch error to the upper layer (dio layer in this case) as well. Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio") Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reported-by: Goffredo Baroncelli <kreijack@inwind.it> Tested-by: Goffredo Baroncelli <kreijack@inwind.it> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26btrfs: finish ordered extent cleaning if no progress is foundNaohiro Aota
__endio_write_update_ordered() repeats the search until it reaches the end of the specified range. This works well with direct IO path, because before the function is called, it's ensured that there are ordered extents filling whole the range. It's not the case, however, when it's called from run_delalloc_range(): it is possible to have error in the midle of the loop in e.g. run_delalloc_nocow(), so that there exisits the range not covered by any ordered extents. By cleaning such "uncomplete" range, __endio_write_update_ordered() stucks at offset where there're no ordered extents. Since the ordered extents are created from head to tail, we can stop the search if there are no offset progress. Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang") Cc: <stable@vger.kernel.org> # 4.12 Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26btrfs: clear ordered flag on cleaning up ordered extentsNaohiro Aota
Commit 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang") introduced btrfs_cleanup_ordered_extents() to cleanup submitted ordered extents. However, it does not clear the ordered bit (Private2) of corresponding pages. Thus, the following BUG occurs from free_pages_check_bad() (on btrfs/125 with nospace_cache). BUG: Bad page state in process btrfs pfn:3fa787 page:ffffdf2acfe9e1c0 count:0 mapcount:0 mapping: (null) index:0xd flags: 0x8000000000002008(uptodate|private_2) raw: 8000000000002008 0000000000000000 000000000000000d 00000000ffffffff raw: ffffdf2acf5c1b20 ffffb443802238b0 0000000000000000 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: 0x2000(private_2) This patch clears the flag same as other places calling btrfs_dec_test_ordered_pending() for every page in the specified range. Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang") Cc: <stable@vger.kernel.org> # 4.12 Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-14Merge branch 'work.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount flag updates from Al Viro: "Another chunk of fmount preparations from dhowells; only trivial conflicts for that part. It separates MS_... bits (very grotty mount(2) ABI) from the struct super_block ->s_flags (kernel-internal, only a small subset of MS_... stuff). This does *not* convert the filesystems to new constants; only the infrastructure is done here. The next step in that series is where the conflicts would be; that's the conversion of filesystems. It's purely mechanical and it's better done after the merge, so if you could run something like list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$') sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \ -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \ -e 's/\<MS_NODEV\>/SB_NODEV/g' \ -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \ -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \ -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \ -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \ -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \ -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \ -e 's/\<MS_SILENT\>/SB_SILENT/g' \ -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \ -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \ -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \ -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \ $list and commit it with something along the lines of 'convert filesystems away from use of MS_... constants' as commit message, it would save a quite a bit of headache next cycle" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VFS: Differentiate mount flags (MS_*) from internal superblock flags VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-09Merge branch 'for-4.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The changes range through all types: cleanups, core chagnes, sanity checks, fixes, other user visible changes, detailed list below: - deprecated: user transaction ioctl - mount option ssd does not change allocation alignments - degraded read-write mount is allowed if all the raid profile constraints are met, now based on more accurate check - defrag: do not reset compression afterwards; the NOCOMPRESS flag can be now overriden by defrag - prep work for better extent reference tracking (related to the qgroup slowness with balance) - prep work for compression heuristics - memory allocation reductions (may help latencies on a loaded system) - better accounting for io waiting states - error handling improvements (removed BUGs) - added more sanity checks for shared refs - fix readdir vs pagefault deadlock under some circumstances - fix for 'no-hole' mode, certain combination of compressed and inline extents - send: fix emission of invalid clone operations - fixup file mode if setting acls fail - more fixes from fuzzing - oher cleanups" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits) btrfs: submit superblock io with REQ_META and REQ_PRIO btrfs: remove unnecessary memory barrier in btrfs_direct_IO btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent btrfs: pass fs_info to btrfs_del_root instead of tree_root Btrfs: add one more sanity check for shared ref type Btrfs: remove BUG_ON in __add_tree_block Btrfs: remove BUG() in add_data_reference Btrfs: remove BUG() in print_extent_item Btrfs: remove BUG() in btrfs_extent_inline_ref_size Btrfs: convert to use btrfs_get_extent_inline_ref_type Btrfs: add a helper to retrive extent inline ref type btrfs: scrub: simplify scrub worker initialization btrfs: scrub: clean up division in scrub_find_csum btrfs: scrub: clean up division in __scrub_mark_bitmap btrfs: scrub: use bool for flush_all_writes btrfs: preserve i_mode if __btrfs_set_acl() fails btrfs: Remove extraneous chunk_objectid variable btrfs: Remove chunk_objectid argument from btrfs_make_block_group btrfs: Remove extra parentheses from condition in copy_items() ...
2017-08-24Btrfs: fix blk_status_t/errno confusionOmar Sandoval
This fixes several instances of blk_status_t and bare errno ints being mixed up, some of which are real bugs. In the normal case, 0 matches BLK_STS_OK, so we don't observe any effects of the missing conversion, but in case of errors or passes through the repair/retry paths, the errors get mixed up. The changes were identified using 'sparse', we don't have reports of the buggy behaviour. Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-21btrfs: remove unnecessary memory barrier in btrfs_direct_IONikolay Borisov
Commit 38851cc19adb ("Btrfs: implement unlocked dio write") implemented unlocked dio write, allowing multiple dio writers to write to non-overlapping, and non-eof-extending regions. In doing so it also introduced a broken memory barrier. It is broken due to 2 things: 1. Memory barriers _MUST_ always be paired, this is clearly not the case here 2. Checkpatch actually produces a warning if a memory barrier is introduced that doesn't have a comment explaining how it's being paired. Specifically for inode::i_dio_count that's wrapped inside inode_dio_begin, there is no explicit barrier semantics attached, so removing is fine as the atomic is used in common the waiter/wakeup pattern. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ enhance changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18btrfs: Move skip checksum check from btrfs_submit_direct to ↵Nikolay Borisov
__btrfs_submit_dio_bio Currently the code checks whether we should do data checksumming in btrfs_submit_direct and the boolean result of this check is passed to btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which actually consumes it. The last function actually has all the necessary context to figure out whether to skip the check or not, so let's move the check closer to where it's being consumed. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-18Btrfs: avoid unnecessarily locking inode when clearing a rangeFilipe Manana
If the range being cleared was not marked for defrag and we are not about to clear the range from the defrag status, we don't need to lock and unlock the inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: fix readdir deadlock with pagefaultJosef Bacik
Readdir does dir_emit while under the btree lock. dir_emit can trigger the page fault which means we can deadlock. Fix this by allocating a buffer on opening a directory and copying the readdir into this buffer and doing dir_emit from outside of the tree lock. Thread A readdir <holding tree lock> dir_emit <page fault> down_read(mmap_sem) Thread B mmap write down_write(mmap_sem) page_mkwrite wait_ordered_extents Process C finish_ordered_extent insert_reserved_file_extent try to lock leaf <hang> Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy the deadlock scenario to changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: allow defrag compress to override NOCOMPRESS attributeDavid Sterba
Currently, the BTRFS_INODE_NOCOMPRESS will prevent any compression on a given file, except when the mount is force-compress. As users have reported on IRC, this will also prevent compression when requested by defrag (btrfs fi defrag -c file). The nocompress flag is set automatically by filesystem when the ratios are bad and the user would have to manually drop the bit in order to make defrag -c work. This is not good from the usability perspective. This patch will raise priority for the defrag -c over nocompress, ie. any file with NOCOMPRESS bit set will get defragmented. The bit will remain untouched. Alternate option was to also drop the nocompress bit and keep the decision logic as is, but I think this is not the right solution. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: separate defrag and property compressionDavid Sterba
Add new value for compression to distinguish between defrag and property. Previously, a single variable was used and this caused clashes when the per-file 'compression' was set and a defrag -c was called. The property-compression is loaded when the file is open, defrag will overwrite the same variable and reset to 0 (ie. NONE) at when the file defragmentaion is finished. That's considered a usability bug. Now we won't touch the property value, use the defrag-compression. The precedence of defrag is higher than for property (and whole-filesystem). Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: rename variable holding per-inode compression typeDavid Sterba
This is preparatory for separating inode compression requested by defrag and set via properties. This will fix a usability bug when defrag will reset compression type to NONE. If the file has compression set via property, it will not apply anymore (until next mount or reset through command line). We're going to fix that by adding another variable just for the defrag call and won't touch the property. The defrag will have higher priority when deciding whether to compress the data. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16Btrfs: add skeleton code for compression heuristicTimofey Titovets
Add skeleton code for compresison heuristics. Now it iterates over all the pages, but in the end always says "yes, compress please", ie it does not change the current behaviour. In the future we're going to add various heuristics to analyze the data. This patch can be used as a baseline for measuring if the effectivness and performance. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> [ enhanced changelog, modified comments ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: account that we're waiting for DIO readDavid Sterba
Correctly account for IO when waiting for a submitted DIO read, the case when we're retrying. This only for the accounting purposes and should not change other behaviour. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: fix spelling of snapshottingDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: cleanup types storing REQ_*David Sterba
Unify types of local variables and parameters that store various REQ_* values to unsigned int. Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: btrfs_inherit_iflags() can be staticAnand Jain
btrfs_new_inode() is the only consumer move it to inode.c, from ioctl.c. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16btrfs: drop newlines from strings when using btrfs_* helpersDavid Sterba
The helpers append "\n" so we can keep the actual strings shorter. The extra newline will print an empty line. Some messages have been slightly modified to be more consistent with the rest (lowercase first letter). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-16Btrfs: report errors when checksum is not foundLiu Bo
When btrfs fails the checksum check, it'll fill the whole page with "1". However, if %csum_expected is 0 (which means there is no checksum), then for some unknown reason, we just pretend that the read is correct, so userspace would be confused about the dilemma that read is successful but getting a page with all content being "1". This can happen due to a bug in btrfs-convert. This fixes it by always returning errors if checksum doesn't match. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-17VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)David Howells
Firstly by applying the following with coccinelle's spatch: @@ expression SB; @@ -SB->s_flags & MS_RDONLY +sb_rdonly(SB) to effect the conversion to sb_rdonly(sb), then by applying: @@ expression A, SB; @@ ( -(!sb_rdonly(SB)) && A +!sb_rdonly(SB) && A | -A != (sb_rdonly(SB)) +A != sb_rdonly(SB) | -A == (sb_rdonly(SB)) +A == sb_rdonly(SB) | -!(sb_rdonly(SB)) +!sb_rdonly(SB) | -A && (sb_rdonly(SB)) +A && sb_rdonly(SB) | -A || (sb_rdonly(SB)) +A || sb_rdonly(SB) | -(sb_rdonly(SB)) != A +sb_rdonly(SB) != A | -(sb_rdonly(SB)) == A +sb_rdonly(SB) == A | -(sb_rdonly(SB)) && A +sb_rdonly(SB) && A | -(sb_rdonly(SB)) || A +sb_rdonly(SB) || A ) @@ expression A, B, SB; @@ ( -(sb_rdonly(SB)) ? 1 : 0 +sb_rdonly(SB) | -(sb_rdonly(SB)) ? A : B +sb_rdonly(SB) ? A : B ) to remove left over excess bracketage and finally by applying: @@ expression A, SB; @@ ( -(A & MS_RDONLY) != sb_rdonly(SB) +(bool)(A & MS_RDONLY) != sb_rdonly(SB) | -(A & MS_RDONLY) == sb_rdonly(SB) +(bool)(A & MS_RDONLY) == sb_rdonly(SB) ) to make comparisons against the result of sb_rdonly() (which is a bool) work correctly. Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-14Merge branch 'for-4.13-part2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "We've identified and fixed a silent corruption (introduced by code in the first pull), a fixup after the blk_status_t merge and two fixes to incremental send that Filipe has been hunting for some time" * 'for-4.13-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix unexpected return value of bio_readpage_error btrfs: btrfs_create_repair_bio never fails, skip error handling btrfs: cloned bios must not be iterated by bio_for_each_segment_all Btrfs: fix write corruption due to bio cloning on raid5/6 Btrfs: incremental send, fix invalid memory access Btrfs: incremental send, fix invalid path for link commands
2017-07-14btrfs: btrfs_create_repair_bio never fails, skip error handlingDavid Sterba
As the function uses the non-failing bio allocation, we can remove error handling from the callers as well. Signed-off-by: David Sterba <dsterba@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14btrfs: cloned bios must not be iterated by bio_for_each_segment_allDavid Sterba
We've started using cloned bios more in 4.13, there are some specifics regarding the iteration. Filipe found [1] that the raid56 iterated a cloned bio using bio_for_each_segment_all, which is incorrect. The cloned bios have wrong bi_vcnt and this could lead to silent corruptions. This patch adds assertions to all remaining bio_for_each_segment_all cases. [1] https://patchwork.kernel.org/patch/9838535/ Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-06Merge branch 'for-4.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "These are the percpu changes for the v4.13-rc1 merge window. There are a couple visibility related changes - tracepoints and allocator stats through debugfs, along with __ro_after_init markings and a cosmetic rename in percpu_counter. Please note that the simple O(#elements_in_the_chunk) area allocator used by percpu allocator is again showing scalability issues, primarily with bpf allocating and freeing large number of counters. Dennis is working on the replacement allocator and the percpu allocator will be seeing increased churns in the coming cycles" * 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: fix static checker warnings in pcpu_destroy_chunk percpu: fix early calls for spinlock in pcpu_stats percpu: resolve err may not be initialized in pcpu_alloc percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch percpu: add tracepoint support for percpu memory percpu: expose statistics about percpu memory via debugfs percpu: migrate percpu data structures to internal header percpu: add missing lockdep_assert_held to func pcpu_free_area mark most percpu globals as __ro_after_init
2017-07-05Merge branch 'for-4.13-part1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The core updates improve error handling (mostly related to bios), with the usual incremental work on the GFP_NOFS (mis)use removal, refactoring or cleanups. Except the two top patches, all have been in for-next for an extensive amount of time. User visible changes: - statx support - quota override tunable - improved compression thresholds - obsoleted mount option alloc_start Core updates: - bio-related updates: - faster bio cloning - no allocation failures - preallocated flush bios - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates - prep work for btree_inode removal - dir-item validation - qgoup fixes and updates - cleanups: - removed unused struct members, unused code, refactoring - argument refactoring (fs_info/root, caller -> callee sink) - SEARCH_TREE ioctl docs" * 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits) btrfs: Remove false alert when fiemap range is smaller than on-disk extent btrfs: Don't clear SGID when inheriting ACLs btrfs: fix integer overflow in calc_reclaim_items_nr btrfs: scrub: fix target device intialization while setting up scrub context btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges btrfs: qgroup: Introduce extent changeset for qgroup reserve functions btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled btrfs: qgroup: Return actually freed bytes for qgroup release or free data btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function btrfs: qgroup: Add quick exit for non-fs extents Btrfs: rework delayed ref total_bytes_pinned accounting Btrfs: return old and new total ref mods when adding delayed refs Btrfs: always account pinned bytes when dropping a tree block ref Btrfs: update total_bytes_pinned when pinning down extents Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 btrfs: fix validation of XATTR_ITEM dir items btrfs: Verify dir_item in iterate_object_props btrfs: Check name_len before in btrfs_del_root_ref btrfs: Check name_len before reading btrfs_get_name ...
2017-06-29btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ↵Qu Wenruo
ranges [BUG] For the following case, btrfs can underflow qgroup reserved space at an error path: (Page size 4K, function name without "btrfs_" prefix) Task A | Task B ---------------------------------------------------------------------- Buffered_write [0, 2K) | |- check_data_free_space() | | |- qgroup_reserve_data() | | Range aligned to page | | range [0, 4K) <<< | | 4K bytes reserved <<< | |- copy pages to page cache | | Buffered_write [2K, 4K) | |- check_data_free_space() | | |- qgroup_reserved_data() | | Range alinged to page | | range [0, 4K) | | Already reserved by A <<< | | 0 bytes reserved <<< | |- delalloc_reserve_metadata() | | And it *FAILED* (Maybe EQUOTA) | |- free_reserved_data_space() |- qgroup_free_data() Range aligned to page range [0, 4K) Freeing 4K (Special thanks to Chandan for the detailed report and analyse) [CAUSE] Above Task B is freeing reserved data range [0, 4K) which is actually reserved by Task A. And at writeback time, page dirty by Task A will go through writeback routine, which will free 4K reserved data space at file extent insert time, causing the qgroup underflow. [FIX] For btrfs_qgroup_free_data(), add @reserved parameter to only free data ranges reserved by previous btrfs_qgroup_reserve_data(). So in above case, Task B will try to free 0 byte, so no underflow. Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29btrfs: qgroup: Introduce extent changeset for qgroup reserve functionsQu Wenruo
Introduce a new parameter, struct extent_changeset for btrfs_qgroup_reserved_data() and its callers. Such extent_changeset was used in btrfs_qgroup_reserve_data() to record which range it reserved in current reserve, so it can free it in error paths. The reason we need to export it to callers is, at buffered write error path, without knowing what exactly which range we reserved in current allocation, we can free space which is not reserved by us. This will lead to qgroup reserved space underflow. Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write ↵Qu Wenruo
and quotas being enabled [BUG] Under the following case, we can underflow qgroup reserved space. Task A | Task B --------------------------------------------------------------- Quota disabled | Buffered write | |- btrfs_check_data_free_space() | | *NO* qgroup space is reserved | | since quota is *DISABLED* | |- All pages are copied to page | cache | | Enable quota | Quota scan finished | | Sync_fs | |- run_delalloc_range | |- Write pages | |- btrfs_finish_ordered_io | |- insert_reserved_file_extent | |- btrfs_qgroup_release_data() | Since no qgroup space is reserved in Task A, we underflow qgroup reserved space This can be detected by fstest btrfs/104. [CAUSE] In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes size of qgroup reserved_space in all cases. And btrfs_qgroup_release_data() will check if quotas are enabled. However in the above case, the buffered write happens before quota is enabled, so we don't have the reserved space for that range. [FIX] In insert_reserved_file_extent(), we tell qgroup to release the acctual byte number it released. In the above case, since we don't have the reserved space, we tell qgroups to release 0 byte, so the problem can be fixed. And thanks to the @reserved parameter introduced by the qgroup rework, and previous patch to return released bytes, the fix can be as small as 10 lines. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> [ changelog updates ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-21btrfs: Check name_len with boundary in verify dir_itemSu Yue
Originally, verify_dir_item verifies name_len of dir_item with fixed values but not item boundary. If corrupted name_len was not bigger than the fixed value, for example 255, the function will think the dir_item is fine. And then reading beyond boundary will cause crash. Example: 1. Corrupt one dir_item name_len to be 255. 2. Run 'ls -lar /mnt/test/ > /dev/null' dmesg: [ 48.451449] BTRFS info (device vdb1): disk space caching is enabled [ 48.451453] BTRFS info (device vdb1): has skinny extents [ 48.489420] general protection fault: 0000 [#1] SMP [ 48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq [ 48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5 [ 48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 [ 48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000 [ 48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs] [ 48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202 [ 48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000 [ 48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368 [ 48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000 [ 48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288 [ 48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20 [ 48.490008] FS: 00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000 [ 48.490008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0 [ 48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 48.490008] Call Trace: [ 48.490008] btrfs_real_readdir+0x3b7/0x4a0 [btrfs] [ 48.490008] iterate_dir+0x181/0x1b0 [ 48.490008] SyS_getdents+0xa7/0x150 [ 48.490008] ? fillonedir+0x150/0x150 [ 48.490008] entry_SYSCALL_64_fastpath+0x18/0xad [ 48.490008] RIP: 0033:0x7fef5032546b [ 48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e [ 48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b [ 48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003 [ 48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000 [ 48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38 [ 48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e [ 48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98 [ 48.499455] ---[ end trace 321920d8e8339505 ]--- Fix it by adding a parameter @slot and check name_len with item boundary by calling btrfs_is_name_len_valid. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> rev Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-20percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batchNikolay Borisov
Currently, percpu_counter_add is a wrapper around __percpu_counter_add which is preempt safe due to explicit calls to preempt_disable. Given how __ prefix is used in percpu related interfaces, the naming unfortunately creates the false sense that __percpu_counter_add is less safe than percpu_counter_add. In terms of context-safety, they're equivalent. The only difference is that the __ version takes a batch parameter. Make this a bit more explicit by just renaming __percpu_counter_add to percpu_counter_add_batch. This patch doesn't cause any functional changes. tj: Minor updates to patch description for clarity. Cosmetic indentation updates. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@fb.com> Cc: linux-mm@kvack.org Cc: "David S. Miller" <davem@davemloft.net>
2017-06-20btrfs: nowait aio supportGoldwyn Rodrigues
Return EAGAIN if any of the following checks fail + i_rwsem is not lockable + NODATACOW or PREALLOC is not set + Cannot nocow at the desired location + Writing beyond end of file which is not allocated Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20btrfs: cleanup duplicate return value in insert_inline_extentDavid Sterba
The pattern when err is used for function exit and ret is used for return values of callees is not used here. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: compression must free at least one sector sizeTimofey Titovets
We already skip storing data where compression does not make the result at least one byte less. Let's make the logic better and check that compression frees at least one sector size of bytes, otherwise it's not that useful. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> [ changelog updated ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: skip checksum verification if IO error occursLiu Bo
Currently dio read also goes to verify checksum if -EIO has been returned, although it usually fails on checksum, it's not necessary at all, we could directly check if there is another copy to read. And with this, the behavior of dio read is now consistent with that of buffered read. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use bool for uptodate ] Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: tolerate errors if we have retried successfullyLiu Bo
With raid1 profile, dio read isn't tolerating IO errors if read length is less than the stripe length (64K). Our bio didn't get split in btrfs_submit_direct_hook() if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED) is true and that happens when the read length is less than 64k. In this case, if the underlying device returns error somehow, bio->bi_error has recorded that error. If we could recover the correct data from another copy in profile raid1/10/5/6, with btrfs_subio_endio_read() returning 0, bio would have the correct data in its vector, but bio->bi_error is not updated accordingly so that the following dio_end_io(dio_bio, bio->bi_error) makes directIO think this read has failed. This fixes the problem by setting bio's error to 0 if a good copy has been found. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: sink gfp parameter to btrfs_bio_cloneDavid Sterba
All callers pass GFP_NOFS. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: btrfs_bio_clone never fails, skip error handlingDavid Sterba
Update direct callers of btrfs_bio_clone that do error handling, that we can now remove. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: simplify code with bio_io_errorGuoqing Jiang
bio_io_error was introduced in the commit 4246a0b63bd8f56a1469b ("block: add a bi_error field to struct bio"), so use it to simplify code. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: use generic slab for for btrfs_transactionDavid Sterba
Observing the number of slab objects of btrfs_transaction, there's just one active on an almost quiescent filesystem, and the number of objects goes to about ten when sync is in progress. Then the nubmer goes down to 1. This matches the expectations of the transaction lifetime. For such use the separate slab cache is not justified, as we do not reuse objects frequently. For the shortlived transaction, the generic slab (size 512) should be ok. We can optimistically expect that the 512 slabs are not all used (fragmentation) and there are free slots to take when we do the allocation, compared to potentially allocating a whole new page for the separate slab. We'll lose the stats about the object use, which could be added later if we really need them. Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19Btrfs: add statx supportYonghong Song
Return enhanced file attributes from the btrfs, including: (1). inode creation time as stx_btime, and (2). Certain BTRFS_INODE_xxx flags are mapped to stx_attributes flags. Example output: [root@localhost ~]# cat t.sh touch t chattr +aic t ~/linux/samples/statx/test-statx t chattr -aic t touch t echo "========================================" ~/linux/samples/statx/test-statx t /bin/rm t [root@localhost ~]# ./t.sh statx(t) = 0 results=fff Size: 0 Blocks: 0 IO Block: 4096 regular file Device: 00:1c Inode: 63962 Links: 1 Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 Access: 2017-05-11 16:03:13.999856591-0700 Modify: 2017-05-11 16:03:13.999856591-0700 Change: 2017-05-11 16:03:14.000856663-0700 Birth: 2017-05-11 16:03:13.999856591-0700 Attributes: 0000000000000034 (........ ........ ........ ........ ........ ........ ........ .-ai.c..) ======================================== statx(t) = 0 results=fff Size: 0 Blocks: 0 IO Block: 4096 regular file Device: 00:1c Inode: 63962 Links: 1 Access: (0644/-rw-r--r--) Uid: 0 Gid: 0 Access: 2017-05-11 16:03:14.006857097-0700 Modify: 2017-05-11 16:03:14.006857097-0700 Change: 2017-05-11 16:03:14.006857097-0700 Birth: 2017-05-11 16:03:13.999856591-0700 Attributes: 0000000000000000 (........ ........ ........ ........ ........ ........ ........ .---.-..) [root@localhost ~]# Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19btrfs: cleanup root usage by btrfs_get_alloc_profileJeff Mahoney
There are two places where we don't already know what kind of alloc profile we need before calling btrfs_get_alloc_profile, but we need access to a root everywhere we call it. This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile() and relegates btrfs_system_alloc_profile to a static for use in those two cases. The next patch will eliminate one of those. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>