summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent_io.c
AgeCommit message (Collapse)Author
2012-04-18Btrfs: avoid possible use-after-free in clear_extent_bit()Li Zefan
clear_extent_bit() { next_node = rb_next(&state->rb_node); ... clear_state_bit(state); <-- this may free next_node if (next_node) { state = rb_entry(next_node); ... } } clear_state_bit() calls merge_state() which may free the next node of the passing extent_state, so clear_extent_bit() may end up referencing freed memory. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-18Btrfs: retrurn void from clear_state_bitLi Zefan
Currently it returns a set of bits that were cleared, but this return value is not used at all. Moreover it doesn't seem to be useful, because we may clear the bits of a few extent_states, but only the cleared bits of last one is returned. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-13Merge branch 'for-linus-min' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull the minimal btrfs branch from Chris Mason: "We have a use-after-free in there, along with errors when mount -o discard is enabled, and a BUG_ON(we should compile with UP more often)." * 'for-linus-min' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: use commit root when loading free space cache Btrfs: fix use-after-free in __btrfs_end_transaction Btrfs: check return value of bio_alloc() properly Btrfs: remove lock assert from get_restripe_target() Btrfs: fix eof while discarding extents Btrfs: fix uninit variable in repair_eb_io_failure Revert "Btrfs: increase the global block reserve estimates"
2012-04-12Btrfs: check return value of bio_alloc() properlyTsutomu Itoh
bio_alloc() has the possibility of returning NULL. So, it is necessary to check the return value. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-04-12Btrfs: fix uninit variable in repair_eb_io_failureChris Mason
We'd have to be passing bogus extent buffers for this uninit variable to actually be used, but set it to zero just in case. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-30Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes and features from Chris Mason: "We've merged in the error handling patches from SuSE. These are already shipping in the sles kernel, and they give btrfs the ability to abort transactions and go readonly on errors. It involves a lot of churn as they clarify BUG_ONs, and remove the ones we now properly deal with. Josef reworked the way our metadata interacts with the page cache. page->private now points to the btrfs extent_buffer object, which makes everything faster. He changed it so we write an whole extent buffer at a time instead of allowing individual pages to go down,, which will be important for the raid5/6 code (for the 3.5 merge window ;) Josef also made us more aggressive about dropping pages for metadata blocks that were freed due to COW. Overall, our metadata caching is much faster now. We've integrated my patch for metadata bigger than the page size. This allows metadata blocks up to 64KB in size. In practice 16K and 32K seem to work best. For workloads with lots of metadata, this cuts down the size of the extent allocation tree dramatically and fragments much less. Scrub was updated to support the larger block sizes, which ended up being a fairly large change (thanks Stefan Behrens). We also have an assortment of fixes and updates, especially to the balancing code (Ilya Dryomov), the back ref walker (Jan Schmidt) and the defragging code (Liu Bo)." Fixed up trivial conflicts in fs/btrfs/scrub.c that were just due to removal of the second argument to k[un]map_atomic() in commit 7ac687d9e047. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (75 commits) Btrfs: update the checks for mixed block groups with big metadata blocks Btrfs: update to the right index of defragment Btrfs: do not bother to defrag an extent if it is a big real extent Btrfs: add a check to decide if we should defrag the range Btrfs: fix recursive defragment with autodefrag option Btrfs: fix the mismatch of page->mapping Btrfs: fix race between direct io and autodefrag Btrfs: fix deadlock during allocating chunks Btrfs: show useful info in space reservation tracepoint Btrfs: don't use crc items bigger than 4KB Btrfs: flush out and clean up any block device pages during mount btrfs: disallow unequal data/metadata blocksize for mixed block groups Btrfs: enhance superblock sanity checks Btrfs: change scrub to support big blocks Btrfs: minor cleanup in scrub Btrfs: introduce common define for max number of mirrors Btrfs: fix infinite loop in btrfs_shrink_device() Btrfs: fix memory leak in resolver code Btrfs: allow dup for data chunks in mixed mode Btrfs: validate target profiles only if we are going to use them ...
2012-03-28Merge branch 'error-handling' into for-linusChris Mason
Conflicts: fs/btrfs/ctree.c fs/btrfs/disk-io.c fs/btrfs/extent-tree.c fs/btrfs/extent_io.c fs/btrfs/extent_io.h fs/btrfs/inode.c fs/btrfs/scrub.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26Btrfs: deal with read errors on extent buffers differentlyJosef Bacik
Since we need to read and write extent buffers in their entirety we can't use the normal bio_readpage_error stuff since it only works on a per page basis. So instead make it so that if we see an io error in endio we just mark the eb as having an IO error and then in btree_read_extent_buffer_pages we will manually try other mirrors and then overwrite the bad mirror if we find a good copy. This works with larger than page size blocks. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26Btrfs: loop waiting on writebackChris Mason
lock_extent_buffer_for_io needs to loop around and make sure the writeback bits are not set. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26Btrfs: ensure an entire eb is written at onceJosef Bacik
This patch simplifies how we track our extent buffers. Previously we could exit writepages with only having written half of an extent buffer, which meant we had to track the state of the pages and the state of the extent buffers differently. Now we only read in entire extent buffers and write out entire extent buffers, this allows us to simply set bits in our bflags to indicate the state of the eb and we no longer have to do things like track uptodate with our iotree. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26Btrfs: introduce mark_extent_buffer_accessedJosef Bacik
Because an eb can have multiple pages we need to make sure that all pages within the eb are markes as accessed, since releasepage can be called against any page in the eb. This will keep us from possibly evicting hot eb's when we're doing larger than pagesize eb's. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26Btrfs: introduce free_extent_buffer_staleJosef Bacik
Because btrfs cow's we can end up with extent buffers that are no longer necessary just sitting around in memory. So instead of evicting these pages, we could end up evicting things we actually care about. Thus we have free_extent_buffer_stale for use when we are freeing tree blocks. This will make it so that the ref for the eb being in the radix tree is dropped as soon as possible and then is freed when the refcount hits 0 instead of waiting to be released by releasepage. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26Btrfs: only use the existing eb if it's count isn't 0Josef Bacik
We can run into a problem where we find an eb for our existing page already on the radix tree but it has a ref count of 0. It hasn't yet been removed by RCU yet so this can cause issues where we will use the EB after free. So do atomic_inc_not_zero on the exists->refs and if it is zero just do synchronize_rcu() and try again. We won't have to worry about new allocators coming in since they will block on the page lock at this point. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26Btrfs: set page->private to the ebJosef Bacik
We spend a lot of time looking up extent buffers from pages when we could just store the pointer to the eb the page is associated with in page->private. This patch does just that, and it makes things a little simpler and reduces a bit of CPU overhead involved with doing metadata IO. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2012-03-26Btrfs: allow metadata blocks larger than the page sizeChris Mason
A few years ago the btrfs code to support blocks lager than the page size was disabled to fix a few corner cases in the page cache handling. This fixes the code to properly support large metadata blocks again. Since current kernels will crash early and often with larger metadata blocks, this adds an incompat bit so that older kernels can't mount it. This also does away with different blocksizes for nodes and leaves. You get a single block size for all tree blocks. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-22btrfs: replace many BUG_ONs with proper error handlingJeff Mahoney
btrfs currently handles most errors with BUG_ON. This patch is a work-in- progress but aims to handle most errors other than internal logic errors and ENOMEM more gracefully. This iteration prevents most crashes but can run into lockups with the page lock on occasion when the timing "works out." Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22btrfs: split extent_state opsJeff Mahoney
set_extent_bit can do exclusive locking but only when called by lock_extent*, Drop the exclusive bits argument except when called by lock_extent. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22btrfs: drop gfp_t from lock_extentJeff Mahoney
lock_extent and unlock_extent are always called with GFP_NOFS, drop the argument and use GFP_NOFS consistently. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22btrfs: return void in functions without error conditionsJeff Mahoney
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22btrfs: ->submit_bio_hook error push-upJeff Mahoney
This pushes failures from the submit_bio_hook callbacks, btrfs_submit_bio_hook and btree_submit_bio_hook into the callers, including callers of submit_one_bio where it catches the failures with BUG_ON. It also pushes up through the ->readpage_io_failed_hook to end_bio_extent_writepage where the error is already caught with BUG_ON. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22btrfs: Factor out tree->ops->merge_bio_hook callJeff Mahoney
In submit_extent_page, there's a visually noisy if statement that, in the midst of other conditions, does the tree dependency for tree->ops and tree->ops->merge_bio_hook before calling it, and then another condition afterwards. If an error is returned from merge_bio_hook, there's no way to catch it. It's considered a routine "1" return value instead of a failure. This patch factors out the dependency check into a new local merge_bio routine and BUG's on an error. The if statement is less noisy as a side- effect. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22btrfs: Remove set bits return from clear_extent_bitJeff Mahoney
There is only one caller of clear_extent_bit that checks the return value and it only checks if it's negative. Since there are no users of the returned bits functionality of clear_extent_bit, stop returning it and avoid complicating error handling. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22btrfs: Catch locking failures in {set,clear,convert}_extent_bitJeff Mahoney
The *_state functions can only return 0 or -EEXIST. This patch addresses the cases where those functions returning -EEXIST represent a locking failure. It handles them by panicking with an appropriate error message. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-20btrfs: remove the second argument of k[un]map_atomic()Cong Wang
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-02-23Btrfs: clear the extent uptodate bits during parent transid failuresChris Mason
If btrfs reads a block and finds a parent transid mismatch, it clears the uptodate flags on the extent buffer, and the pages inside it. But we only clear the uptodate bits in the state tree if the block straddles more than one page. This is from an old optimization from to reduce contention on the extent state tree. But it is buggy because the code that retries a read from a different copy of the block is going to find the uptodate state bits set and skip the IO. The end result of the bug is that we'll never actually read the good copy (if there is one). The fix here is to always clear the uptodate state bits, which is safe because this code is only called when the parent transid fails. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-21Btrfs: be less strict on finding next node in clear_extent_bitLiu Bo
In clear_extent_bit, it is enough that next node is adjacent in tree level. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-16Btrfs: kick out redundant stuff in convert_extent_bitLiu Bo
clear_state_bit will do merge_state for us, so kick out the redundant one. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-16Btrfs: skip states when they does not contain bits to clearLiu Bo
Clearing a range's bits is different with setting them, since we don't need to touch them when states do not contain bits we want. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-02-16Btrfs: check return value of lookup_extent_mapping() correctlyTsutomu Itoh
This patch corrects error checking of lookup_extent_mapping(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-02-16Btrfs: fix return value check of extent_io_opsTsutomu Itoh
This patch adds the check on the return value of extent_io_ops. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-02-15btrfs: delalloc for page dirtied out-of-band in fixup workerJeff Mahoney
We encountered an issue that was easily observable on s/390 systems but could really happen anywhere. The timing just seemed to hit reliably on s/390 with limited memory. The gist is that when an unexpected set_page_dirty() happened, we'd run into the BUG() in btrfs_writepage_fixup_worker since it wasn't properly set up for delalloc. This patch does the following: - Performs the missing delalloc in the fixup worker - Allow the start hook to return -EBUSY which informs __extent_writepage that it should mark the page skipped and not to redirty it. This is required since the fixup worker can fail with -ENOSPC and the page will have already been redirtied. That causes an Oops in drop_outstanding_extents later. Retrying the fixup worker could lead to an infinite loop. Deferring the page redirty also saves us some cycles since the page would be stuck in a resubmit-redirty loop until the fixup worker completes. It's not harmful, just wasteful. - If the fixup worker fails, we mark the page and mapping as errored, and end the writeback, similar to what we would do had the page actually been submitted to writeback. Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-01-26Btrfs: Check for NULL page in extent_range_uptodateMitch Harder
A user has encountered a NULL pointer kernel oops in btrfs when encountering media errors. The problem has been identified as an unhandled NULL pointer returned from find_get_page(). This modification simply checks for a NULL page, and returns with an error if found (the extent_range_uptodate() function returns 1 on errors). After testing this patch, the user reported that the error with the NULL pointer oops was solved. However, there is still a remaining problem with a thread becoming stuck in wait_on_page_locked(page) in the read_extent_buffer_pages(...) function in extent_io.c for (i = start_i; i < num_pages; i++) { page = extent_buffer_page(eb, i); wait_on_page_locked(page); if (!PageUptodate(page)) ret = -EIO; } This patch leaves the issue with the locked page yet to be resolved. Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-16Merge branch 'integrity-check-patch-v2' of ↵Chris Mason
git://btrfs.giantdisaster.de/git/btrfs into integration Conflicts: fs/btrfs/ctree.h fs/btrfs/super.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-16Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into ↵Chris Mason
integration
2012-01-04Btrfs: add nested locking mode for pathsArne Jansen
This patch adds the possibilty to read-lock an extent even if it is already write-locked from the same thread. btrfs_find_all_roots() needs this capability. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-12-21Btrfs: integrate integrity check module into btrfsStefan Behrens
This is the last part of the patch series. It modifies the btrfs code to use the integrity check module if configured to do so with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set, the only effective change is that code is added that handles the mount option to activate the integrity check. If the mount option is set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code complains in the log and the mount fails with EINVAL. Add the mount option to activate the usage of the integrity check code. Add invocation of btrfs integrity check code init and cleanup function on mount and umount, respectively. Add hook to call btrfs integrity check code version of submit_bh/submit_bio. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2011-12-08Btrfs: drop spin lock when memory alloc failsLiu Bo
Drop spin lock in convert_extent_bit() when memory alloc fails, otherwise, it will be a deadlock. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-01Btrfs: fix meta data raid-repair merge problemJan Schmidt
Commit 4a54c8c16 introduced raid-repair, killing the individual readpage_io_failed_hook entries from inode.c and disk-io.c. Commit 4bb31e92 introduced new readahead code, adding a readpage_io_failed_hook to disk-io.c. The raid-repair commit had logic to disable raid-repair, if readpage_io_failed_hook is set. Thus, the readahead commit effectively disabled raid-repair for meta data. This commit changes the logic to always attempt raid-repair when needed and call the readpage_io_failed_hook in case raid-repair fails. This is much more straight forward and should have been like that from the beginning. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Reported-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20Btrfs: sectorsize align offsets in fiemapJosef Bacik
We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere else that calls btrfs_get_extent while running xfstests 13 in a loop. This is because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets, which will end up adding mappings that are not sectorsize aligned, which will cause problems in some cases for subsequent calls to btrfs_get_extent for similar areas that are sectorsize aligned. With this patch I ran xfstests 13 in a loop for a couple of hours and didn't hit the problem that I could previously hit in at most 20 minutes. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-11-20btrfs: mirror_num should be int, not u64Jan Schmidt
My previous patch introduced some u64 for failed_mirror variables, this one makes it consistent again. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20btrfs: Fix up 32/64-bit compatibility for new ioctlsJeff Mahoney
This patch casts to unsigned long before casting to a pointer and fixes the following warnings: fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06Merge git://git.jan-o-sch.net/btrfs-unstable into integrationChris Mason
Conflicts: fs/btrfs/Makefile fs/btrfs/extent_io.c fs/btrfs/extent_io.h fs/btrfs/scrub.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06Merge branch 'for-chris' of git://github.com/sensille/linux into integrationChris Mason
Conflicts: fs/btrfs/ctree.h Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06Btrfs: ClearPageError during writepage and clean_tree_blockChris Mason
Failure testing was tripping up over stale PageError bits in metadata pages. If we have an io error on a block, and later on end up reusing it, nobody ever clears PageError on those pages. During commit, we'll find PageError and think we had trouble writing the block, which will lead to aborts and other problems. This changes clean_tree_block and the btrfs writepage code to clear the PageError bit. In both cases we're either completely done with the page or the page has good stuff and the error bit is no longer valid. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06Btrfs: make sure to flush queued bios if write_cache_pages waitsChris Mason
write_cache_pages tries to build up a large bio to stuff down the pipe. But if it needs to wait for a page lock, it needs to make sure and send down any pending writes so we don't deadlock with anyone who has the page lock and is waiting for writeback of things inside the bio. Dave Sterba triggered this as a deadlock between the autodefrag code and the extent write_cache_pages Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-10-20Btrfs: do not set EXTENT_DIRTY along with EXTENT_DELALLOCLiu Bo
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2011-10-19Btrfs: introduce convert_extent_bitJosef Bacik
If I have a range where I know a certain bit is and I want to set it to another bit the only option I have is to call set and then clear bit, which will result in 2 tree searches. This is inefficient, so introduce convert_extent_bit which will go through and set the bit I want and clear the old bit I don't want. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19Btrfs: skip looking for delalloc if we don't have ->fill_delallocJosef Bacik
We always look for delalloc bytes in our io_tree so we can fill in delalloc. This is fine in most cases, but if we're writing out the btree_inode this is just a superfluous tree search on the io_tree, and if we have a lot of metadata dirty this could be an expensive check. So instead check to see if our io_tree has a ->fill_delalloc op, and if not don't even bother doing the lookup. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-02btrfs: hooks for readaheadArne Jansen
This adds the hooks needed for readahead. In the readpage_end_io_hook, the extent state is checked for the EXTENT_READAHEAD flag. Only in this case the readahead hook is called, to keep the impact on non-ra as low as possible. Additionally, a hook for a failed IO is added, otherwise readahead would wait indefinitely for the extent to finish. Changes for v2: - eliminate race condition Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02btrfs: add an extra wait mode to read_extent_buffer_pagesArne Jansen
read_extent_buffer_pages currently has two modes, either trigger a read without waiting for anything, or wait for the I/O to finish. The former also bails when it's unable to lock the page. This patch now adds an additional parameter to allow it to block on page lock, but don't wait for completion. Changes v5: - merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and WAIT_PAGE_LOCK Change v6: - fix bug introduced in v5 Signed-off-by: Arne Jansen <sensille@gmx.net>