summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs/xfs_bmap.c
AgeCommit message (Collapse)Author
2015-06-01Merge branch 'xfs-misc-fixes-for-4.2' into for-nextDave Chinner
2015-06-01xfs: always log the inode on unwritten extent conversionBrian Foster
The fsync() requirements for crash consistency on XFS are to flush file data and force any in-core inode updates to the log. We currently check whether the inode is pinned to identify whether the log needs to be forced, since a non-zero pin count generally represents an inode that has transactions awaiting a flush to the on-disk log. This is not sufficient in all cases, however. Reports of xfstests test generic/311 failures on ppc64/s390x hosts have identified failures to fsync outstanding inode modifications due to the inode not being pinned at the time of the fsync. This occurs because certain bmap updates can complete by logging bmapbt buffers but without ever dirtying (and thus pinning) the core inode. The following is a specific incarnation of this problem: $ mount $dev /mnt -o noatime,nobarrier $ for i in $(seq 0 2 31); do \ xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \ done $ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \ hexdump /mnt/file; \ ./xfstests-dev/src/godown /mnt ... 0000000 0000 0000 0000 0000 0000 0000 0000 0000 * 0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0014000 0000 0000 0000 0000 0000 0000 0000 0000 * 00f8000 $ umount /mnt; mount ... $ hexdump /mnt/file 0000000 0000 0000 0000 0000 0000 0000 0000 0000 * 00f8000 In short, the unwritten extent conversion for the last write is lost despite the fact that an fsync executed before the filesystem was shutdown. Note that this is impossible to reproduce on v5 supers due to unconditional time callbacks for di_changecount and highly difficult to reproduce on CONFIG_HZ=1000 kernels due to those same callbacks frequently updating cmtime prior to the bmap update. CONFIG_HZ=100 reduces timer granularity enough to increase the odds that time updates are skipped and allows this to reproduce within a handful of attempts. To deal with this problem, unconditionally log the core in the unwritten extent conversion path. Fix up logflags after the extent conversion to keep the extent update code consistent with the other extent update helpers. This fixup is not necessary for the other (hole, delay) extent helpers because they execute in the block allocation codepath, which already logs the inode for other reasons (e.g., for di_nblocks). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29xfs: extent size hints can round up extents past MAXEXTLENDave Chinner
This results in BMBT corruption, as seen by this test: # mkfs.xfs -f -d size=40051712b,agcount=4 /dev/vdc .... # mount /dev/vdc /mnt/scratch # xfs_io -ft -c "extsize 16m" -c "falloc 0 30g" -c "bmap -vp" /mnt/scratch/foo which results in this failure on a debug kernel: XFS: Assertion failed: (blockcount & xfs_mask64hi(64-BMBT_BLOCKCOUNT_BITLEN)) == 0, file: fs/xfs/libxfs/xfs_bmap_btree.c, line: 211 .... Call Trace: [<ffffffff814cf0ff>] xfs_bmbt_set_allf+0x8f/0x100 [<ffffffff814cf18d>] xfs_bmbt_set_all+0x1d/0x20 [<ffffffff814f2efe>] xfs_iext_insert+0x9e/0x120 [<ffffffff814c7956>] ? xfs_bmap_add_extent_hole_real+0x1c6/0xc70 [<ffffffff814c7956>] xfs_bmap_add_extent_hole_real+0x1c6/0xc70 [<ffffffff814caaab>] xfs_bmapi_write+0x72b/0xed0 [<ffffffff811c72ac>] ? kmem_cache_alloc+0x15c/0x170 [<ffffffff814fe070>] xfs_alloc_file_space+0x160/0x400 [<ffffffff81ddcc29>] ? down_write+0x29/0x60 [<ffffffff815063eb>] xfs_file_fallocate+0x29b/0x310 [<ffffffff811d2bc8>] ? __sb_start_write+0x58/0x120 [<ffffffff811e3e18>] ? do_vfs_ioctl+0x318/0x570 [<ffffffff811cd680>] vfs_fallocate+0x140/0x260 [<ffffffff811ce6f8>] SyS_fallocate+0x48/0x80 [<ffffffff81ddec09>] system_call_fastpath+0x12/0x17 The tracepoint that indicates the extent that triggered the assert failure is: xfs_iext_insert: idx 0 offset 0 block 16777224 count 2097152 flag 1 Clearly indicating that the extent length is greater than MAXEXTLEN, which is 2097151. A prior trace point shows the allocation was an exact size match and that a length greater than MAXEXTLEN was asked for: xfs_alloc_size_done: agno 1 agbno 8 minlen 2097152 maxlen 2097152 ^^^^^^^ ^^^^^^^ We don't see this problem with extent size hints through the IO path because we can't do single IOs large enough to trigger MAXEXTLEN allocation. fallocate(), OTOH, is not limited in it's allocation sizes and so needs help here. The issue is that the extent size hint alignment is rounding up the extent size past MAXEXTLEN, because xfs_bmapi_write() is not taking into account extent size hints when calculating the maximum extent length to allocate. xfs_bmapi_reserve_delalloc() is already doing this, but direct extent allocation is not. Unfortunately, the calculation in xfs_bmapi_reserve_delalloc() is wrong, and it works only because delayed allocation extents are not limited in size to MAXEXTLEN in the in-core extent tree. hence this calculation does not work for direct allocation, and the delalloc code needs fixing. This may, in fact be the underlying bug that occassionally causes transaction overruns in delayed allocation extent conversion, so now we know it's wrong we should fix it, too. Many thanks to Brian Foster for finding this problem during review of this patch. Hence the fix, after much code reading, is to allow xfs_bmap_extsize_align() to align partial extents when full alignment would extend the alignment past MAXEXTLEN. We can safely do this because all callers have higher layer allocation loops that already handle short allocations, and so will simply run another allocation to cover the remainder of the requested allocation range that we ignored during alignment. The advantage of this approach is that it also removes the need for callers to do anything other than limit their requests to MAXEXTLEN - they don't really need to be aware of extent size hints at all. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-03-25Merge branch 'fallocate-insert-range' into for-nextDave Chinner
2015-03-25Merge branch 'xfs-misc-fixes-for-4.1-2' into for-nextDave Chinner
Conflicts: fs/xfs/libxfs/xfs_bmap.c fs/xfs/xfs_inode.c
2015-03-25xfs: Add support FALLOC_FL_INSERT_RANGE for fallocateNamjae Jeon
This patch implements fallocate's FALLOC_FL_INSERT_RANGE for XFS. 1) Make sure that both offset and len are block size aligned. 2) Update the i_size of inode by len bytes. 3) Compute the file's logical block number against offset. If the computed block number is not the starting block of the extent, split the extent such that the block number is the starting block of the extent. 4) Shift all the extents which are lying bewteen [offset, last allocated extent] towards right by len bytes. This step will make a hole of len bytes at offset. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-03-25xfs: remove xfs_bmap_sanity_check()Dave Chinner
This code is redundant now that we have verifiers that sanity check the buffers as they are read from disk. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-24Merge branch 'xfs-generic-sb-counters' into for-nextDave Chinner
Conflicts: fs/xfs/xfs_super.c
2015-02-23xfs: pass mp to XFS_WANT_CORRUPTED_RETURNEric Sandeen
Today, if we hit an XFS_WANT_CORRUPTED_RETURN we don't print any information about which filesystem hit it. Passing in the mp allows us to print the filesystem (device) name, which is a pretty critical piece of information. Tested by running fsfuzzer 'til I hit some. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23xfs: pass mp to XFS_WANT_CORRUPTED_GOTOEric Sandeen
Today, if we hit an XFS_WANT_CORRUPTED_GOTO we don't print any information about which filesystem hit it. Passing in the mp allows us to print the filesystem (device) name, which is a pretty critical piece of information. Tested by running fsfuzzer 'til I hit some. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23xfs: introduce xfs_mod_frextentsDave Chinner
Add a new helper to modify the incore counter of free realtime extents. This matches the helpers used for inode and data block counters, and removes a significant users of the xfs_mod_incore_sb() interface. Based on a patch originally from Christoph Hellwig. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23xfs: use generic percpu counters for free block counterDave Chinner
XFS has hand-rolled per-cpu counters for the superblock since before there was any generic implementation. The free block counter is special in that it is used for ENOSPC detection outside transaction contexts for for delayed allocation. This means that the counter needs to be accurate at zero. The current per-cpu counter code jumps through lots of hoops to ensure we never run past zero, but we don't need to make all those jumps with the generic counter implementation. The generic counter implementation allows us to pass a "batch" threshold at which the addition/subtraction to the counter value will be folded back into global value under lock. We can use this feature to reduce the batch size as we approach 0 in a very similar manner to the existing counters and their rebalance algorithm. If we use a batch size of 1 as we approach 0, then every addition and subtraction will be done against the global value and hence allow accurate detection of zero threshold crossing. Hence we can replace the handrolled, accurate-at-zero counters with generic percpu counters. Note: this removes just enough of the icsb infrastructure to compile without warnings. The rest will go in subsequent commits. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22Merge branch 'xfs-buf-type-fixes' into for-nextDave Chinner
2015-01-22xfs: set buf types when converting extent formatsDave Chinner
Conversion from local to extent format does not set the buffer type correctly on the new extent buffer when a symlink data is moved out of line. Fix the symlink code and leave a comment in the generic bmap code reminding us that the format-specific data copy needs to set the destination buffer type appropriately. cc: <stable@vger.kernel.org> # 3.10 to current Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22xfs: consolidate superblock logging functionsDave Chinner
We now have several superblock loggin functions that are identical except for the transaction reservation and whether it shoul dbe a synchronous transaction or not. Consolidate these all into a single function, a single reserveration and a sync flag and call it xfs_sync_sb(). Also, xfs_mod_sb() is not really a modification function - it's the operation of logging the superblock buffer. hence change the name of it to reflect this. Note that we have to change the mp->m_update_flags that are passed around at mount time to a boolean simply to indicate a superblock update is needed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22xfs: remove bitfield based superblock updatesDave Chinner
When we log changes to the superblock, we first have to write them to the on-disk buffer, and then log that. Right now we have a complex bitfield based arrangement to only write the modified field to the buffer before we log it. This used to be necessary as a performance optimisation because we logged the superblock buffer in every extent or inode allocation or freeing, and so performance was extremely important. We haven't done this for years, however, ever since the lazy superblock counters pulled the superblock logging out of the transaction commit fast path. Hence we have a bunch of complexity that is not necessary that makes writing the in-core superblock to disk much more complex than it needs to be. We only need to log the superblock now during management operations (e.g. during mount, unmount or quota control operations) so it is not a performance critical path anymore. As such, remove the complex field based logging mechanism and replace it with a simple conversion function similar to what we use for all other on-disk structures. This means we always log the entirity of the superblock, but again because we rarely modify the superblock this is not an issue for log bandwidth or CPU time. Indeed, if we do log the superblock frequently, delayed logging will minimise the impact of this overhead. [Fixed gquota/pquota inode sharing regression noticed by bfoster.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04Merge branch 'xfs-misc-fixes-for-3.19-2' into for-nextDave Chinner
Conflicts: fs/xfs/xfs_iops.c
2014-12-04xfs: fix set-but-unused warningsDave Chinner
The kernel compile doesn't turn on these checks by default, so it's only when I do a kernel-user sync that I find that there are lots of compiler warnings waiting to be fixed. Fix up these set-but-unused warnings. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04xfs: cleanup xfs_bmse_merge returnsDave Chinner
Signed-off-by: Dave Chinner <dchinner@redhat.com> xfs_bmse_merge() has a jump label for return that just returns the error value. Convert all the code to just return the error directly and use XFS_WANT_CORRUPTED_RETURN. This also allows the final call to xfs_bmbt_update() to return directly. Noticed while reviewing coccinelle return cleanup patches and wondering why the same return pattern as in xfs_bmse_shift_one() wasn't picked up by the checker pattern... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04xfs: cleanup xfs_bmse_shift_one goto messDave Chinner
xfs_bmse_shift_one() jumps around determining whether to shift or merge, making the code flow difficult to follow. Clean it up and use direct error returns (including XFS_WANT_CORRUPTED_RETURN) to make the code flow better and be easier to read. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01Merge branch 'xfs-coccinelle-cleanups' into for-nextDave Chinner
2014-12-01xfs: fix simple_return.cocci warning in xfs_bmse_shift_onekbuild test robot
fs/xfs/libxfs/xfs_bmap.c:5591:1-6: WARNING: end returns can be simpified Simplify a trivial if-return sequence. Possibly combine with a preceding function call. Generated by: scripts/coccinelle/misc/simple_return.cocci CC: Brian Foster <bfoster@redhat.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2014-11-28xfs: merge xfs_inum.h into xfs_format.hChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28xfs: merge xfs_ag.h into xfs_format.hChristoph Hellwig
More on-disk format consolidation. A few declarations that weren't on-disk format related move into better suitable spots. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28xfs: merge xfs_dinode.h into xfs_format.hChristoph Hellwig
More consolidatation for the on-disk format defintions. Note that the XFS_IS_REALTIME_INODE moves to xfs_linux.h instead as it is not related to the on disk format, but depends on a CONFIG_ option. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23xfs: writeback and inval. file range to be shifted by collapseBrian Foster
The collapse range operation currently writes the entire file before starting the collapse to avoid changes in the in-core extent list due to writeback causing the extent count to change. Now that collapse range is fsb based rather than extent index based it can sustain changes in the extent list during the shift sequence without disruption. Modify xfs_collapse_file_space() to writeback and invalidate pages associated with the range of the file to be shifted. xfs_free_file_space() currently has similar behavior, but the space free need only affect the region of the file that is freed and this could change in the future. Also update the comments to reflect the current implementation. We retain the eofblocks trim permanently as a best option for dealing with delalloc extents. We don't shift delalloc extents because this scenario only occurs with post-eof preallocation (since data must be flushed such that the cache can be invalidated and data can be shifted). That means said space must also be initialized before being shifted into the accessible region of the file only to be immediately truncated off as the last part of the collapse. In other words, the eofblocks trim will happen anyways, we just run it first to ensure the file remains in a consistent state throughout the collapse. Finally, detect and fail explicitly in the event of a delalloc extent during the extent shift. The implementation does not support delalloc extents and the caller is expected to prevent this scenario in advance as is done by collapse. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23xfs: refactor single extent shift into xfs_bmse_shift_one() helperBrian Foster
xfs_bmap_shift_extents() has a variety of conditions and error checks that make the logic difficult to follow and indent heavy. Refactor the loop body of this function into a new xfs_bmse_shift_one() helper. This simplifies the error checks, eliminates index decrement on merge hack by pushing the index increment down into the helper, and makes the code more readable by reducing multiple levels of indentation. This is a code refactor only. The behavior of extent shift and collapse range is not modified. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23xfs: refactor shift-by-merge into xfs_bmse_merge() helperBrian Foster
The extent shift mechanism in xfs_bmap_shift_extents() is complicated and handles several different, non-deterministic scenarios. These include extent shifts, extent merges and potential btree updates in either of the former scenarios. Refactor the code to be more linear and readable. The loop logic in xfs_bmap_shift_extents() and some initial error checking is adjusted slightly. The associated btree lookup and update/delete operations are condensed into single blocks of code. This reduces the number of btree-specific blocks and facilitates the separation of the merge operation into a new xfs_bmse_merge() and xfs_bmse_can_merge() helpers. This is a code refactor only. The behavior of extent shift and collapse range is not modified. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23xfs: track collapse via file offset rather than extent indexBrian Foster
The collapse range implementation uses a transaction per extent shift. The progress of the overall operation is tracked via the current extent index of the in-core extent list. This is racy because the ilock must be dropped and reacquired for each transaction according to locking and log reservation rules. Therefore, writeback to prior regions of the file is possible and can change the extent count. This changes the extent to which the current index refers and causes the collapse to fail mid operation. To avoid this problem, the entire file is currently written back before the collapse operation starts. To eliminate the need to flush the entire file, use the file offset (fsb) to track the progress of the overall extent shift operation rather than the extent index. Modify xfs_bmap_shift_extents() to unconditionally convert the start_fsb parameter to an extent index and return the file offset of the extent where the shift left off, if further extents exist. The bulk of ths function can remain based on extent index as ilock is held by the caller. xfs_collapse_file_space() now uses the fsb output as the starting point for the subsequent shift. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02xfs: don't log inode unless extent shift makes extent modificationsBrian Foster
The file collapse mechanism uses xfs_bmap_shift_extents() to collapse all subsequent extents down into the specified, previously punched out, region. This function performs some validation, such as whether a sufficient hole exists in the target region of the collapse, then shifts the remaining exents downward. The exit path of the function currently logs the inode unconditionally. While we must log the inode (and abort) if an error occurs and the transaction is dirty, the initial validation paths can generate errors before the transaction has been dirtied. This creates an unnecessary filesystem shutdown scenario, as the caller will cancel a transaction that has been marked dirty. Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications are made to the inode bmap. Only log the inode in the exit path if logflags has been set. This ensures we only have to cancel a dirty transaction if modifications have been made and prevents an unnecessary filesystem shutdown otherwise. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04Merge branch 'xfs-misc-fixes-3.17-1' into for-nextDave Chinner
2014-07-30xfs: require 64-bit sector_tChristoph Hellwig
Trying to support tiny disks only and saving a bit memory might have made sense on an SGI O2 15 years ago, but is pretty pointless today. Remove the rarely tested codepath that uses various smaller in-memory types to reduce our test matrix and make the codebase a little bit smaller and less complicated. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15Merge branch 'xfs-libxfs-restructure' into for-nextDave Chinner
2014-06-25xfs: global error sign conversionDave Chinner
Convert all the errors the core XFs code to negative error signs like the rest of the kernel and remove all the sign conversion we do in the interface layers. Errors for conversion (and comparison) found via searches like: $ git grep " E" fs/xfs $ git grep "return E" fs/xfs $ git grep " E[A-Z].*;$" fs/xfs Negation points found via searches like: $ git grep "= -[a-z,A-Z]" fs/xfs $ git grep "return -[a-z,A-D,F-Z]" fs/xfs $ git grep " -[a-z].*;" fs/xfs [ with some bits I missed from Brian Foster ] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25libxfs: move source filesDave Chinner
Move all the source files that are shared with userspace into libxfs/. This is done as one big chunk simpy to get it done quickly Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>