path: root/fs
2019-05-31chardev: add additional check for minor range overlapChengguang Xu
[ Upstream commit de36e16d1557a0b6eb328bc3516359a12ba5c25c ] The current overlap checking cannot correctly handle the case where baseminor < existing baseminor && baseminor + minorct > existing baseminor + existing minorct. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
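A minimal standalone sketch of the overlap test described above (illustrative names, not the kernel's literal chrdev code):

    /* Two half-open minor ranges [a, a + an) and [b, b + bn) overlap iff
     * each starts before the other one ends.  The older check missed the
     * case where the new range completely contains an existing one. */
    static int minor_ranges_overlap(unsigned int a, unsigned int an,
                                    unsigned int b, unsigned int bn)
    {
            return a < b + bn && b < a + an;
    }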
2019-05-31fscrypt: use READ_ONCE() to access ->i_crypt_infoEric Biggers
[ Upstream commit e37a784d8b6a1e726de5ddc7b4809c086a08db09 ] ->i_crypt_info starts out NULL and may later be locklessly set to a non-NULL value by the cmpxchg() in fscrypt_get_encryption_info(). But ->i_crypt_info is used directly, which technically is incorrect. It's a data race, and it doesn't include the data dependency barrier needed to safely dereference the pointer on at least one architecture. Fix this by using READ_ONCE() instead. Note: we don't need to use smp_load_acquire(), since dereferencing the pointer only requires a data dependency barrier, which is already included in READ_ONCE(). We also don't need READ_ONCE() in places where ->i_crypt_info is unconditionally dereferenced, since it must have already been checked. Also downgrade the cmpxchg() to cmpxchg_release(), since RELEASE semantics are sufficient on the write side. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
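A hedged sketch of the access pattern the commit describes (illustrative, not the literal fscrypt code):

    /* Reader: READ_ONCE() carries the data dependency barrier needed to
     * safely dereference the locklessly published pointer. */
    struct fscrypt_info *ci = READ_ONCE(inode->i_crypt_info);
    if (!ci)
            return -ENOKEY;
    /* ci may be dereferenced safely from here on */

    /* Writer: RELEASE semantics are sufficient when publishing. */
    if (cmpxchg_release(&inode->i_crypt_info, NULL, crypt_info) == NULL)
            crypt_info = NULL;      /* ownership passed to the inode */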
2019-05-31Fix nfs4.2 return -EINVAL when do dedupe operationXiaoli Feng
[ Upstream commit ce96e888fe48ecfa868c9a39adc03292c78a80ff ] The dedupe_file_range operation is combined into remap_file_range. But in nfs42_remap_file_range, it's skipped for dedupe operations. Before this patch: # dd if=/dev/zero of=nfs/file bs=1M count=1 # xfs_io -c "dedupe nfs/file 4k 64k 4k" nfs/file XFS_IOC_FILE_EXTENT_SAME: Invalid argument After this patch: # dd if=/dev/zero of=nfs/file bs=1M count=1 # xfs_io -c "dedupe nfs/file 4k 64k 4k" nfs/file deduped 4096/4096 bytes at offset 65536 4 KiB, 1 ops; 0.0046 sec (865.988 KiB/sec and 216.4971 ops/sec) Signed-off-by: Xiaoli Feng <fengxiaoli0714@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
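A hedged sketch of the kind of remap-flag check that produced the -EINVAL; the exact condition in nfs42_remap_file_range() may differ:

    /* Before (sketch): dedupe requests were rejected outright. */
    if (remap_flags & ~REMAP_FILE_ADVISORY)
            return -EINVAL;

    /* After (sketch): REMAP_FILE_DEDUP is also accepted, so the dedupe
     * request is handled instead of failing with -EINVAL. */
    if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
            return -EINVAL;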
2019-05-31btrfs: Don't panic when we can't find a root keyQu Wenruo
[ Upstream commit 7ac1e464c4d473b517bb784f30d40da1f842482e ] When we fail to find a root key in btrfs_update_root(), we just panic. That's definitely not cool; fix it by outputting a unique error message, aborting the current transaction and returning -EUCLEAN. This should not normally happen, as the root has been used by the callers in some way. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
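A sketch of the behaviour described above (abbreviated; the real error text and surrounding code in btrfs_update_root() differ):

    ret = btrfs_search_slot(trans, root, key, path, 0, 1);
    if (ret > 0) {
            /* Previously a BUG_ON(); now report, abort and bail out. */
            btrfs_crit(fs_info,
                       "unable to find root key (%llu %u %llu) in tree %llu",
                       key->objectid, key->type, key->offset,
                       root->root_key.objectid);
            ret = -EUCLEAN;
            btrfs_abort_transaction(trans, ret);
            goto out;
    }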
2019-05-31btrfs: fix panic during relocation after ENOSPC before writeback happensJosef Bacik
[ Upstream commit ff612ba7849964b1898fd3ccd1f56941129c6aab ] We've been seeing the following sporadically throughout our fleet panic: kernel BUG at fs/btrfs/relocation.c:4584! netversion: 5.0-0 Backtrace: #0 [ffffc90003adb880] machine_kexec at ffffffff81041da8 #1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c #2 [ffffc90003adb988] crash_kexec at ffffffff811048ad #3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a #4 [ffffc90003adb9c0] do_trap at ffffffff81019114 #5 [ffffc90003adba00] do_error_trap at ffffffff810195d0 #6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b [exception RIP: btrfs_reloc_cow_block+692] RIP: ffffffff8143b614 RSP: ffffc90003adbb68 RFLAGS: 00010246 RAX: fffffffffffffff7 RBX: ffff8806b9c32000 RCX: ffff8806aad00690 RDX: ffff880850b295e0 RSI: ffff8806b9c32000 RDI: ffff88084f205bd0 RBP: ffff880849415000 R8: ffffc90003adbbe0 R9: ffff88085ac90000 R10: ffff8805f7369140 R11: 0000000000000000 R12: ffff880850b295e0 R13: ffff88084f205bd0 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd #8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3 #9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c The way relocation moves data extents is by creating a reloc inode and preallocating extents in this inode and then copying the data into these preallocated extents. Once we've done this for all of our extents, we'll write out these dirty pages, which marks the extent written, and goes into btrfs_reloc_cow_block(). From here we get our current reloc_control, which _should_ match the reloc_control for the current block group we're relocating. However if we get an ENOSPC in this path at some point we'll bail out, never initiating writeback on this inode. Not a huge deal, unless we happen to be doing relocation on a different block group, and this block group is now rc->stage == UPDATE_DATA_PTRS. This trips the BUG_ON() in btrfs_reloc_cow_block(), because we expect to be done modifying the data inode. We are in fact done modifying the metadata for the data inode we're currently using, but not the one from the failed block group, and thus we BUG_ON(). (This happens when writeback finishes for extents from the previous group, when we are at btrfs_finish_ordered_io() which updates the data reloc tree (inode item, drops/adds extent items, etc).) Fix this by writing out the reloc data inode always, and then breaking out of the loop after that point to keep from tripping this BUG_ON() later. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [ add note from Filipe ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota reserveRobbie Ko
[ Upstream commit 39ad317315887c2cb9a4347a93a8859326ddf136 ] When doing fallocate, we first add the range to the reserve_list and then reserve the quota. If quota reservation fails, we'll release all reserved parts of reserve_list. However, cur_offset is not updated to indicate that this range has already been inserted into the list. Therefore, the same range is freed twice: once in the list_for_each_entry loop, and once at the end of the function. This will result in a WARN_ON on bytes_may_use when we free the remaining space. At the end, under the 'out' label, we have a call to: btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset); The start offset, the third argument, should be cur_offset. Everything from alloc_start to cur_offset was freed by the list_for_each_entry_safe loop. Fixes: 18513091af94 ("btrfs: update btrfs_space_info's bytes_may_use timely") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
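The corrected call described by the commit, as a sketch (argument order taken from the message above):

    /* Everything from alloc_start up to cur_offset was already released in
     * the list_for_each_entry_safe loop, so only free the tail here. */
    btrfs_free_reserved_data_space(inode, data_reserved, cur_offset,
                                   alloc_end - cur_offset);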
2019-05-31gfs2: Fix occasional glock use-after-freeAndreas Gruenbacher
[ Upstream commit 9287c6452d2b1f24ea8e84bd3cf6f3c6f267f712 ] This patch has to do with the life cycle of glocks and buffers. When gfs2 metadata or journaled data is queued to be written, a gfs2_bufdata object is assigned to track the buffer, and that is queued to various lists, including the glock's gl_ail_list to indicate it's on the active items list. Once the page associated with the buffer has been written, it is removed from the ail list, but its life isn't over until a revoke has been successfully written. So after the block is written, its bufdata object is moved from the glock's gl_ail_list to a file-system-wide list of pending revokes, sd_log_le_revoke. At that point the glock still needs to track how many revokes it contributed to that list (in gl_revokes) so that things like glock go_sync can ensure all the metadata has been not only written, but also revoked before the glock is granted to a different node. This is to guarantee journal replay doesn't replay the block once the glock has been granted to another node. Ross Lagerwall recently discovered a race in which an inode could be evicted, and its glock freed after its ail list had been synced, but while it still had unwritten revokes on the sd_log_le_revoke list. The evict decremented the glock reference count to zero, which allowed the glock to be freed. After the revoke was written, function revoke_lo_after_commit tried to adjust the glock's gl_revokes counter and clear its GLF_LFLUSH flag, at which time it referenced the freed glock. This patch fixes the problem by incrementing the glock reference count in gfs2_add_revoke when the glock's first bufdata object is moved from the glock to the global revokes list. Later, when the glock's last such bufdata object is freed, the reference count is decremented. This guarantees that whichever process finishes last (the revoke writing or the evict) will properly free the glock, and neither will reference the glock after it has been freed. Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31io_uring: use cpu_online() to check p->sq_thread_cpu instead of cpu_possible()Shenghui Wang
[ Upstream commit 7889f44dd9cee15aff1c3f7daf81ca4dfed48fc7 ] This issue was found by running the liburing/test/io_uring_setup test. When the test runs, the testcase "attempt to bind to invalid cpu" would not pass, with messages like: io_uring_setup(1, 0xbfc2f7c8), \ flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, \ resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, \ sq_thread_cpu: 2 expected -1, got 3 FAIL On my system, there is: CPU(s) possible : 0-3 CPU(s) online : 0-1 CPU(s) offline : 2-3 CPU(s) present : 0-1 The sq_thread_cpu 2 is offline on my system, so the bind should fail. But cpu_possible() will pass the check. We shouldn't be able to bind to an offline cpu. Use cpu_online() to do the check. After the change, the testcase runs as expected: EINVAL is returned for an offlined cpu. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
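A sketch of the corrected bind check (variable names illustrative, not the exact io_uring code):

    if (p->flags & IORING_SETUP_SQ_AFF) {
            unsigned int cpu = p->sq_thread_cpu;

            /* The SQPOLL thread can only be bound to a CPU that is both
             * valid and currently online, not merely possible. */
            if (cpu >= nr_cpu_ids || !cpu_online(cpu))
                    return -EINVAL;
    }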
2019-05-31gfs2: fix race between gfs2_freeze_func and unmountAbhi Das
[ Upstream commit 8f91821990fd6f170a5dca79697a441181a41b16 ] As part of the freeze operation, gfs2_freeze_func() is left blocking on a request to hold the sd_freeze_gl in SH. This glock is held in EX by the gfs2_freeze() code. A subsequent call to gfs2_unfreeze() releases the EXclusively held sd_freeze_gl, which allows gfs2_freeze_func() to acquire it in SH and resume its operation. gfs2_unfreeze(), however, doesn't wait for gfs2_freeze_func() to complete. If a umount is issued right after unfreeze, it could result in an inconsistent filesystem because some journal data (statfs update) isn't written out. Refer to commit 24972557b12c for a more detailed explanation of how freeze/unfreeze work. This patch causes gfs2_unfreeze() to wait for gfs2_freeze_func() to complete before returning to the user. Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31NFS: make nfs_match_client killableRoberto Bergantinos Corpas
[ Upstream commit 950a578c6128c2886e295b9c7ecb0b6b22fcc92b ] We don't actually do anything with the return value from nfs_wait_client_init_complete() in nfs_match_client(); as a consequence, if we get a fatal signal while the client is not fully initialised, we'll loop back to the "again" label. This has been proven to cause soft lockups in some scenarios (no-carrier but configured network interfaces). Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
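A hedged sketch of the killable wait (the reference counting and locking around the real nfs_match_client() loop are omitted):

    error = nfs_wait_client_init_complete(clp);
    if (error < 0) {
            /* A fatal signal arrived: drop the reference and bail out
             * instead of looping back to the "again" label. */
            nfs_put_client(clp);
            return ERR_PTR(error);
    }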
2019-05-31afs: Fix getting the afs.fid xattrDavid Howells
[ Upstream commit a2f611a3dc317d8ea1c98ad6c54b911cf7f93193 ] The AFS3 FID is three 32-bit unsigned numbers and is represented as three up-to-8-hex-digit numbers separated by colons to the afs.fid xattr. However, with the advent of support for YFS, the FID is now a 64-bit volume number, a 96-bit vnode/inode number and a 32-bit uniquifier (as before). Whilst the sprintf in afs_xattr_get_fid() has been partially updated (it currently ignores the upper 32 bits of the 96-bit vnode number), the size of the stack-based buffer has not been increased to match, thereby allowing stack corruption to occur. Fix this by increasing the buffer size appropriately and conditionally including the upper part of the vnode number if it is non-zero. The latter requires the lower part to be zero-padded if the upper part is non-zero. Fixes: 3b6492df4153 ("afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-05-31gfs2: Fix lru_count going negativeRoss Lagerwall
[ Upstream commit 7881ef3f33bb80f459ea6020d1e021fc524a6348 ] Under certain conditions, lru_count may drop below zero resulting in a large amount of log spam like this: vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \ negative objects to delete nr=-1 This happens as follows: 1) A glock is moved from lru_list to the dispose list and lru_count is decremented. 2) The dispose function calls cond_resched() and drops the lru lock. 3) Another thread takes the lru lock and tries to add the same glock to lru_list, checking if the glock is on an lru list. 4) It is on a list (actually the dispose list) and so it avoids incrementing lru_count. 5) The glock is moved to lru_list. 6) The original thread doesn't dispose it because it has been re-added to the lru list but the lru_count has still decreased by one. Fix by checking if the LRU flag is set on the glock rather than checking if the glock is on some list, and rearrange the code so that the LRU flag is added/removed precisely when the glock is added/removed from lru_list. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
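A sketch of the add-to-LRU side after the change (illustrative; the real gfs2 code handles removal symmetrically):

    spin_lock(&lru_lock);
    if (!test_bit(GLF_LRU, &gl->gl_flags)) {
            /* Test the GLF_LRU flag, not list membership, so a glock that
             * is temporarily on the dispose list is still counted once. */
            list_add_tail(&gl->gl_lru, &lru_list);
            set_bit(GLF_LRU, &gl->gl_flags);
            atomic_inc(&lru_count);
    }
    spin_unlock(&lru_lock);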
2019-05-31Revert "btrfs: Honour FITRIM range constraints during free space trim"David Sterba
This reverts commit eb432217d775a90c061681c0dfa3c7abfba75123. There is currently no corresponding patch in master due to additional changes that would be significantly different from plain revert in the respective stable branch. The range argument was not handled correctly and could cause trim to overlap allocated areas or reach beyond the end of the device. The address space that fitrim normally operates on is in logical coordinates, while the discards are done on the physical device extents. This distinction cannot be made with the current ioctl interface and caused the confusion. The bug depends on the layout of block groups and does not always happen. The whole-fs trim (run by default by the fstrim tool) is not affected. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31acct_on(): don't mess with freeze protectionAl Viro
commit 9419a3191dcb27f24478d288abaab697228d28e6 upstream. What happens there is that we are replacing file->path.mnt of a file we'd just opened with a clone and we need the write count contribution to be transferred from original mount to new one. That's it. We do *NOT* want any kind of freeze protection for the duration of switchover. IOW, we should just use __mnt_{want,drop}_write() for that switchover; no need to bother with mnt_{want,drop}_write() there. Tested-by: Amir Goldstein <amir73il@gmail.com> Reported-by: syzbot+2a73a6ea9507b7112141@syzkaller.appspotmail.com Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31ovl: relax WARN_ON() for overlapping layers use caseAmir Goldstein
commit acf3062a7e1ccf67c6f7e7c28671a6708fde63b0 upstream. This nasty little syzbot repro: https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000 Creates overlay mounts where the same directory is both in upper and lower layers. Simplified example: mkdir foo work mount -t overlay none foo -o"lowerdir=.,upperdir=foo,workdir=work" The repro runs several threads in parallel that attempt to chdir into foo and attempt to symlink/rename/exec/mkdir the file bar. The repro hits a WARN_ON() I placed in ovl_instantiate(), which suggests that an overlay inode already exists in cache and is hashed by the pointer of the real upper dentry that ovl_create_real() has just created. At the point of the WARN_ON(), the overlay dir inode lock and the upper dir inode lock are held, so at first I did not see how this was possible. On a closer look, I see that after ovl_create_real(), because of the overlapping upper and lower layers, a lookup by another thread can find the file foo/bar that was just created in the upper layer, at overlay path foo/foo/bar, and hash an overlay inode with the new real dentry as lower dentry. This is possible because the overlay directory foo/foo is not locked and the upper dentry foo/bar is in dcache, so ovl_lookup() can find it without taking the upper dir inode shared lock. Overlapping layers is considered a wrong setup which would result in unexpected behavior, but it shouldn't crash the kernel and it shouldn't trigger WARN_ON() either, so relax this WARN_ON() and leave a pr_warn() instead to cover all cases of failure to get an overlay inode. The error returned from failure to insert a new inode to the cache with inode_insert5() was changed to -EEXIST, to distinguish it from the error -ENOMEM returned on failure to get/allocate an inode with iget5_locked(). Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com Fixes: 01b39dcc9568 ("ovl: use inode_insert5() to hash a newly...") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31btrfs: sysfs: don't leak memory when failing add fsidTobin C. Harding
commit e32773357d5cc271b1d23550b3ed026eb5c2a468 upstream. A failed call to kobject_init_and_add() must be followed by a call to kobject_put(). Currently in the error path when adding fs_devices we are missing this call. This could be fixed by calling btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid(). Here we choose the second option because it prevents the slightly unusual error path handling requirements of kobject from leaking out into btrfs functions. Add a call to kobject_put() in the error path of kobject_init_and_add(). This causes the release method to be called if kobject_init_and_add() fails. open_ctree() is the function that calls btrfs_sysfs_add_fsid() and the error code in this function is already written with the assumption that the release method is called during the error path of open_ctree() (as seen by the call to btrfs_sysfs_remove_fsid() under the fail_fsdev_sysfs label). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31btrfs: sysfs: Fix error path kobject memory leakTobin C. Harding
commit 450ff8348808a89cc27436771aa05c2b90c0eef1 upstream. If a call to kobject_init_and_add() fails we must call kobject_put() otherwise we leak memory. Calling kobject_put() when kobject_init_and_add() fails drops the refcount back to 0 and calls the ktype release method (which in turn calls the percpu destroy and kfree). Add call to kobject_put() in the error path of call to kobject_init_and_add(). Cc: stable@vger.kernel.org # v4.4+ Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
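The generic kobject error-path rule that both sysfs fixes above apply, as a hedged sketch (the btrfs field and parent names here are illustrative):

    ret = kobject_init_and_add(&fs_devs->fsid_kobj, &btrfs_ktype,
                               parent_kobj, "%pU", fs_devs->fsid);
    if (ret) {
            /* A failed kobject_init_and_add() still holds a reference, so
             * drop it; kobject_put() then invokes the ktype's release
             * method, which frees the associated memory. */
            kobject_put(&fs_devs->fsid_kobj);
            return ret;
    }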
2019-05-31Btrfs: fix race between ranged fsync and writeback of adjacent rangesFilipe Manana
commit 0c713cbab6200b0ab6473b50435e450a6e1de85d upstream. When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode) that happens to be ranged, which happens during a msync() or writes for files opened with O_SYNC for example, we can end up with a corrupt log, due to different file extent items representing ranges that overlap with each other, or hit some assertion failures. When doing a ranged fsync we only flush delalloc and wait for ordered exents within that range. If while we are logging items from our inode ordered extents for adjacent ranges complete, we end up in a race that can make us insert the file extent items that overlap with others we logged previously and the assertion failures. For example, if tree-log.c:copy_items() receives a leaf that has the following file extents items, all with a length of 4K and therefore there is an implicit hole in the range 68K to 72K - 1: (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ... It copies them to the log tree. However due to the need to detect implicit holes, it may release the path, in order to look at the previous leaf to detect an implicit hole, and then later it will search again in the tree for the first file extent item key, with the goal of locking again the leaf (which might have changed due to concurrent changes to other inodes). However when it locks again the leaf containing the first key, the key corresponding to the extent at offset 72K may not be there anymore since there is an ordered extent for that range that is finishing (that is, somewhere in the middle of btrfs_finish_ordered_io()), and it just removed the file extent item but has not yet replaced it with a new file extent item, so the part of copy_items() that does hole detection will decide that there is a hole in the range starting from 68K to 76K - 1, and therefore insert a file extent item to represent that hole, having a key offset of 68K. After that we now have a log tree with 2 different extent items that have overlapping ranges: 1) The file extent item copied before copy_items() released the path, which has a key offset of 72K and a length of 4K, representing the file range 72K to 76K - 1. 2) And a file extent item representing a hole that has a key offset of 68K and a length of 8K, representing the range 68K to 76K - 1. This item was inserted after releasing the path, and overlaps with the extent item inserted before. The overlapping extent items can cause all sorts of unpredictable and incorrect behaviour, either when replayed or if a fast (non full) fsync happens later, which can trigger a BUG_ON() when calling btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a trace like the following: [61666.783269] ------------[ cut here ]------------ [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182! [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP (...) 
[61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000 [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs] [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246 [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000 [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937 [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000 [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418 [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000 [61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000 [61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0 [61666.786253] Call Trace: [61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs] [61666.786253] ? time_hardirqs_on+0x9/0x14 [61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs] [61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs] [61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs] [61666.786253] ? lock_acquire+0x131/0x1c5 [61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs] [61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs] [61666.786253] ? arch_local_irq_save+0x9/0xc [61666.786253] ? lockref_get_not_zero+0x2c/0x34 [61666.786253] ? rcu_read_unlock+0x3e/0x5d [61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs] [61666.786253] btrfs_sync_file+0x317/0x42c [btrfs] [61666.786253] vfs_fsync_range+0x8c/0x9e [61666.786253] SyS_msync+0x13c/0x1c9 [61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad A sample of a corrupt log tree leaf with overlapping extents I got from running btrfs/072: item 14 key (295 108 200704) itemoff 2599 itemsize 53 extent data disk bytenr 0 nr 0 extent data offset 0 nr 458752 ram 458752 item 15 key (295 108 659456) itemoff 2546 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 606208 nr 163840 ram 770048 item 16 key (295 108 663552) itemoff 2493 itemsize 53 extent data disk bytenr 4343541760 nr 770048 extent data offset 610304 nr 155648 ram 770048 item 17 key (295 108 819200) itemoff 2440 itemsize 53 extent data disk bytenr 4334788608 nr 4096 extent data offset 0 nr 4096 ram 4096 The file extent item at offset 659456 (item 15) ends at offset 823296 (659456 + 163840) while the next file extent item (item 16) starts at offset 663552. Another different problem that the race can trigger is a failure in the assertions at tree-log.c:copy_items(), which expect that the first file extent item key we found before releasing the path exists after we have released path and that the last key we found before releasing the path also exists after releasing the path: $ cat -n fs/btrfs/tree-log.c 4080 if (need_find_last_extent) { 4081 /* btrfs_prev_leaf could return 1 without releasing the path */ 4082 btrfs_release_path(src_path); 4083 ret = btrfs_search_slot(NULL, inode->root, &first_key, 4084 src_path, 0, 0); 4085 if (ret < 0) 4086 return ret; 4087 ASSERT(ret == 0); (...) 4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) { 4104 ret = btrfs_next_leaf(inode->root, src_path); 4105 if (ret < 0) 4106 return ret; 4107 ASSERT(ret == 0); 4108 src = src_path->nodes[0]; 4109 i = 0; 4110 need_find_last_extent = true; 4111 } (...) 
The second assertion implicitly expects that the last key before the path release still exists, because the surrounding while loop only stops after we have found that key. When this assertion fails it produces a stack like this: [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107 [139590.037406] ------------[ cut here ]------------ [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546! [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1 (...) [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs] (...) [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282 [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000 [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868 [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000 [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013 [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001 [139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000 [139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0 [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [139590.044250] Call Trace: [139590.044631] copy_items+0xa3f/0x1000 [btrfs] [139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs] [139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs] [139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs] [139590.046143] ? do_raw_spin_unlock+0x49/0xc0 [139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs] [139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs] [139590.047592] __vfs_write+0x129/0x1c0 [139590.047932] vfs_write+0xc2/0x1b0 [139590.048270] ksys_write+0x55/0xc0 [139590.048608] do_syscall_64+0x60/0x1b0 [139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe [139590.049287] RIP: 0033:0x7f2efc4be190 (...) [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190 [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003 [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60 [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003 [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000 (...) [139590.055128] ---[ end trace 193f35d0215cdeeb ]--- So fix this race between a full ranged fsync and writeback of adjacent ranges by flushing all delalloc and waiting for all ordered extents to complete before logging the inode. This is the simplest way to solve the problem because currently the full fsync path does not deal with ranges at all (it assumes a full range from 0 to LLONG_MAX) and it always needs to look at adjacent ranges for hole detection. For use cases of ranged fsyncs this can make a few fsyncs slower but on the other hand it can make some following fsyncs to other ranges do less work or no need to do anything at all. A full fsync is rare anyway and happens only once after loading/creating an inode and once after less common operations such as a shrinking truncate. 
This issue has existed for a long time, and was often triggered by generic/127, because it does mmap'ed writes and msync (which triggers a ranged fsync). Adding support for the tree checker to detect overlapping extents (next patch in the series) and trigger a WARN() when such cases are found, and then calling btrfs_check_leaf_full() at the end of btrfs_insert_file_extent(), made the issue much easier to detect. Running btrfs/072 with that change to the tree checker and making fsstress open files always with O_SYNC made it much easier to trigger the issue (as triggering it with generic/127 is very rare). CC: stable@vger.kernel.org # 3.16+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31Btrfs: avoid fallback to transaction commit during fsync of files with holesFilipe Manana
commit ebb929060aeb162417b4c1307e63daee47b208d9 upstream. When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a file that has holes and has file extent items spanning two or more leafs, we can end up falling to back to a full transaction commit due to a logic bug that leads to failure to insert a duplicate file extent item that is meant to represent a hole between the last file extent item of a leaf and the first file extent item in the next leaf. The failure (EEXIST error) leads to a transaction commit (as most errors when logging an inode do). For example, we have the two following leafs: Leaf N: ----------------------------------------------- | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) | ----------------------------------------------- The file extent item at the end of leaf N has a length of 4Kb, representing the file range from 64K to 68K - 1. Leaf N + 1: ----------------------------------------------- | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... | ----------------------------------------------- The file extent item at the first slot of leaf N + 1 has a length of 4Kb too, representing the file range from 72K to 76K - 1. During the full fsync path, when we are at tree-log.c:copy_items() with leaf N as a parameter, after processing the last file extent item, that represents the extent at offset 64K, we take a look at the first file extent item at the next leaf (leaf N + 1), and notice there's a 4K hole between the two extents, and therefore we insert a file extent item representing that hole, starting at file offset 68K and ending at offset 72K - 1. However we don't update the value of *last_extent, which is used to represent the end offset (plus 1, non-inclusive end) of the last file extent item inserted in the log, so it stays with a value of 68K and not with a value of 72K. Then, when copy_items() is called for leaf N + 1, because the value of *last_extent is smaller then the offset of the first extent item in the leaf (68K < 72K), we look at the last file extent item in the previous leaf (leaf N) and see it there's a 4K gap between it and our first file extent item (again, 68K < 72K), so we decide to insert a file extent item representing the hole, starting at file offset 68K and ending at offset 72K - 1, this insertion will fail with -EEXIST being returned from btrfs_insert_file_extent() because we already inserted a file extent item representing a hole for this offset (68K) in the previous call to copy_items(), when processing leaf N. The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(), which falls back to a full transaction commit. Fix this by adjusting *last_extent after inserting a hole when we had to look at the next leaf. Fixes: 4ee3fad34a9c ("Btrfs: fix fsync after hole punching when using no-holes feature") Cc: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31Btrfs: do not abort transaction at btrfs_update_root() after failure to COW pathFilipe Manana
commit 72bd2323ec87722c115a5906bc6a1b31d11e8f54 upstream. Currently when we fail to COW a path at btrfs_update_root() we end up always aborting the transaction. However, all the current callers of btrfs_update_root() are able to deal with errors returned from it: many end up aborting the transaction themselves (directly or not, such as the transaction commit path), others BUG_ON() or just gracefully cancel whatever they were doing. When syncing the fsync log, we call btrfs_update_root() through tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log sync code does not abort the transaction; instead it gracefully handles the error and returns -EAGAIN to the fsync handler, so that it falls back to a transaction commit. Any error other than -ENOSPC makes the log sync code abort the transaction. So remove the transaction abort from btrfs_update_root() when we fail to COW a path to update the root item, so that if an -ENOSPC failure happens we avoid aborting the current transaction and have a chance of the fsync succeeding after falling back to a transaction commit. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413 Fixes: 79787eaab46121 ("btrfs: replace many BUG_ONs with proper error handling") Cc: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31btrfs: Check the compression level before getting a workspaceJohnny Chang
commit 2b90883c561ddcc641741c2e4df1f702a4f2acb8 upstream. When a file's compression property is set to zlib or zstd but the compression mount option is not set, btrfs will try to compress the file with the default compression level. But in btrfs_compress_pages(), it calls get_workspace() with level = 0. This will return a workspace with a wrong compression level. For zlib, the compression level in the workspace will be 0 (that means "store only"). And for zstd, the compression level in the workspace will be 1, not the default level 3. How to reproduce: mkfs -t btrfs /dev/sdb mount /dev/sdb /mnt/ mkdir /mnt/zlib btrfs property set /mnt/zlib/ compression zlib dd if=/dev/zero of=/mnt/zlib/compression-friendly-file-10M bs=1M count=10 sync btrfs-debugfs -f /mnt/zlib/compression-friendly-file-10M btrfs-debugfs output: * before: ... (258 9961472): ram 524288 disk 1106247680 disk_size 524288 file: ... extents 20 disk size 10485760 logical size 10485760 ratio 1.00 * after: ... (258 10354688): ram 131072 disk 14217216 disk_size 4096 file: ... extents 80 disk size 327680 logical size 10485760 ratio 32.00 The steps for zstd are similar, but one needs to add a debugging message to show the level of the returned workspace in zstd_get_workspace(). This commit adds a check of the compression level before getting a workspace, by calling set_level(). CC: stable@vger.kernel.org # 5.1+ Signed-off-by: Johnny Chang <johnnyc@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31btrfs: don't double unlock on error in btrfs_punch_holeJosef Bacik
commit 8fca955057b9c58467d1b231e43f19c4cf26ae8c upstream. If we have an error writing out a delalloc range in btrfs_punch_hole_lock_range we'll unlock the inode and then goto out_only_mutex, where we will again unlock the inode. This is bad, don't do this. Fixes: f27451f22996 ("Btrfs: add support for fallocate's zero range operation") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31gfs2: Fix sign extension bug in gfs2_update_statsAndreas Gruenbacher
commit 5a5ec83d6ac974b12085cd99b196795f14079037 upstream. Commit 4d207133e9c3 changed the types of the statistic values in struct gfs2_lkstats from s64 to u64. Because of that, what should be a signed value in gfs2_update_stats turned into an unsigned value. When shifted right, we end up with a large positive value instead of a small negative value, which results in an incorrect variance estimate. Fixes: 4d207133e9c3 ("gfs2: Make statistics unsigned, suitable for use with do_div()") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: stable@vger.kernel.org # v4.4+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
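A minimal standalone illustration of the sign-extension pitfall the commit fixes (not the gfs2 code itself; the shift mirrors how the variance estimate is adjusted):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t sample = 100, mean = 400;
            uint64_t udelta = sample - mean;                      /* wraps to a huge unsigned value */
            int64_t  sdelta = (int64_t)sample - (int64_t)mean;    /* -300, the intended small value */

            /* Shifting the unsigned value right yields a large positive
             * number, corrupting the estimate; the signed shift gives the
             * expected small negative adjustment (on typical compilers). */
            printf("unsigned: %llu\n", (unsigned long long)(udelta >> 3));
            printf("signed:   %lld\n", (long long)(sdelta >> 3));
            return 0;
    }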
2019-05-31ext4: wait for outstanding dio during truncate in nojournal modeJan Kara
commit 82a25b027ca48d7ef197295846b352345853dfa8 upstream. We didn't wait for outstanding direct IO during truncate in nojournal mode (as we skip orphan handling in that case). This can lead to fs corruption or stale data exposure if truncate ends up freeing blocks and these get reallocated before direct IO finishes. Fix the condition determining whether the wait is necessary. CC: stable@vger.kernel.org Fixes: 1c9114f9c0f1 ("ext4: serialize unlocked dio reads with truncate") Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-31ext4: do not delete unlinked inode from orphan list on failed truncateJan Kara
commit ee0ed02ca93ef1ecf8963ad96638795d55af2c14 upstream. It is possible that an unlinked inode enters ext4_setattr() (e.g. if somebody calls ftruncate(2) on an unlinked but still open file). In such a case we should not delete the inode from the orphan list if truncate fails. Note that this is mostly a theoretical concern, as the filesystem is corrupted if we reach this path anyway, but let's be consistent in our orphan handling. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25btrfs: reloc: Fix NULL pointer dereference due to expanded reloc_root lifespanQu Wenruo
commit 10995c0491204c861948c9850939a7f4e90760a4 upstream. Commit d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots()") expands the life span of root->reloc_root. This breaks certain checks of fs_info->reloc_ctl. Before that commit, if we have a root with a valid reloc_root, then it's ensured to have fs_info->reloc_ctl. But now since reloc_root doesn't always mean a valid fs_info->reloc_ctl, such a check is unreliable and can cause the following NULL pointer dereference: BUG: unable to handle kernel NULL pointer dereference at 00000000000005c1 IP: btrfs_reloc_pre_snapshot+0x20/0x50 [btrfs] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 10379 Comm: snapperd Not tainted Call Trace: create_pending_snapshot+0xd7/0xfc0 [btrfs] create_pending_snapshots+0x8e/0xb0 [btrfs] btrfs_commit_transaction+0x2ac/0x8f0 [btrfs] btrfs_mksubvol+0x561/0x570 [btrfs] btrfs_ioctl_snap_create_transid+0x189/0x190 [btrfs] btrfs_ioctl_snap_create_v2+0x102/0x150 [btrfs] btrfs_ioctl+0x5c9/0x1e60 [btrfs] do_vfs_ioctl+0x90/0x5f0 SyS_ioctl+0x74/0x80 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x7fd7cdab8467 Fix it by explicitly checking fs_info->reloc_ctl rather than relying on the implied root->reloc_root. Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25fuse: Add FOPEN_STREAM to use stream_open()Kirill Smelkov
commit bbd84f33652f852ce5992d65db4d020aba21f882 upstream. Starting from commit 9c225f2655e3 ("vfs: atomic f_pos accesses as per POSIX") files opened even via nonseekable_open gate read and write via lock and do not allow them to be run simultaneously. This can create read vs write deadlock if a filesystem is trying to implement a socket-like file which is intended to be simultaneously used for both read and write from filesystem client. See commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") for details and e.g. commit 581d21a2d02a ("xenbus: fix deadlock on writes to /proc/xen/xenbus") for a similar deadlock example on /proc/xen/xenbus. To avoid such deadlock it was tempting to adjust fuse_finish_open to use stream_open instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE, and in particular GVFS which actually uses offset in its read and write handlers https://codesearch.debian.net/search?q=-%3Enonseekable+%3D https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346 https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481 so if we would do such a change it will break a real user. Add another flag (FOPEN_STREAM) for filesystem servers to indicate that the opened handler is having stream-like semantics; does not use file position and thus the kernel is free to issue simultaneous read and write request on opened file handle. This patch together with stream_open() should be added to stable kernels starting from v3.14+. This will allow to patch OSSPD and other FUSE filesystems that provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE in open handler and this way avoid the deadlock on all kernel versions. This should work because fuse_finish_open ignores unknown open flags returned from a filesystem and so passing FOPEN_STREAM to a kernel that is not aware of this flag cannot hurt. In turn the kernel that is not aware of FOPEN_STREAM will be < v3.14 where just FOPEN_NONSEEKABLE is sufficient to implement streams without read vs write deadlock. Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Kirill Smelkov <kirr@nexedi.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
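A sketch of how the kernel side consumes the new flag in fuse_finish_open(), per the commit description (simplified):

    if (ff->open_flags & FOPEN_STREAM)
            stream_open(inode, file);        /* no f_pos: reads and writes may run concurrently */
    else if (ff->open_flags & FOPEN_NONSEEKABLE)
            nonseekable_open(inode, file);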
2019-05-25ceph: flush dirty inodes before proceeding with remountJeff Layton
commit 00abf69dd24f4444d185982379c5cc3bb7b6d1fc upstream. xfstest generic/452 was triggering a "Busy inodes after umount" warning. ceph was allowing the mount to go read-only without first flushing out dirty inodes in the cache. Ensure we sync out the filesystem before allowing a remount to proceed. Cc: stable@vger.kernel.org Link: http://tracker.ceph.com/issues/39571 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
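A hedged sketch of the described fix: a remount handler that writes back dirty data before the remount (e.g. to read-only) proceeds. The function name and body are illustrative, not the exact ceph code:

    static int ceph_remount(struct super_block *sb, int *flags, char *data)
    {
            /* Flush dirty inodes so no busy inodes remain if the filesystem
             * goes read-only and is unmounted shortly afterwards. */
            sync_filesystem(sb);
            return 0;
    }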
2019-05-25fsnotify: fix unlink performance regressionAmir Goldstein
commit 4d8e7055a4058ee191296699803c5090e14f0dff upstream. __fsnotify_parent() has an optimization in place to avoid unneeded take_dentry_name_snapshot(). When fsnotify_nameremove() was changed not to call __fsnotify_parent(), we left out the optimization. Kernel test robot reported a 5% performance regression in concurrent unlink() workload. Reported-by: kernel test robot <rong.a.chen@intel.com> Link: https://lore.kernel.org/lkml/20190505062153.GG29809@shao2-debian/ Link: https://lore.kernel.org/linux-fsdevel/20190104090357.GD22409@quack2.suse.cz/ Fixes: 5f02a8776384 ("fsnotify: annotate directory entry modification events") CC: stable@vger.kernel.org Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25ovl: fix missing upper fs freeze protection on copy up for ioctlAmir Goldstein
commit 3428030da004a1128cbdcf93dc03e16f184d845b upstream. Generalize the helper ovl_open_maybe_copy_up() and use it to copy up file with data before FS_IOC_SETFLAGS ioctl. The FS_IOC_SETFLAGS ioctl is a bit of an odd ball in vfs, which probably caused the confusion. File may be open O_RDONLY, but ioctl modifies the file. VFS does not call mnt_want_write_file() nor lock inode mutex, but fs-specific code for FS_IOC_SETFLAGS does. So ovl_ioctl() calls mnt_want_write_file() for the overlay file, and fs-specific code calls mnt_want_write_file() for upper fs file, but there was no call for ovl_want_write() for copy up duration which prevents overlayfs from copying up on a frozen upper fs. Fixes: dab5ca8fd9dd ("ovl: add lsattr/chattr support") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Amir Goldstein <amir73il@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25fuse: honor RLIMIT_FSIZE in fuse_file_fallocateLiu Bo
commit 0cbade024ba501313da3b7e5dd2a188a6bc491b5 upstream. fstests generic/228 reported that fuse fallocate does not honor what 'ulimit -f' has set. This adds the necessary inode_newsize_ok() check. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Fixes: 05ba1f082300 ("fuse: add FALLOCATE operation") Cc: <stable@vger.kernel.org> # v3.5 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
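A hedged sketch of the added check in fuse_file_fallocate() (condition simplified):

    if (!(mode & FALLOC_FL_KEEP_SIZE)) {
            /* inode_newsize_ok() enforces RLIMIT_FSIZE and s_maxbytes,
             * returning -EFBIG (and raising SIGXFSZ) when exceeded. */
            err = inode_newsize_ok(inode, offset + length);
            if (err)
                    goto out;
    }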
2019-05-25fuse: fix writepages on 32bitMiklos Szeredi
commit 9de5be06d0a89ca97b5ab902694d42dfd2bb77d2 upstream. Writepage requests were cropped to i_size & 0xffffffff, which meant that mmaped writes to any file larger than 4G might be silently discarded. Fix by storing the file size in a properly sized variable (loff_t instead of size_t). Reported-by: Antonio SJ Musumeci <trapexit@spawn.link> Fixes: 6eaf4782eb09 ("fuse: writepages: crop secondary requests") Cc: <stable@vger.kernel.org> # v3.13 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
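A minimal illustration of the type fix described above (not the literal fuse code):

    /* Storing the file size in a 32-bit size_t truncated it on 32-bit
     * kernels (i_size & 0xffffffff); loff_t keeps the full 64-bit value. */
    loff_t crop = i_size_read(inode);   /* was: size_t crop = i_size_read(inode); */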
2019-05-25PNFS fallback to MDS if no deviceid foundOlga Kornievskaia
commit b1029c9bc078a6f1515f55dd993b507dcc7e3440 upstream. If we fail to find a good deviceid while trying to use pNFS, fall back to doing IO to the MDS instead of propagating an error back. Currently, the code fails the IO with EINVAL. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Fixes: 8d40b0f14846f ("NFS filelayout: call GETDEVICEINFO after pnfs_layout_process completes") Cc: stable@vger.kernel.org # v4.11+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25NFS4: Fix v4.0 client state corruption when mountZhangXiaoxu
commit f02f3755dbd14fb935d24b14650fff9ba92243b8 upstream. The stat command with a soft mount never returns after the server is stopped. When allocating a new client, the state of the client is set to NFS4CLNT_LEASE_EXPIRED. When the server is stopped, the state manager will run and recover according to the state. But since the state is NFS4CLNT_LEASE_EXPIRED, it will drain the slot table and send other tasks to the wait queue until the client has recovered. The stat command then hangs. When discovering server trunking, the client will renew the lease, but the client state check leads to client state corruption. So we need to call the state manager to recover when we detect server IP trunking. Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25cifs: fix strcat buffer overflow and reduce raciness in smb21_set_oplock_level()Christoph Probst
commit 6a54b2e002c9d00b398d35724c79f9fe0d9b38fb upstream. Change strcat to strncpy in the "None" case to fix a buffer overflow when cinode->oplock is reset to 0 by another thread accessing the same cinode. It is never valid to append "None" to any other message. Consolidate multiple writes to cinode->oplock to reduce raciness. Signed-off-by: Christoph Probst <kernel@probst.it> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
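A hedged sketch of the overflow fix (the buffer size and condition are illustrative, not the exact cifs code):

    char message[16];

    if (oplock_is_none)     /* hypothetical condition name */
            /* Overwrite the buffer rather than strcat()-ing "None" onto a
             * message another thread may have written concurrently. */
            strncpy(message, "None", sizeof(message));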
2019-05-25cifs: fix credits leak for SMB1 oplock breaksRonnie Sahlberg
commit d69cb728e70c40268762182a62f5d5d6fa51c5b2 upstream. For SMB1 oplock breaks we would grab one credit while sending the PDU but we would never release the credit back since we will never receive a response to this from the server. Eventually this would lead to a hang once all credits are leaked. Fix this by defining a new flag CIFS_NO_SRV_RSP which indicates that there is no server response to this command and thus we need to add any credits back immediately after sending the PDU. CC: Stable <stable@vger.kernel.org> #v5.0+ Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25dcache: sort the freeing-without-RCU-delay mess for good.Al Viro
commit 5467a68cbf6884c9a9d91e2a89140afb1839c835 upstream. For lockless accesses to dentries we don't have pinned we rely (among other things) upon having an RCU delay between dropping the last reference and actually freeing the memory. On the other hand, for things like pipes and sockets we neither do that kind of lockless access, nor want to deal with the overhead of an RCU delay every time a socket gets closed. So delay was made optional - setting DCACHE_RCUACCESS in ->d_flags made sure it would happen. We tried to avoid setting it unless we knew we needed it. Unfortunately, that had led to a recurring class of bugs, in which we missed the need to set it. We only really need it for dentries that are created by d_alloc_pseudo(), so let's not bother with trying to be smart - just make having an RCU delay the default. The ones that do *not* get it set the replacement flag (DCACHE_NORCU) and we'd better use that sparingly. d_alloc_pseudo() is the only such user right now. FWIW, the race that finally prompted that switch had been between __lock_parent() of immediate subdirectory of what's currently the root of a disconnected tree (e.g. from open-by-handle in progress) racing with d_splice_alias() elsewhere picking another alias for the same inode, either on outright corrupted fs image, or (in case of open-by-handle on NFS) that subdirectory having been just moved on server. It's not easy to hit, so the sky is not falling, but that's not the first race on similar missed cases and the logic for setting DCACHE_RCUACCESS has gotten ridiculously convoluted. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-25proc: prevent changes to overridden credentialsPaul Moore
commit 35a196bef449b5824033865b963ed9a43fb8c730 upstream. Prevent userspace from changing the /proc/PID/attr values if the task's credentials are currently overridden. This not only makes sense conceptually, it also prevents some really bizarre error cases caused when trying to commit credentials to a task with overridden credentials. Cc: <stable@vger.kernel.org> Reported-by: "chengjian (D)" <cj.chengjian@huawei.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Acked-by: John Johansen <john.johansen@canonical.com> Acked-by: James Morris <james.morris@microsoft.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22smb3: display session id in debug dataSteve French
commit b63a9de02d64ecd5ff0749e90253f5b30ba5b9c0 upstream. Displaying the session id in /proc/fs/cifs/DebugData is needed in order to correlate Linux client information with network and server traces for many common support scenarios. Turned out to be very important for debugging. Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22jbd2: fix potential double freeChengguang Xu
commit 0d52154bb0a700abb459a2cbce0a30fc2549b67e upstream. When failing from creating cache jbd2_inode_cache, we will destroy the previously created cache jbd2_handle_cache twice. This patch fixes this by moving each cache initialization/destruction to its own separate, individual function. Signed-off-by: Chengguang Xu <cgxu519@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22ext4: avoid panic during forced reboot due to aborted journalJan Kara
commit 2c1d0e3631e5732dba98ef49ac0bec1388776793 upstream. Handling of aborted journal is a special code path different from standard ext4_error() one and it can call panic() as well. Commit 1dc1097ff60e ("ext4: avoid panic during forced reboot") forgot to update this path so fix that omission. Fixes: 1dc1097ff60e ("ext4: avoid panic during forced reboot") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org # 5.1 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22ext4: fix use-after-free in dx_release()Sahitya Tummala
commit 08fc98a4d6424af66eb3ac4e2cedd2fc927ed436 upstream. The buffer_head (frames[0].bh) and its corresponding page can potentially be freed once brelse() is done inside the for loop, but before the for loop exits in dx_release(). It can be freed in another context, when the page cache is flushed via drop_caches_sysctl_handler(). This results in the data abort below when accessing info->indirect_levels in dx_release(). Unable to handle kernel paging request at virtual address ffffffc17ac3e01e Call trace: dx_release+0x70/0x90 ext4_htree_fill_tree+0x2d4/0x300 ext4_readdir+0x244/0x6f8 iterate_dir+0xbc/0x160 SyS_getdents64+0x94/0x174 Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22ext4: fix data corruption caused by overlapping unaligned and aligned IOLukas Czerner
commit 57a0da28ced8707cb9f79f071a016b9d005caf5a upstream. Unaligned AIO must be serialized because the zeroing of partial blocks of unaligned AIO can result in data corruption in case it overlaps another in-flight IO. Currently we wait for all unwritten extents before we submit an unaligned AIO, which protects data in case the unaligned AIO follows an overlapping IO. However, if an unaligned AIO is followed by an overlapping aligned AIO, we can still end up corrupting data. To fix this, we must make sure that the unaligned AIO is the only IO in flight by waiting for unwritten extent conversion not just before the IO submission, but right after it as well. This problem can be reproduced by xfstest generic/538. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22ext4: zero out the unused memory region in the extent tree blockSriram Rajagopalan
commit 592acbf16821288ecdc4192c47e3774a4c48bb64 upstream. This commit zeroes out the unused memory region in the buffer_head corresponding to the extent metablock after writing the extent header and the corresponding extent node entries. This is done to prevent random uninitialized data from getting into the filesystem when the extent block is synced. This fixes CVE-2019-11833. Signed-off-by: Sriram Rajagopalan <sriramr@arista.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
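The core of the fix is a memset of the tail of the newly written extent block; a hedged sketch of that step as it would appear in the extent split/grow paths (variable names such as bh, neh and inode are assumed from fs/ext4/extents.c):

        /*
         * After writing the header and extent entries into the new block (bh),
         * clear the remainder so no stale kernel memory is written to disk.
         */
        ext_size = sizeof(struct ext4_extent_header) +
                   sizeof(struct ext4_extent) * le16_to_cpu(neh->eh_entries);
        memset(bh->b_data + ext_size, 0, inode->i_sb->s_blocksize - ext_size);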
2019-05-22fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umountJiufei Xue
commit ec084de929e419e51bcdafaafe567d9e7d0273b7 upstream. synchronize_rcu() doesn't wait for call_rcu() callbacks, so an inode wb switch may not have reached the workqueue by the time synchronize_rcu() returns. Thus previously scheduled switches were not finished even after flushing the workqueue, which causes the NULL pointer dereference shown below. VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds. Have a nice day... BUG: unable to handle kernel NULL pointer dereference at 0000000000000278 evict+0xb3/0x180 iput+0x1b0/0x230 inode_switch_wbs_work_fn+0x3c0/0x6a0 worker_thread+0x4e/0x490 ? process_one_work+0x410/0x410 kthread+0xe6/0x100 ret_from_fork+0x39/0x50 Replace the synchronize_rcu() call with a rcu_barrier() to wait for all pending callbacks to finish. Also increment isw_nr_in_flight after call_rcu() in inode_switch_wbs() so the accounting makes more sense. Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.com Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Acked-by: Tejun Heo <tj@kernel.org> Suggested-by: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
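The key distinction is that synchronize_rcu() only waits for RCU readers, while rcu_barrier() also waits for previously queued call_rcu() callbacks to run; a hedged sketch of the umount-time path after the change (simplified from fs/fs-writeback.c):

        /* Called during umount: make sure no inode wb switch is still in flight. */
        void cgroup_writeback_umount(void)
        {
                if (atomic_read(&isw_nr_in_flight)) {
                        /*
                         * Wait for the call_rcu() callbacks that queue the switch
                         * work items, then drain the workqueue itself.
                         */
                        rcu_barrier();
                        flush_workqueue(isw_wq);
                }
        }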
2019-05-22Btrfs: fix race between send and deduplication that lead to failures and crashesFilipe Manana
commit 62d54f3a7fa27ef6a74d6cdf643ce04beba3afa7 upstream. Send operates on read only trees and expects them to never change while it is using them. This is part of its initial design, and this expectation is due to two different reasons: 1) When it was introduced, no operations were allowed to modify read-only subvolumes/snapshots (including defrag for example). 2) It keeps send from having an impact on other filesystem operations. Namely send does not need to keep locks on the trees nor needs to hold on to transaction handles and delay transaction commits. This ends up being a consequence of the former reason. However the deduplication feature was introduced later (on September 2013, while send was introduced in July 2012) and it allowed for deduplication with destination files that belong to read-only trees (subvolumes and snapshots). That means that having a send operation (either full or incremental) running in parallel with a deduplication that has the destination inode in one of the trees used by the send operation can result in tree nodes and leaves getting freed and reused while send is using them. This problem is similar to the problem solved for the root nodes getting freed and reused when a snapshot is made against one tree that is currently being used by a send operation, fixed in commits [1] and [2]. These commits explain in detail how the problem happens and the explanation is valid for any node or leaf that is not the root of a tree as well. This problem was also discussed and explained recently in a thread [3]. The problem is very easy to reproduce when using send with large trees (snapshots) and just a few concurrent deduplication operations that target files in the trees used by send. A stress test case is being sent for fstests that triggers the issue easily. The most common error to hit is the send ioctl returning -EIO with the following messages in dmesg/syslog: [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400 [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27 The first one is very easy to hit while the second one happens much less frequently, except for very large trees (in that test case, snapshots with 100000 files having large xattrs to get deep and wide trees). Less frequently, at least one BUG_ON can be hit: [1631742.130080] ------------[ cut here ]------------ [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806! [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1 [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...)
[1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246 [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002 [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000 [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708 [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000 [1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0 [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1631742.141272] Call Trace: [1631742.141826] ? do_raw_spin_unlock+0x49/0xc0 [1631742.142390] tree_advance+0x173/0x1d0 [btrfs] [1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs] [1631742.143533] ? process_extent+0x1070/0x1070 [btrfs] [1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs] [1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0 [1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs] [1631742.146179] ? account_entity_enqueue+0xd3/0x100 [1631742.146662] ? reweight_entity+0x154/0x1a0 [1631742.147135] ? update_curr+0x20/0x2a0 [1631742.147593] ? check_preempt_wakeup+0x103/0x250 [1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0 [1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [1631742.148942] do_vfs_ioctl+0xa2/0x6f0 [1631742.149361] ? __fget+0x113/0x200 [1631742.149767] ksys_ioctl+0x70/0x80 [1631742.150159] __x64_sys_ioctl+0x16/0x20 [1631742.150543] do_syscall_64+0x60/0x1b0 [1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe [1631742.151326] RIP: 0033:0x7f4cd9f5add7 (...) [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7 [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007 [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700 [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003 [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001 (...) [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]--- That BUG_ON happens because while send is using a node, that node is COWed by a concurrent deduplication, gets freed and gets reused as a leaf (because a transaction commit happened in between), so when it attempts to read a slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer contents were wiped out and it now matches a leaf (which can even belong to some other tree now), hitting the BUG_ON(level == 0). Fix this concurrency issue by not allowing send and deduplication to run in parallel if both operate on the same readonly trees, returning EAGAIN to user space and logging an explicit warning in dmesg/syslog. 
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2 [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
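The fix is mutual-exclusion bookkeeping on the root: dedupe refuses to start while the tree is used by a send, and send refuses to start while a dedupe targets the tree. A hedged sketch of the dedupe-side check (field names follow the upstream fix, locking context and error handling trimmed):

        /* Dedupe side: do not touch a read-only tree that send is currently using. */
        spin_lock(&root->root_item_lock);
        if (root->send_in_progress) {
                btrfs_warn_rl(root->fs_info,
                              "cannot deduplicate to root %llu while send operations are using it (%d in progress)",
                              root->root_key.objectid, root->send_in_progress);
                spin_unlock(&root->root_item_lock);
                return -EAGAIN;
        }
        root->dedupe_in_progress++;
        spin_unlock(&root->root_item_lock);

        /* ... perform the dedupe, then drop the counter under the same lock ... */
        /* The send side does the symmetric check against root->dedupe_in_progress. */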
2019-05-22Btrfs: do not start a transaction at iterate_extent_inodes()Filipe Manana
commit bfc61c36260ca990937539cd648ede3cd749bc10 upstream. When finding out which inodes have references on a particular extent, done by backref.c:iterate_extent_inodes(), from the BTRFS_IOC_LOGICAL_INO (both v1 and v2) ioctl and from scrub, we use the transaction join API to grab a reference on the currently running transaction, since in order to give accurate results we need to inspect the delayed references of the currently running transaction. However, if there is currently no running transaction, the join operation will create a new transaction. This is inefficient as the transaction will eventually be committed, doing unnecessary IO and introducing a potential point of failure that will lead to a transaction abort due to -ENOSPC, as recently reported [1]. That's because the join creates the transaction but does not reserve any space, so when attempting to update the root item of the root passed to btrfs_join_transaction(), during the transaction commit, we can end up failing with -ENOSPC. Users of a join operation are supposed to actually do some filesystem changes and reserve space by some means, which is not the case for iterate_extent_inodes(); it is a read-only operation in all contexts from which it is called. The reported [1] -ENOSPC failure stack trace is the following: heisenberg kernel: ------------[ cut here ]------------ heisenberg kernel: BTRFS: Transaction aborted (error -28) heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs] (...) heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018 heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs] (...) heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286 heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006 heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0 heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007 heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078 heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068 heisenberg kernel: FS: 0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000 heisenberg kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0 heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 heisenberg kernel: Call Trace: heisenberg kernel: commit_fs_roots+0x166/0x1d0 [btrfs] heisenberg kernel: ? _cond_resched+0x15/0x30 heisenberg kernel: ? btrfs_run_delayed_refs+0xac/0x180 [btrfs] heisenberg kernel: btrfs_commit_transaction+0x2bd/0x870 [btrfs] heisenberg kernel: ? start_transaction+0x9d/0x3f0 [btrfs] heisenberg kernel: transaction_kthread+0x147/0x180 [btrfs] heisenberg kernel: ? btrfs_cleanup_transaction+0x530/0x530 [btrfs] heisenberg kernel: kthread+0x112/0x130 heisenberg kernel: ? kthread_bind+0x30/0x30 heisenberg kernel: ret_from_fork+0x35/0x40 heisenberg kernel: ---[ end trace 05de912e30e012d9 ]--- So fix that by using the attach API, which does not create a transaction when there is currently no running transaction. 
[1] https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/ Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
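A hedged sketch of the resulting pattern in iterate_extent_inodes(): attach to a running transaction if there is one, and simply proceed without a handle when there is not, instead of forcing a new transaction into existence via a join (variables such as trans, fs_info and tree_mod_seq_elem are as used in backref.c; the real code also falls back to the commit roots):

        trans = btrfs_attach_transaction(fs_info->extent_root);
        if (IS_ERR(trans)) {
                if (PTR_ERR(trans) != -ENOENT && PTR_ERR(trans) != -EROFS)
                        return PTR_ERR(trans);
                trans = NULL;   /* nothing running - fine for a read-only walk */
        }

        if (trans)
                btrfs_get_tree_mod_seq(fs_info, &tree_mod_seq_elem);

        /* ... resolve indirect refs and iterate the inodes referencing the extent ... */

        if (trans) {
                btrfs_put_tree_mod_seq(fs_info, &tree_mod_seq_elem);
                btrfs_end_transaction(trans);
        }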
2019-05-22Btrfs: do not start a transaction during fiemapFilipe Manana
commit 03628cdbc64db6262e50d0357960a4e9562676a1 upstream. During fiemap, for regular extents (non inline) we need to check if they are shared and if they are, set the shared bit. Checking if an extent is shared requires checking the delayed references of the currently running transaction, since some reference might have not yet hit the extent tree and be only in the in-memory delayed references. However we were using a transaction join for this, which creates a new transaction when there is no transaction currently running. That means that two more potential failures can happen: creating the transaction and committing it. Further, if no write activity is currently happening in the system, and fiemap calls keep being done, we end up creating and committing transactions that do nothing. In some extreme cases this can result in the commit of the transaction created by fiemap failing with -ENOSPC when updating the root item of a subvolume tree, because a join does not reserve any space, leading to a trace like the following: heisenberg kernel: ------------[ cut here ]------------ heisenberg kernel: BTRFS: Transaction aborted (error -28) heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs] (...) heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018 heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs] (...) heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286 heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006 heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0 heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007 heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078 heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068 heisenberg kernel: FS: 0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000 heisenberg kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0 heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 heisenberg kernel: Call Trace: heisenberg kernel: commit_fs_roots+0x166/0x1d0 [btrfs] heisenberg kernel: ? _cond_resched+0x15/0x30 heisenberg kernel: ? btrfs_run_delayed_refs+0xac/0x180 [btrfs] heisenberg kernel: btrfs_commit_transaction+0x2bd/0x870 [btrfs] heisenberg kernel: ? start_transaction+0x9d/0x3f0 [btrfs] heisenberg kernel: transaction_kthread+0x147/0x180 [btrfs] heisenberg kernel: ? btrfs_cleanup_transaction+0x530/0x530 [btrfs] heisenberg kernel: kthread+0x112/0x130 heisenberg kernel: ? kthread_bind+0x30/0x30 heisenberg kernel: ret_from_fork+0x35/0x40 heisenberg kernel: ---[ end trace 05de912e30e012d9 ]--- Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do a transaction join to avoid the overhead of creating a new transaction (if there is currently no running transaction) and introducing a potential point of failure when the new transaction gets committed; instead, use a transaction attach to grab a handle for the currently running transaction if any. 
Reported-by: Christoph Anton Mitterer <calestyo@scientia.net> Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/ Fixes: afce772e87c36c ("btrfs: fix check_shared for fiemap ioctl") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-22Btrfs: send, flush dellaloc in order to avoid data lossFilipe Manana
commit 9f89d5de8631c7930898a601b6612e271aa2261c upstream. When we set a subvolume to read-only mode we do not flush delalloc for any of its inodes (except if the filesystem is mounted with -o flushoncommit), since it does not affect correctness for any subsequent operations - except for a future send operation. The send operation will not be able to see the delalloc data since the respective file extent items, inode item updates, backreferences, etc, have not yet hit the subvolume and extent trees. Effectively this means data loss, since the send stream will not contain any data from existing delalloc. Another problem from this is that if the writeback starts and finishes while the send operation is in progress, we have the subvolume tree being modified concurrently which can result in send failing unexpectedly with EIO or hitting runtime errors, assertion failures or hitting BUG_ONs, etc. Simple reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ btrfs subvolume create /mnt/sv $ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo $ btrfs property set /mnt/sv ro true $ btrfs send -f /tmp/send.stream /mnt/sv $ od -t x1 -A d /mnt/sv/foo 0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea * 0110592 $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive -f /tmp/send.stream /mnt $ echo $? 0 $ od -t x1 -A d /mnt/sv/foo 0000000 # ---> empty file Since this is a problem that affects only send, fix it in send by flushing delalloc for all the roots used by the send operation before send starts to process the commit roots. This is a problem that affects send since it was introduced (commit 31db9f7c23fbf7 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive")) but backporting it to older kernels has some dependencies: - For kernels between 3.19 and 4.20, it depends on commit 3cd24c698004d2 ("btrfs: use tagged writepage to mitigate livelock of snapshot") because the function btrfs_start_delalloc_snapshot() does not exist before that commit. So one has to either pick that commit or replace the calls to btrfs_start_delalloc_snapshot() in this patch with calls to btrfs_start_delalloc_inodes(). - For kernels older than 3.19 it also requires commit e5fa8f865b3324 ("Btrfs: ensure send always works on roots without orphans") because it depends on the function ensure_commit_roots_uptodate() which that commit introduced. - No dependencies for 5.0+ kernels. A test case for fstests follows soon. CC: stable@vger.kernel.org # 3.19+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
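A hedged sketch of the shape of the fix: before send starts walking the commit roots, start and wait for delalloc/ordered IO on each root used by the operation (the helper name and the exact wait arguments are assumptions; in send.c the clone_roots array also covers the send root itself):

        static int flush_delalloc_roots(struct send_ctx *sctx)
        {
                struct btrfs_root *root = sctx->parent_root;
                int i, ret;

                if (root) {
                        ret = btrfs_start_delalloc_snapshot(root);
                        if (ret)
                                return ret;
                        btrfs_wait_ordered_extents(root, U64_MAX, 0, U64_MAX);
                }

                for (i = 0; i < sctx->clone_roots_cnt; i++) {
                        root = sctx->clone_roots[i].root;
                        ret = btrfs_start_delalloc_snapshot(root);
                        if (ret)
                                return ret;
                        btrfs_wait_ordered_extents(root, U64_MAX, 0, U64_MAX);
                }
                return 0;
        }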
2019-05-22btrfs: Honour FITRIM range constraints during free space trimNikolay Borisov
commit c2d1b3aae33605a61cbab445d8ae1c708ccd2698 upstream. Up until now, trimming the free space was done irrespective of what the arguments of the FITRIM ioctl were. For example fstrim's -o/-l arguments would be entirely ignored. Fix it by correctly handling those parameters. This requires breaking out of the loop if the found free space extent is after the end of the passed range, as well as completing the trim after trimming fstrim_range::len bytes. Fixes: 499f377f49f0 ("btrfs: iterate over unused chunk space in FITRIM") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
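A hedged sketch of the loop shape after the fix, with illustrative helpers (find_free_extent_in_range() and trim_one_extent() are not real btrfs functions): clamp the walk to [range->start, range->start + range->len), stop when free space begins past the end of the range, and report back only the bytes actually trimmed:

        u64 start = range->start;
        u64 end = range->start + range->len;
        u64 trimmed = 0;

        while (start < end) {
                u64 extent_start, extent_len, bytes;

                /* illustrative: look up the next free extent at or after 'start' */
                if (find_free_extent_in_range(block_group, start, &extent_start, &extent_len))
                        break;
                if (extent_start >= end)
                        break;  /* free space lies entirely past the requested range */

                /* trim only the overlap between the free extent and the range */
                bytes = min(extent_start + extent_len, end) - max(extent_start, start);
                trim_one_extent(block_group, max(extent_start, start), bytes);  /* illustrative */
                trimmed += bytes;
                start = max(extent_start, start) + bytes;
        }
        range->len = trimmed;   /* FITRIM reports back the amount actually trimmed */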