summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2018-09-05btrfs: don't leak ret from do_chunk_allocJosef Bacik
commit 4559b0a71749c442d34f7cfb9e72c9e58db83948 upstream. If we're trying to make a data reservation and we have to allocate a data chunk we could leak ret == 1, as do_chunk_alloc() will return 1 if it allocated a chunk. Since the end of the function is the success path just return 0. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05cachefiles: Wait rather than BUG'ing on "Unexpected object collision"Kiran Kumar Modukuri
[ Upstream commit c2412ac45a8f8f1cd582723c1a139608694d410d ] If we meet a conflicting object that is marked FSCACHE_OBJECT_IS_LIVE in the active object tree, we have been emitting a BUG after logging information about it and the new object. Instead, we should wait for the CACHEFILES_OBJECT_ACTIVE flag to be cleared on the old object (or return an error). The ACTIVE flag should be cleared after it has been removed from the active object tree. A timeout of 60s is used in the wait, so we shouldn't be able to get stuck there. Fixes: 9ae326a69004 ("CacheFiles: A cache that backs onto a mounted filesystem") Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05cachefiles: Fix refcounting bug in backing-file read monitoringKiran Kumar Modukuri
[ Upstream commit 934140ab028713a61de8bca58c05332416d037d1 ] cachefiles_read_waiter() has the right to access a 'monitor' object by virtue of being called under the waitqueue lock for one of the pages in its purview. However, it has no ref on that monitor object or on the associated operation. What it is allowed to do is to move the monitor object to the operation's to_do list, but once it drops the work_lock, it's actually no longer permitted to access that object. However, it is trying to enqueue the retrieval operation for processing - but it can only do this via a pointer in the monitor object, something it shouldn't be doing. If it doesn't enqueue the operation, the operation may not get processed. If the order is flipped so that the enqueue is first, then it's possible for the work processor to look at the to_do list before the monitor is enqueued upon it. Fix this by getting a ref on the operation so that we can trust that it will still be there once we've added the monitor to the to_do list and dropped the work_lock. The op can then be enqueued after the lock is dropped. The bug can manifest in one of a couple of ways. The first manifestation looks like: FS-Cache: FS-Cache: Assertion failed FS-Cache: 6 == 5 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/operation.c:494! RIP: 0010:fscache_put_operation+0x1e3/0x1f0 ... fscache_op_work_func+0x26/0x50 process_one_work+0x131/0x290 worker_thread+0x45/0x360 kthread+0xf8/0x130 ? create_worker+0x190/0x190 ? kthread_cancel_work_sync+0x10/0x10 ret_from_fork+0x1f/0x30 This is due to the operation being in the DEAD state (6) rather than INITIALISED, COMPLETE or CANCELLED (5) because it's already passed through fscache_put_operation(). The bug can also manifest like the following: kernel BUG at fs/fscache/operation.c:69! ... [exception RIP: fscache_enqueue_operation+246] ... #7 [ffff883fff083c10] fscache_enqueue_operation at ffffffffa0b793c6 #8 [ffff883fff083c28] cachefiles_read_waiter at ffffffffa0b15a48 #9 [ffff883fff083c48] __wake_up_common at ffffffff810af028 I'm not entirely certain as to which is line 69 in Lei's kernel, so I'm not entirely clear which assertion failed. Fixes: 9ae326a69004 ("CacheFiles: A cache that backs onto a mounted filesystem") Reported-by: Lei Xue <carmark.dlut@gmail.com> Reported-by: Vegard Nossum <vegard.nossum@gmail.com> Reported-by: Anthony DeRobertis <aderobertis@metrics.net> Reported-by: NeilBrown <neilb@suse.com> Reported-by: Daniel Axtens <dja@axtens.net> Reported-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-05fscache: Allow cancelled operations to be enqueuedKiran Kumar Modukuri
[ Upstream commit d0eb06afe712b7b103b6361f40a9a0c638524669 ] Alter the state-check assertion in fscache_enqueue_operation() to allow cancelled operations to be given processing time so they can be cleaned up. Also fix a debugging statement that was requiring such operations to have an object assigned. Fixes: 9ae326a69004 ("CacheFiles: A cache that backs onto a mounted filesystem") Reported-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-28reiserfs: fix broken xattr handling (heap corruption, bad retval)Jann Horn
commit a13f085d111e90469faf2d9965eb39b11c114d7e upstream. This fixes the following issues: - When a buffer size is supplied to reiserfs_listxattr() such that each individual name fits, but the concatenation of all names doesn't fit, reiserfs_listxattr() overflows the supplied buffer. This leads to a kernel heap overflow (verified using KASAN) followed by an out-of-bounds usercopy and is therefore a security bug. - When a buffer size is supplied to reiserfs_listxattr() such that a name doesn't fit, -ERANGE should be returned. But reiserfs instead just truncates the list of names; I have verified that if the only xattr on a file has a longer name than the supplied buffer length, listxattr() incorrectly returns zero. With my patch applied, -ERANGE is returned in both cases and the memory corruption doesn't happen anymore. Credit for making me clean this code up a bit goes to Al Viro, who pointed out that the ->actor calling convention is suboptimal and should be changed. Link: http://lkml.kernel.org/r/20180802151539.5373-1-jannh@google.com Fixes: 48b32a3553a5 ("reiserfs: use generic xattr handlers") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-17fix __legitimize_mnt()/mntput() raceAl Viro
commit 119e1ef80ecfe0d1deb6378d4ab41f5b71519de1 upstream. __legitimize_mnt() has two problems - one is that in case of success the check of mount_lock is not ordered wrt preceding increment of refcount, making it possible to have successful __legitimize_mnt() on one CPU just before the otherwise final mntpu() on another, with __legitimize_mnt() not seeing mntput() taking the lock and mntput() not seeing the increment done by __legitimize_mnt(). Solved by a pair of barriers. Another is that failure of __legitimize_mnt() on the second read_seqretry() leaves us with reference that'll need to be dropped by caller; however, if that races with final mntput() we can end up with caller dropping rcu_read_lock() and doing mntput() to release that reference - with the first mntput() having freed the damn thing just as rcu_read_lock() had been dropped. Solution: in "do mntput() yourself" failure case grab mount_lock, check if MNT_DOOMED has been set by racing final mntput() that has missed our increment and if it has - undo the increment and treat that as "failure, caller doesn't need to drop anything" case. It's not easy to hit - the final mntput() has to come right after the first read_seqretry() in __legitimize_mnt() *and* manage to miss the increment done by __legitimize_mnt() before the second read_seqretry() in there. The things that are almost impossible to hit on bare hardware are not impossible on SMP KVM, though... Reported-by: Oleg Nesterov <oleg@redhat.com> Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-17fix mntput/mntput raceAl Viro
commit 9ea0a46ca2c318fcc449c1e6b62a7230a17888f1 upstream. mntput_no_expire() does the calculation of total refcount under mount_lock; unfortunately, the decrement (as well as all increments) are done outside of it, leading to false positives in the "are we dropping the last reference" test. Consider the following situation: * mnt is a lazy-umounted mount, kept alive by two opened files. One of those files gets closed. Total refcount of mnt is 2. On CPU 42 mntput(mnt) (called from __fput()) drops one reference, decrementing component * After it has looked at component #0, the process on CPU 0 does mntget(), incrementing component #0, gets preempted and gets to run again - on CPU 69. There it does mntput(), which drops the reference (component #69) and proceeds to spin on mount_lock. * On CPU 42 our first mntput() finishes counting. It observes the decrement of component #69, but not the increment of component #0. As the result, the total it gets is not 1 as it should've been - it's 0. At which point we decide that vfsmount needs to be killed and proceed to free it and shut the filesystem down. However, there's still another opened file on that filesystem, with reference to (now freed) vfsmount, etc. and we are screwed. It's not a wide race, but it can be reproduced with artificial slowdown of the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups. Fix consists of moving the refcount decrement under mount_lock; the tricky part is that we want (and can) keep the fast case (i.e. mount that still has non-NULL ->mnt_ns) entirely out of mount_lock. All places that zero mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu() before that mntput(). IOW, if mntput() observes (under rcu_read_lock()) a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to be dropped. Reported-by: Jann Horn <jannh@google.com> Tested-by: Jann Horn <jannh@google.com> Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-17root dentries need RCU-delayed freeingAl Viro
commit 90bad5e05bcdb0308cfa3d3a60f5c0b9c8e2efb3 upstream. Since mountpoint crossing can happen without leaving lazy mode, root dentries do need the same protection against having their memory freed without RCU delay as everything else in the tree. It's partially hidden by RCU delay between detaching from the mount tree and dropping the vfsmount reference, but the starting point of pathwalk can be on an already detached mount, in which case umount-caused RCU delay has already passed by the time the lazy pathwalk grabs rcu_read_lock(). If the starting point happens to be at the root of that vfsmount *and* that vfsmount covers the entire filesystem, we get trouble. Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-09jfs: Fix inconsistency between memory allocation and ea_buf->max_sizeShankara Pailoor
commit 92d34134193e5b129dc24f8d79cb9196626e8d7a upstream. The code is assuming the buffer is max_size length, but we weren't allocating enough space for it. Signed-off-by: Shankara Pailoor <shankarapailoor@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-09squashfs: more metadata hardeningsLinus Torvalds
commit 71755ee5350b63fb1f283de8561cdb61b47f4d1d upstream. The squashfs fragment reading code doesn't actually verify that the fragment is inside the fragment table. The end result _is_ verified to be inside the image when actually reading the fragment data, but before that is done, we may end up taking a page fault because the fragment table itself might not even exist. Another report from Anatoly and his endless squashfs image fuzzing. Reported-by: Анатолий Тросиненко <anatoly.trosinenko@gmail.com> Acked-by:: Phillip Lougher <phillip.lougher@gmail.com>, Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-09squashfs: more metadata hardeningLinus Torvalds
commit d512584780d3e6a7cacb2f482834849453d444a1 upstream. Anatoly reports another squashfs fuzzing issue, where the decompression parameters themselves are in a compressed block. This causes squashfs_read_data() to be called in order to read the decompression options before the decompression stream having been set up, making squashfs go sideways. Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Acked-by: Phillip Lougher <phillip.lougher@gmail.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-09squashfs: be more careful about metadata corruptionLinus Torvalds
commit 01cfb7937a9af2abb1136c7e89fbf3fd92952956 upstream. Anatoly Trosinenko reports that a corrupted squashfs image can cause a kernel oops. It turns out that squashfs can end up being confused about negative fragment lengths. The regular squashfs_read_data() does check for negative lengths, but squashfs_read_metadata() did not, and the fragment size code just blindly trusted the on-disk value. Fix both the fragment parsing and the metadata reading code. Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Phillip Lougher <phillip@squashfs.org.uk> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-28fat: fix memory allocation failure handling of match_strdup()OGAWA Hirofumi
commit 35033ab988c396ad7bce3b6d24060c16a9066db8 upstream. In parse_options(), if match_strdup() failed, parse_options() leaves opts->iocharset in unexpected state (i.e. still pointing the freed string). And this can be the cause of double free. To fix, this initialize opts->iocharset always when freeing. Link: http://lkml.kernel.org/r/8736wp9dzc.fsf@mail.parknet.co.jp Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reported-by: syzbot+90b8e10515ae88228a92@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-22Fix up non-directory creation in SGID directoriesLinus Torvalds
commit 0fa3ecd87848c9c93c2c828ef4c3a8ca36ce46c7 upstream. sgid directories have special semantics, making newly created files in the directory belong to the group of the directory, and newly created subdirectories will also become sgid. This is historically used for group-shared directories. But group directories writable by non-group members should not imply that such non-group members can magically join the group, so make sure to clear the sgid bit on non-directories for non-members (but remember that sgid without group execute means "mandatory locking", just to confuse things even more). Reported-by: Jann Horn <jannh@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: add more mount time checks of the superblockTheodore Ts'o
commit bfe0a5f47ada40d7984de67e59a7d3390b9b9ecc upstream. The kernel's ext4 mount-time checks were more permissive than e2fsprogs's libext2fs checks when opening a file system. The superblock is considered too insane for debugfs or e2fsck to operate on it, the kernel has no business trying to mount it. This will make file system fuzzing tools work harder, but the failure cases that they find will be more useful and be easier to evaluate. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: clear i_data in ext4_inode_info when removing inline dataTheodore Ts'o
commit 6e8ab72a812396996035a37e5ca4b3b99b5d214b upstream. When converting from an inode from storing the data in-line to a data block, ext4_destroy_inline_data_nolock() was only clearing the on-disk copy of the i_blocks[] array. It was not clearing copy of the i_blocks[] in ext4_inode_info, in i_data[], which is the copy actually used by ext4_map_blocks(). This didn't matter much if we are using extents, since the extents header would be invalid and thus the extents could would re-initialize the extents tree. But if we are using indirect blocks, the previous contents of the i_blocks array will be treated as block numbers, with potentially catastrophic results to the file system integrity and/or user data. This gets worse if the file system is using a 1k block size and s_first_data is zero, but even without this, the file system can get quite badly corrupted. This addresses CVE-2018-10881. https://bugzilla.kernel.org/show_bug.cgi?id=200015 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11ext4: make sure bitmaps and the inode table don't overlap with bg descriptorsTheodore Ts'o
commit 77260807d1170a8cf35dbb06e07461a655f67eee upstream. It's really bad when the allocation bitmaps and the inode table overlap with the block group descriptors, since it causes random corruption of the bg descriptors. So we really want to head those off at the pass. https://bugzilla.kernel.org/show_bug.cgi?id=199865 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-11cifs: Fix infinite loop when using hard mount optionPaulo Alcantara
commit 7ffbe65578b44fafdef577a360eb0583929f7c6e upstream. For every request we send, whether it is SMB1 or SMB2+, we attempt to reconnect tcon (cifs_reconnect_tcon or smb2_reconnect) before carrying out the request. So, while server->tcpStatus != CifsNeedReconnect, we wait for the reconnection to succeed on wait_event_interruptible_timeout(). If it returns, that means that either the condition was evaluated to true, or timeout elapsed, or it was interrupted by a signal. Since we're not handling the case where the process woke up due to a received signal (-ERESTARTSYS), the next call to wait_event_interruptible_timeout() will _always_ fail and we end up looping forever inside either cifs_reconnect_tcon() or smb2_reconnect(). Here's an example of how to trigger that: $ mount.cifs //foo/share /mnt/test -o username=foo,password=foo,vers=1.0,hard (break connection to server before executing bellow cmd) $ stat -f /mnt/test & sleep 140 [1] 2511 $ ps -aux -q 2511 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 2511 0.0 0.0 12892 1008 pts/0 S 12:24 0:00 stat -f /mnt/test $ kill -9 2511 (wait for a while; process is stuck in the kernel) $ ps -aux -q 2511 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 2511 83.2 0.0 12892 1008 pts/0 R 12:24 30:01 stat -f /mnt/test By using 'hard' mount point means that cifs.ko will keep retrying indefinitely, however we must allow the process to be killed otherwise it would hang the system. Signed-off-by: Paulo Alcantara <palcantara@suse.de> Cc: stable@vger.kernel.org Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03udf: Detect incorrect directory sizeJan Kara
commit fa65653e575fbd958bdf5fb9c4a71a324e39510d upstream. Detect when a directory entry is (possibly partially) beyond directory size and return EIO in that case since it means the filesystem is corrupted. Otherwise directory operations can further corrupt the directory and possibly also oops the kernel. CC: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> CC: stable@vger.kernel.org Reported-and-tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03nfsd: restrict rd_maxcount to svc_max_payload in nfsd_encode_readdirScott Mayhew
commit 9c2ece6ef67e9d376f32823086169b489c422ed0 upstream. nfsd4_readdir_rsize restricts rd_maxcount to svc_max_payload when estimating the size of the readdir reply, but nfsd_encode_readdir restricts it to INT_MAX when encoding the reply. This can result in log messages like "kernel: RPC request reserved 32896 but used 1049444". Restrict rd_dircount similarly (no reason it should be larger than svc_max_payload). Signed-off-by: Scott Mayhew <smayhew@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03UBIFS: Fix potential integer overflow in allocationSilvio Cesare
commit 353748a359f1821ee934afc579cf04572406b420 upstream. There is potential for the size and len fields in ubifs_data_node to be too large causing either a negative value for the length fields or an integer overflow leading to an incorrect memory allocation. Likewise, when the len field is small, an integer underflow may occur. Signed-off-by: Silvio Cesare <silvio.cesare@gmail.com> Fixes: 1e51764a3c2ac ("UBIFS: add new flash file system") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03fuse: don't keep dead fuse_conn at fuse_fill_super().Tetsuo Handa
commit 543b8f8662fe6d21f19958b666ab0051af9db21a upstream. syzbot is reporting use-after-free at fuse_kill_sb_blk() [1]. Since sb->s_fs_info field is not cleared after fc was released by fuse_conn_put() when initialization failed, fuse_kill_sb_blk() finds already released fc and tries to hold the lock. Fix this by clearing sb->s_fs_info field after calling fuse_conn_put(). [1] https://syzkaller.appspot.com/bug?id=a07a680ed0a9290585ca424546860464dd9658db Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+ec3986119086fe4eec97@syzkaller.appspotmail.com> Fixes: 3b463ae0c626 ("fuse: invalidation reverse calls") Cc: John Muir <john@jmuir.com> Cc: Csaba Henk <csaba@gluster.com> Cc: Anand Avati <avati@redhat.com> Cc: <stable@vger.kernel.org> # v2.6.31 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03fuse: atomic_o_trunc should truncate pagecacheMiklos Szeredi
commit df0e91d488276086bc07da2e389986cae0048c37 upstream. Fuse has an "atomic_o_trunc" mode, where userspace filesystem uses the O_TRUNC flag in the OPEN request to truncate the file atomically with the open. In this mode there's no need to send a SETATTR request to userspace after the open, so fuse_do_setattr() checks this mode and returns. But this misses the important step of truncating the pagecache. Add the missing parts of truncation to the ATTR_OPEN branch. Reported-by: Chad Austin <chadaustin@fb.com> Fixes: 6ff958edbf39 ("fuse: add atomic open+truncate support") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03fs/binfmt_misc.c: do not allow offset overflowThadeu Lima de Souza Cascardo
commit 5cc41e099504b77014358b58567c5ea6293dd220 upstream. WHen registering a new binfmt_misc handler, it is possible to overflow the offset to get a negative value, which might crash the system, or possibly leak kernel data. Here is a crash log when 2500000000 was used as an offset: BUG: unable to handle kernel paging request at ffff989cfd6edca0 IP: load_misc_binary+0x22b/0x470 [binfmt_misc] PGD 1ef3e067 P4D 1ef3e067 PUD 0 Oops: 0000 [#1] SMP NOPTI Modules linked in: binfmt_misc kvm_intel ppdev kvm irqbypass joydev input_leds serio_raw mac_hid parport_pc qemu_fw_cfg parpy CPU: 0 PID: 2499 Comm: bash Not tainted 4.15.0-22-generic #24-Ubuntu Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014 RIP: 0010:load_misc_binary+0x22b/0x470 [binfmt_misc] Call Trace: search_binary_handler+0x97/0x1d0 do_execveat_common.isra.34+0x667/0x810 SyS_execve+0x31/0x40 do_syscall_64+0x73/0x130 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Use kstrtoint instead of simple_strtoul. It will work as the code already set the delimiter byte to '\0' and we only do it when the field is not empty. Tested with offsets -1, 2500000000, UINT_MAX and INT_MAX. Also tested with examples documented at Documentation/admin-guide/binfmt-misc.rst and other registrations from packages on Ubuntu. Link: http://lkml.kernel.org/r/20180529135648.14254-1-cascardo@canonical.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03btrfs: scrub: Don't use inode pages for device replaceQu Wenruo
commit ac0b4145d662a3b9e34085dea460fb06ede9b69b upstream. [BUG] Btrfs can create compressed extent without checksum (even though it shouldn't), and if we then try to replace device containing such extent, the result device will contain all the uncompressed data instead of the compressed one. Test case already submitted to fstests: https://patchwork.kernel.org/patch/10442353/ [CAUSE] When handling compressed extent without checksum, device replace will goe into copy_nocow_pages() function. In that function, btrfs will get all inodes referring to this data extents and then use find_or_create_page() to get pages direct from that inode. The problem here is, pages directly from inode are always uncompressed. And for compressed data extent, they mismatch with on-disk data. Thus this leads to corrupted compressed data extent written to replace device. [FIX] In this attempt, we could just remove the "optimization" branch, and let unified scrub_pages() to handle it. Although scrub_pages() won't bother reusing page cache, it will be a little slower, but it does the correct csum checking and won't cause such data corruption caused by "optimization". Note about the fix: this is the minimal fix that can be backported to older stable trees without conflicts. The whole callchain from copy_nocow_pages() can be deleted, and will be in followup patches. Fixes: ff023aac3119 ("Btrfs: add code to scrub to copy read data to another disk") CC: stable@vger.kernel.org # 4.4+ Reported-by: James Harvey <jamespharvey20@gmail.com> Reviewed-by: James Harvey <jamespharvey20@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [ remove code removal, add note why ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03ext4: fix fencepost error in check for inode count overflow during resizeJan Kara
commit 4f2f76f751433908364ccff82f437a57d0e6e9b7 upstream. ext4_resize_fs() has an off-by-one bug when checking whether growing of a filesystem will not overflow inode count. As a result it allows a filesystem with 8192 inodes per group to grow to 64TB which overflows inode count to 0 and makes filesystem unusable. Fix it. Cc: stable@vger.kernel.org Fixes: 3f8a6411fbada1fa482276591e037f3b1adcf55b Reported-by: Jaco Kroon <jaco@uls.co.za> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03ext4: update mtime in ext4_punch_hole even if no blocks are releasedLukas Czerner
commit eee597ac931305eff3d3fd1d61d6aae553bc0984 upstream. Currently in ext4_punch_hole we're going to skip the mtime update if there are no actual blocks to release. However we've actually modified the file by zeroing the partial block so the mtime should be updated. Moreover the sync and datasync handling is skipped as well, which is also wrong. Fix it. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Joe Habermann <joe.habermann@quantum.com> Cc: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03isofs: fix potential memory leak in mount option parsingChengguang Xu
[ Upstream commit 4f34a5130a471f32f2fe7750769ab4057dc3eaa0 ] When specifying string type mount option (e.g., iocharset) several times in a mount, current option parsing may cause memory leak. Hence, call kfree for previous one in this case. Meanwhile, check memory allocation result for it. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03fsnotify: fix ignore mask logic in send_to_group()Amir Goldstein
[ Upstream commit 92183a42898dc400b89da35685d1814ac6acd3d8 ] The ignore mask logic in send_to_group() does not match the logic in fanotify_should_send_event(). In the latter, a vfsmount mark ignore mask precedes an inode mark mask and in the former, it does not. That difference may cause events to be sent to fanotify backend for no reason. Fix the logic in send_to_group() to match that of fanotify_should_send_event(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-13fix io_destroy()/aio_complete() raceAl Viro
commit 4faa99965e027cc057c5145ce45fa772caa04e8d upstream. If io_destroy() gets to cancelling everything that can be cancelled and gets to kiocb_cancel() calling the function driver has left in ->ki_cancel, it becomes vulnerable to a race with IO completion. At that point req is already taken off the list and aio_complete() does *NOT* spin until we (in free_ioctx_users()) releases ->ctx_lock. As the result, it proceeds to kiocb_free(), freing req just it gets passed to ->ki_cancel(). Fix is simple - remove from the list after the call of kiocb_cancel(). All instances of ->ki_cancel() already have to cope with the being called with iocb still on list - that's what happens in io_cancel(2). Cc: stable@kernel.org Fixes: 0460fef2a921 "aio: use cancellation list lazily" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30udf: Provide saner default for invalid uid / gidJan Kara
[ Upstream commit 116e5258e4115aca0c64ac0bf40ded3b353ed626 ] Currently when UDF filesystem is recorded without uid / gid (ids are set to -1), we will assign INVALID_[UG]ID to vfs inode unless user uses uid= and gid= mount options. In such case filesystem could not be modified in any way as VFS refuses to modify files with invalid ids (even by root). This is confusing to users and not very useful default since such media mode is generally used for removable media. Use overflow[ug]id instead so that at least root can modify the filesystem. Reported-by: Steve Kenton <skenton@ou.edu> Reviewed-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30btrfs: fix lockdep splat in btrfs_alloc_subvolume_writersJeff Mahoney
[ Upstream commit 8a5a916d9a35e13576d79cc16e24611821b13e34 ] While running btrfs/011, I hit the following lockdep splat. This is the important bit: pcpu_alloc+0x1ac/0x5e0 __percpu_counter_init+0x4e/0xb0 btrfs_init_fs_root+0x99/0x1c0 [btrfs] btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs] resolve_indirect_refs+0x130/0x830 [btrfs] find_parent_nodes+0x69e/0xff0 [btrfs] btrfs_find_all_roots_safe+0xa0/0x110 [btrfs] btrfs_find_all_roots+0x50/0x70 [btrfs] btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs] btrfs_commit_transaction+0x3ce/0x9b0 [btrfs] The percpu_counter_init call in btrfs_alloc_subvolume_writers uses GFP_KERNEL, which we can't do during transaction commit. This switches it to GFP_NOFS. ======================================================== WARNING: possible irq lock inversion dependency detected 4.12.14-kvmsmall #8 Tainted: G W -------------------------------------------------------- kswapd0/50 just changed the state of lock: (&delayed_node->mutex){+.+.-.}, at: [<ffffffffc06994fa>] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] but this lock took another, RECLAIM_FS-unsafe lock in the past: (pcpu_alloc_mutex){+.+.+.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Chain exists of: &delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(pcpu_alloc_mutex); local_irq_disable(); lock(&delayed_node->mutex); lock(&found->groups_sem); <Interrupt> lock(&delayed_node->mutex); *** DEADLOCK *** 2 locks held by kswapd0/50: #0: (shrinker_rwsem){++++..}, at: [<ffffffff811dc11f>] shrink_slab+0x7f/0x5b0 #1: (&type->s_umount_key#30){+++++.}, at: [<ffffffff8126dec6>] trylock_super+0x16/0x50 the shortest dependencies between 2nd lock and 1st lock: -> (pcpu_alloc_mutex){+.+.+.} ops: 4904 { HARDIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 pcpu_alloc+0x1ac/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 __do_tune_cpucache+0x2c/0x220 do_tune_cpucache+0x26/0xc0 enable_cpucache+0x6d/0xf0 kmem_cache_init_late+0x42/0x75 start_kernel+0x343/0x4cb x86_64_start_kernel+0x127/0x134 secondary_startup_64+0xa5/0xb0 SOFTIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 pcpu_alloc+0x1ac/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 __do_tune_cpucache+0x2c/0x220 do_tune_cpucache+0x26/0xc0 enable_cpucache+0x6d/0xf0 kmem_cache_init_late+0x42/0x75 start_kernel+0x343/0x4cb x86_64_start_kernel+0x127/0x134 secondary_startup_64+0xa5/0xb0 RECLAIM_FS-ON-W at: __kmalloc+0x47/0x310 pcpu_extend_area_map+0x2b/0xc0 pcpu_alloc+0x3ec/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 __do_tune_cpucache+0x2c/0x220 do_tune_cpucache+0x26/0xc0 enable_cpucache+0x6d/0xf0 __kmem_cache_create+0x1bf/0x390 create_cache+0xba/0x1b0 kmem_cache_create+0x1f8/0x2b0 ksm_init+0x6f/0x19d do_one_initcall+0x50/0x1b0 kernel_init_freeable+0x201/0x289 kernel_init+0xa/0x100 ret_from_fork+0x3a/0x50 INITIAL USE at: __mutex_lock+0x4e/0x8c0 pcpu_alloc+0x1ac/0x5e0 alloc_kmem_cache_cpus.isra.70+0x25/0xa0 setup_cpu_cache+0x2f/0x1f0 __kmem_cache_create+0x1bf/0x390 create_boot_cache+0x8b/0xb1 kmem_cache_init+0xa1/0x19e start_kernel+0x270/0x4cb x86_64_start_kernel+0x127/0x134 secondary_startup_64+0xa5/0xb0 } ... key at: [<ffffffff821d8e70>] pcpu_alloc_mutex+0x70/0xa0 ... acquired at: pcpu_alloc+0x1ac/0x5e0 __percpu_counter_init+0x4e/0xb0 btrfs_init_fs_root+0x99/0x1c0 [btrfs] btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs] resolve_indirect_refs+0x130/0x830 [btrfs] find_parent_nodes+0x69e/0xff0 [btrfs] btrfs_find_all_roots_safe+0xa0/0x110 [btrfs] btrfs_find_all_roots+0x50/0x70 [btrfs] btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs] btrfs_commit_transaction+0x3ce/0x9b0 [btrfs] transaction_kthread+0x176/0x1b0 [btrfs] kthread+0x102/0x140 ret_from_fork+0x3a/0x50 -> (&fs_info->commit_root_sem){++++..} ops: 1566382 { HARDIRQ-ON-W at: down_write+0x3e/0xa0 cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] cow_file_range.isra.66+0x133/0x470 [btrfs] run_delalloc_range+0x121/0x410 [btrfs] writepage_delalloc.isra.50+0xfe/0x180 [btrfs] __extent_writepage+0x19a/0x360 [btrfs] extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs] extent_writepages+0x4d/0x60 [btrfs] do_writepages+0x1a/0x70 __filemap_fdatawrite_range+0xa7/0xe0 btrfs_rename+0x5ee/0xdb0 [btrfs] vfs_rename+0x52a/0x7e0 SyS_rename+0x351/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 HARDIRQ-ON-R at: down_read+0x35/0x90 caching_thread+0x57/0x560 [btrfs] normal_work_helper+0x1c0/0x5e0 [btrfs] process_one_work+0x1e0/0x5c0 worker_thread+0x44/0x390 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 SOFTIRQ-ON-W at: down_write+0x3e/0xa0 cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] cow_file_range.isra.66+0x133/0x470 [btrfs] run_delalloc_range+0x121/0x410 [btrfs] writepage_delalloc.isra.50+0xfe/0x180 [btrfs] __extent_writepage+0x19a/0x360 [btrfs] extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs] extent_writepages+0x4d/0x60 [btrfs] do_writepages+0x1a/0x70 __filemap_fdatawrite_range+0xa7/0xe0 btrfs_rename+0x5ee/0xdb0 [btrfs] vfs_rename+0x52a/0x7e0 SyS_rename+0x351/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-R at: down_read+0x35/0x90 caching_thread+0x57/0x560 [btrfs] normal_work_helper+0x1c0/0x5e0 [btrfs] process_one_work+0x1e0/0x5c0 worker_thread+0x44/0x390 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 INITIAL USE at: down_write+0x3e/0xa0 cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] cow_file_range.isra.66+0x133/0x470 [btrfs] run_delalloc_range+0x121/0x410 [btrfs] writepage_delalloc.isra.50+0xfe/0x180 [btrfs] __extent_writepage+0x19a/0x360 [btrfs] extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs] extent_writepages+0x4d/0x60 [btrfs] do_writepages+0x1a/0x70 __filemap_fdatawrite_range+0xa7/0xe0 btrfs_rename+0x5ee/0xdb0 [btrfs] vfs_rename+0x52a/0x7e0 SyS_rename+0x351/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 } ... key at: [<ffffffffc0729578>] __key.61970+0x0/0xfffffffffff9aa88 [btrfs] ... acquired at: cache_block_group+0x287/0x420 [btrfs] find_free_extent+0x106c/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs] btrfs_create_tree+0xbb/0x2a0 [btrfs] btrfs_create_uuid_tree+0x37/0x140 [btrfs] open_ctree+0x23c0/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> (&found->groups_sem){++++..} ops: 2134587 { HARDIRQ-ON-W at: down_write+0x3e/0xa0 __link_block_group+0x34/0x130 [btrfs] btrfs_read_block_groups+0x33d/0x7b0 [btrfs] open_ctree+0x2054/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 HARDIRQ-ON-R at: down_read+0x35/0x90 btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs] open_ctree+0x207b/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-W at: down_write+0x3e/0xa0 __link_block_group+0x34/0x130 [btrfs] btrfs_read_block_groups+0x33d/0x7b0 [btrfs] open_ctree+0x2054/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-R at: down_read+0x35/0x90 btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs] open_ctree+0x207b/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 INITIAL USE at: down_write+0x3e/0xa0 __link_block_group+0x34/0x130 [btrfs] btrfs_read_block_groups+0x33d/0x7b0 [btrfs] open_ctree+0x2054/0x2660 [btrfs] btrfs_mount+0xd36/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 btrfs_mount+0x18c/0xf90 [btrfs] mount_fs+0x3a/0x160 vfs_kern_mount+0x66/0x150 do_mount+0x1c1/0xcc0 SyS_mount+0x7e/0xd0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 } ... key at: [<ffffffffc0729488>] __key.59101+0x0/0xfffffffffff9ab78 [btrfs] ... acquired at: find_free_extent+0xcb4/0x12d0 [btrfs] btrfs_reserve_extent+0xd8/0x170 [btrfs] btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs] __btrfs_cow_block+0x110/0x5b0 [btrfs] btrfs_cow_block+0xd7/0x290 [btrfs] btrfs_search_slot+0x1f6/0x960 [btrfs] btrfs_lookup_inode+0x2a/0x90 [btrfs] __btrfs_update_delayed_inode+0x65/0x210 [btrfs] btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs] btrfs_evict_inode+0x3fe/0x6a0 [btrfs] evict+0xc4/0x190 __dentry_kill+0xbf/0x170 dput+0x2ae/0x2f0 SyS_rename+0x2a6/0x3b0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> (&delayed_node->mutex){+.+.-.} ops: 5580204 { HARDIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 btrfs_delayed_update_inode+0x46/0x6e0 [btrfs] btrfs_update_inode+0x83/0x110 [btrfs] btrfs_dirty_inode+0x62/0xe0 [btrfs] touch_atime+0x8c/0xb0 do_generic_file_read+0x818/0xb10 __vfs_read+0xdc/0x150 vfs_read+0x8a/0x130 SyS_read+0x45/0xa0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 SOFTIRQ-ON-W at: __mutex_lock+0x4e/0x8c0 btrfs_delayed_update_inode+0x46/0x6e0 [btrfs] btrfs_update_inode+0x83/0x110 [btrfs] btrfs_dirty_inode+0x62/0xe0 [btrfs] touch_atime+0x8c/0xb0 do_generic_file_read+0x818/0xb10 __vfs_read+0xdc/0x150 vfs_read+0x8a/0x130 SyS_read+0x45/0xa0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 IN-RECLAIM_FS-W at: __mutex_lock+0x4e/0x8c0 __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] btrfs_evict_inode+0x22c/0x6a0 [btrfs] evict+0xc4/0x190 dispose_list+0x35/0x50 prune_icache_sb+0x42/0x50 super_cache_scan+0x139/0x190 shrink_slab+0x262/0x5b0 shrink_node+0x2eb/0x2f0 kswapd+0x2eb/0x890 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 INITIAL USE at: __mutex_lock+0x4e/0x8c0 btrfs_delayed_update_inode+0x46/0x6e0 [btrfs] btrfs_update_inode+0x83/0x110 [btrfs] btrfs_dirty_inode+0x62/0xe0 [btrfs] touch_atime+0x8c/0xb0 do_generic_file_read+0x818/0xb10 __vfs_read+0xdc/0x150 vfs_read+0x8a/0x130 SyS_read+0x45/0xa0 do_syscall_64+0x79/0x1e0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 } ... key at: [<ffffffffc072d488>] __key.56935+0x0/0xfffffffffff96b78 [btrfs] ... acquired at: __lock_acquire+0x264/0x11c0 lock_acquire+0xbd/0x1e0 __mutex_lock+0x4e/0x8c0 __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] btrfs_evict_inode+0x22c/0x6a0 [btrfs] evict+0xc4/0x190 dispose_list+0x35/0x50 prune_icache_sb+0x42/0x50 super_cache_scan+0x139/0x190 shrink_slab+0x262/0x5b0 shrink_node+0x2eb/0x2f0 kswapd+0x2eb/0x890 kthread+0x102/0x140 ret_from_fork+0x3a/0x50 stack backtrace: CPU: 1 PID: 50 Comm: kswapd0 Tainted: G W 4.12.14-kvmsmall #8 SLE15 (unreleased) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0x78/0xb7 print_irq_inversion_bug.part.38+0x19f/0x1aa check_usage_forwards+0x102/0x120 ? ret_from_fork+0x3a/0x50 ? check_usage_backwards+0x110/0x110 mark_lock+0x16c/0x270 __lock_acquire+0x264/0x11c0 ? pagevec_lookup_entries+0x1a/0x30 ? truncate_inode_pages_range+0x2b3/0x7f0 lock_acquire+0xbd/0x1e0 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] __mutex_lock+0x4e/0x8c0 ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] ? btrfs_evict_inode+0x1f6/0x6a0 [btrfs] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs] btrfs_evict_inode+0x22c/0x6a0 [btrfs] evict+0xc4/0x190 dispose_list+0x35/0x50 prune_icache_sb+0x42/0x50 super_cache_scan+0x139/0x190 shrink_slab+0x262/0x5b0 shrink_node+0x2eb/0x2f0 kswapd+0x2eb/0x890 kthread+0x102/0x140 ? mem_cgroup_shrink_node+0x2c0/0x2c0 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x3a/0x50 Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30Btrfs: fix copy_items() return value when logging an inodeFilipe Manana
[ Upstream commit 8434ec46c6e3232cebc25a910363b29f5c617820 ] When logging an inode, at tree-log.c:copy_items(), if we call btrfs_next_leaf() at the loop which checks for the need to log holes, we need to make sure copy_items() returns the value 1 to its caller and not 0 (on success). This is because the path the caller passed was released and is now different from what is was before, and the caller expects a return value of 0 to mean both success and that the path has not changed, while a return value of 1 means both success and signals the caller that it can not reuse the path, it has to perform another tree search. Even though this is a case that should not be triggered on normal circumstances or very rare at least, its consequences can be very unpredictable (especially when replaying a log tree). Fixes: 16e7549f045d ("Btrfs: incompatible format change to remove hole extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30btrfs: tests/qgroup: Fix wrong tree backref levelQu Wenruo
[ Upstream commit 3c0efdf03b2d127f0e40e30db4e7aa0429b1b79a ] The extent tree of the test fs is like the following: BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919 item 0 key (4096 168 4096) itemoff 3944 itemsize 51 extent refs 1 gen 1 flags 2 tree block key (68719476736 0 0) level 1 ^^^^^^^ ref#0: tree block backref root 5 And it's using an empty tree for fs tree, so there is no way that its level can be 1. For REAL (created by mkfs) fs tree backref with no skinny metadata, the result should look like: item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51 refs 1 gen 4 flags TREE_BLOCK tree block key (256 INODE_ITEM 0) level 0 ^^^^^^^ tree block backref root 5 Fix the level to 0, so it won't break later tree level checker. Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30btrfs: Fix possible softlock on single core machinesNikolay Borisov
[ Upstream commit 1e1c50a929bc9e49bc3f9935b92450d9e69f8158 ] do_chunk_alloc implements a loop checking whether there is a pending chunk allocation and if so causes the caller do loop. Generally this loop is executed only once, however testing with btrfs/072 on a single core vm machines uncovered an extreme case where the system could loop indefinitely. This is due to a missing cond_resched when loop which doesn't give a chance to the previous chunk allocator finish its job. The fix is to simply add the missing cond_resched. Fixes: 6d74119f1a3e ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc") Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30Btrfs: fix NULL pointer dereference in log_dir_itemsLiu Bo
[ Upstream commit 80c0b4210a963e31529e15bf90519708ec947596 ] 0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is returned, path->nodes[0] could be NULL, log_dir_items lacks such a check for <0 and we may run into a null pointer dereference panic. Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30Btrfs: bail out on error during replay_dir_deletesLiu Bo
[ Upstream commit b98def7ca6e152ee55e36863dddf6f41f12d1dc6 ] If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs to bail out, otherwise @ret would be forced to be 0 after 'break;' and the caller won't be aware of it. Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30Force log to disk before reading the AGF during a fstrimCarlos Maiolino
[ Upstream commit 8c81dd46ef3c416b3b95e3020fb90dbd44e6140b ] Forcing the log to disk after reading the agf is wrong, we might be calling xfs_log_force with XFS_LOG_SYNC with a metadata lock held. This can cause a deadlock when racing a fstrim with a filesystem shutdown. The deadlock has been identified due a miscalculation bug in device-mapper dm-thin, which returns lack of space to its users earlier than the device itself really runs out of space, changing the device-mapper volume into an error state. The problem happened while filling the filesystem with a single file, triggering the bug in device-mapper, consequently causing an IO error and shutting down the filesystem. If such file is removed, and fstrim executed before the XFS finishes the shut down process, the fstrim process will end up holding the buffer lock, and going to sleep on the cil wait queue. At this point, the shut down process will try to wake up all the threads waiting on the cil wait queue, but for this, it will try to hold the same buffer log already held my the fstrim, locking up the filesystem. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30fs/proc/proc_sysctl.c: fix potential page fault while unregistering sysctl tableDanilo Krummrich
[ Upstream commit a0b0d1c345d0317efe594df268feb5ccc99f651e ] proc_sys_link_fill_cache() does not take currently unregistering sysctl tables into account, which might result into a page fault in sysctl_follow_link() - add a check to fix it. This bug has been present since v3.4. Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets") Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30Btrfs: send, fix issuing write op when processing hole in no data modeFilipe Manana
[ Upstream commit d4dfc0f4d39475ccbbac947880b5464a74c30b99 ] When doing an incremental send of a filesystem with the no-holes feature enabled, we end up issuing a write operation when using the no data mode send flag, instead of issuing an update extent operation. Fix this by issuing the update extent operation instead. Trivial reproducer: $ mkfs.btrfs -f -O no-holes /dev/sdc $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdc /mnt/sdc $ mount /dev/sdd /mnt/sdd $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1 $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2 $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \ | btrfs receive -vv /mnt/sdd Before this change the output of the second receive command is: receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447... utimes write foobar, offset 8192, len 8192 utimes foobar BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-... After this change it is: receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64... utimes update_extent foobar: offset=8192, len=8192 utimes foobar BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64... Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30cifs: silence compiler warnings showing up with gcc-8.0.0Arnd Bergmann
[ Upstream commit ade7db991b47ab3016a414468164f4966bd08202 ] This bug was fixed before, but came up again with the latest compiler in another function: fs/cifs/cifssmb.c: In function 'CIFSSMBSetEA': fs/cifs/cifssmb.c:6362:3: error: 'strncpy' offset 8 is out of the bounds [0, 4] [-Werror=array-bounds] strncpy(parm_data->list[0].name, ea_name, name_len); Let's apply the same fix that was used for the other instances. Fixes: b2a3ad9ca502 ("cifs: silence compiler warnings showing up with gcc-4.7.0") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30proc: fix /proc/*/map_files lookupAlexey Dobriyan
[ Upstream commit ac7f1061c2c11bb8936b1b6a94cdb48de732f7a4 ] Current code does: if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2) However sscanf() is broken garbage. It silently accepts whitespace between format specifiers (did you know that?). It silently accepts valid strings which result in integer overflow. Do not use sscanf() for any even remotely reliable parsing code. OK # readlink '/proc/1/map_files/55a23af39000-55a23b05b000' /lib/systemd/systemd broken # readlink '/proc/1/map_files/ 55a23af39000-55a23b05b000' /lib/systemd/systemd broken # readlink '/proc/1/map_files/55a23af39000-55a23b05b000 ' /lib/systemd/systemd very broken # readlink '/proc/1/map_files/1000000000000000055a23af39000-55a23b05b000' /lib/systemd/systemd Andrei said: : This patch breaks criu. It was a bug in criu. And this bug is on a minor : path, which works when memfd_create() isn't available. It is a reason why : I ask to not backport this patch to stable kernels. : : In CRIU this bug can be triggered, only if this patch will be backported : to a kernel which version is lower than v3.16. Link: http://lkml.kernel.org/r/20171120212706.GA14325@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attributepiaojun
[ Upstream commit 16c8d569f5704a84164f30ff01b29879f3438065 ] The race between *set_acl and *get_acl will cause getting incomplete xattr data as below: processA processB ocfs2_set_acl ocfs2_xattr_set __ocfs2_xattr_set_handle ocfs2_get_acl_nolock ocfs2_xattr_get_nolock: processB may get incomplete xattr data if processA hasn't set_acl done. So we should use 'ip_xattr_sem' to protect getting extended attribute in ocfs2_get_acl_nolock(), as other processes could be changing it concurrently. Link: http://lkml.kernel.org/r/5A5DDCFF.7030001@huawei.com Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Alex Chen <alex.chen@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30ocfs2: return -EROFS to mount.ocfs2 if inode block is invalidpiaojun
[ Upstream commit 025bcbde3634b2c9b316f227fed13ad6ad6817fb ] If metadata is corrupted such as 'invalid inode block', we will get failed by calling 'mount()' and then set filesystem readonly as below: ocfs2_mount ocfs2_initialize_super ocfs2_init_global_system_inodes ocfs2_iget ocfs2_read_locked_inode ocfs2_validate_inode_block ocfs2_error ocfs2_handle_error ocfs2_set_ro_flag(osb, 0); // set readonly In this situation we need return -EROFS to 'mount.ocfs2', so that user can fix it by fsck. And then mount again. In addition, 'mount.ocfs2' should be updated correspondingly as it only return 1 for all errno. And I will post a patch for 'mount.ocfs2' too. Link: http://lkml.kernel.org/r/5A4302FA.2010606@huawei.com Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30jffs2: Fix use-after-free bug in jffs2_iget()'s error handling pathJake Daryll Obina
[ Upstream commit 5bdd0c6f89fba430e18d636493398389dadc3b17 ] If jffs2_iget() fails for a newly-allocated inode, jffs2_do_clear_inode() can get called twice in the error handling path, the first call in jffs2_iget() itself and the second through iget_failed(). This can result to a use-after-free error in the second jffs2_do_clear_inode() call, such as shown by the oops below wherein the second jffs2_do_clear_inode() call was trying to free node fragments that were already freed in the first jffs2_do_clear_inode() call. [ 78.178860] jffs2: error: (1904) jffs2_do_read_inode_internal: CRC failed for read_inode of inode 24 at physical location 0x1fc00c [ 78.178914] Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b7b [ 78.185871] pgd = ffffffc03a567000 [ 78.188794] [6b6b6b6b6b6b6b7b] *pgd=0000000000000000, *pud=0000000000000000 [ 78.194968] Internal error: Oops: 96000004 [#1] PREEMPT SMP ... [ 78.513147] PC is at rb_first_postorder+0xc/0x28 [ 78.516503] LR is at jffs2_kill_fragtree+0x28/0x90 [jffs2] [ 78.520672] pc : [<ffffff8008323d28>] lr : [<ffffff8000eb1cc8>] pstate: 60000105 [ 78.526757] sp : ffffff800cea38f0 [ 78.528753] x29: ffffff800cea38f0 x28: ffffffc01f3f8e80 [ 78.532754] x27: 0000000000000000 x26: ffffff800cea3c70 [ 78.536756] x25: 00000000dc67c8ae x24: ffffffc033d6945d [ 78.540759] x23: ffffffc036811740 x22: ffffff800891a5b8 [ 78.544760] x21: 0000000000000000 x20: 0000000000000000 [ 78.548762] x19: ffffffc037d48910 x18: ffffff800891a588 [ 78.552764] x17: 0000000000000800 x16: 0000000000000c00 [ 78.556766] x15: 0000000000000010 x14: 6f2065646f6e695f [ 78.560767] x13: 6461657220726f66 x12: 2064656c69616620 [ 78.564769] x11: 435243203a6c616e x10: 7265746e695f6564 [ 78.568771] x9 : 6f6e695f64616572 x8 : ffffffc037974038 [ 78.572774] x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008 [ 78.576775] x5 : 002f91d85bd44a2f x4 : 0000000000000000 [ 78.580777] x3 : 0000000000000000 x2 : 000000403755e000 [ 78.584779] x1 : 6b6b6b6b6b6b6b6b x0 : 6b6b6b6b6b6b6b6b ... [ 79.038551] [<ffffff8008323d28>] rb_first_postorder+0xc/0x28 [ 79.042962] [<ffffff8000eb5578>] jffs2_do_clear_inode+0x88/0x100 [jffs2] [ 79.048395] [<ffffff8000eb9ddc>] jffs2_evict_inode+0x3c/0x48 [jffs2] [ 79.053443] [<ffffff8008201ca8>] evict+0xb0/0x168 [ 79.056835] [<ffffff8008202650>] iput+0x1c0/0x200 [ 79.060228] [<ffffff800820408c>] iget_failed+0x30/0x3c [ 79.064097] [<ffffff8000eba0c0>] jffs2_iget+0x2d8/0x360 [jffs2] [ 79.068740] [<ffffff8000eb0a60>] jffs2_lookup+0xe8/0x130 [jffs2] [ 79.073434] [<ffffff80081f1a28>] lookup_slow+0x118/0x190 [ 79.077435] [<ffffff80081f4708>] walk_component+0xfc/0x28c [ 79.081610] [<ffffff80081f4dd0>] path_lookupat+0x84/0x108 [ 79.085699] [<ffffff80081f5578>] filename_lookup+0x88/0x100 [ 79.089960] [<ffffff80081f572c>] user_path_at_empty+0x58/0x6c [ 79.094396] [<ffffff80081ebe14>] vfs_statx+0xa4/0x114 [ 79.098138] [<ffffff80081ec44c>] SyS_newfstatat+0x58/0x98 [ 79.102227] [<ffffff800808354c>] __sys_trace_return+0x0/0x4 [ 79.106489] Code: d65f03c0 f9400001 b40000e1 aa0103e0 (f9400821) The jffs2_do_clear_inode() call in jffs2_iget() is unnecessary since iget_failed() will eventually call jffs2_do_clear_inode() if needed, so just remove it. Fixes: 5451f79f5f81 ("iget: stop JFFS2 from using iget() and read_inode()") Reviewed-by: Richard Weinberger <richard@nod.at> Signed-off-by: Jake Daryll Obina <jake.obina@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30btrfs: Fix out of bounds access in btrfs_search_slotNikolay Borisov
[ Upstream commit 9ea2c7c9da13c9073e371c046cbbc45481ecb459 ] When modifying a tree where the root is at BTRFS_MAX_LEVEL - 1 then the level variable is going to be 7 (this is the max height of the tree). On the other hand btrfs_cow_block is always called with "level + 1" as an index into the nodes and slots arrays. This leads to an out of bounds access. Admittdely this will be benign since an OOB access of the nodes array will likely read the 0th element from the slots array, which in this case is going to be 0 (since we start CoW at the top of the tree). The OOB access into the slots array in turn will read the 0th and 1st values of the locks array, which would both be 0 at the time. However, this benign behavior relies on the fact that the path being passed hasn't been initialised, if it has already been used to query a btree then it could potentially have populated the nodes/slots arrays. Fix it by explicitly checking if we are at level 7 (the maximum allowed index in nodes/slots arrays) and explicitly call the CoW routine with NULL for parent's node/slot. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Fixes-coverity-id: 711515 Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30nfs: Do not convert nfs_idmap_cache_timeout to jiffiesJan Chochol
[ Upstream commit cbebc6ef4fc830f4040d4140bf53484812d5d5d9 ] Since commit 57e62324e469 ("NFS: Store the legacy idmapper result in the keyring") nfs_idmap_cache_timeout changed units from jiffies to seconds. Unfortunately sysctl interface was not updated accordingly. As a effect updating /proc/sys/fs/nfs/idmap_cache_timeout with some value will incorrectly multiply this value by HZ. Also reading /proc/sys/fs/nfs/idmap_cache_timeout will show real value divided by HZ. Fixes: 57e62324e469 ("NFS: Store the legacy idmapper result in the keyring") Signed-off-by: Jan Chochol <jan@chochol.info> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30do d_instantiate/unlock_new_inode combinations safelyAl Viro
commit 1e2e547a93a00ebc21582c06ca3c6cfea2a309ee upstream. For anything NFS-exported we do _not_ want to unlock new inode before it has grown an alias; original set of fixes got the ordering right, but missed the nasty complication in case of lockdep being enabled - unlock_new_inode() does lockdep_annotate_inode_mutex_key(inode) which can only be done before anyone gets a chance to touch ->i_mutex. Unfortunately, flipping the order and doing unlock_new_inode() before d_instantiate() opens a window when mkdir can race with open-by-fhandle on a guessed fhandle, leading to multiple aliases for a directory inode and all the breakage that follows from that. Correct solution: a new primitive (d_instantiate_new()) combining these two in the right order - lockdep annotate, then d_instantiate(), then the rest of unlock_new_inode(). All combinations of d_instantiate() with unlock_new_inode() should be converted to that. Cc: stable@kernel.org # 2.6.29 and later Tested-by: Mike Marshall <hubcap@omnibond.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30aio: fix io_destroy(2) vs. lookup_ioctx() raceAl Viro
commit baf10564fbb66ea222cae66fbff11c444590ffd9 upstream. kill_ioctx() used to have an explicit RCU delay between removing the reference from ->ioctx_table and percpu_ref_kill() dropping the refcount. At some point that delay had been removed, on the theory that percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was the wrong kind of RCU delay and it didn't care about rcu_read_lock() used by lookup_ioctx(). As the result, we could get ctx freed right under lookup_ioctx(). Tejun has fixed that in a6d7cff472e ("fs/aio: Add explicit RCU grace period when freeing kioctx"); however, that fix is not enough. Suppose io_destroy() from one thread races with e.g. io_setup() from another; CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2 has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the refcount, getting it to 0 and triggering a call of free_ioctx_users(), which proceeds to drop the secondary refcount and once that reaches zero calls free_ioctx_reqs(). That does INIT_RCU_WORK(&ctx->free_rwork, free_ioctx); queue_rcu_work(system_wq, &ctx->free_rwork); and schedules freeing the whole thing after RCU delay. In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the refcount from 0 to 1 and returned the reference to io_setup(). Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get freed until after percpu_ref_get(). Sure, we'd increment the counter before ctx can be freed. Now we are out of rcu_read_lock() and there's nothing to stop freeing of the whole thing. Unfortunately, CPU2 assumes that since it has grabbed the reference, ctx is *NOT* going away until it gets around to dropping that reference. The fix is obvious - use percpu_ref_tryget_live() and treat failure as miss. It's not costlier than what we currently do in normal case, it's safe to call since freeing *is* delayed and it closes the race window - either lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx() fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see the object in question at all. Cc: stable@kernel.org Fixes: a6d7cff472e "fs/aio: Add explicit RCU grace period when freeing kioctx" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-30affs_lookup(): close a race with affs_remove_link()Al Viro
commit 30da870ce4a4e007c901858a96e9e394a1daa74a upstream. we unlock the directory hash too early - if we are looking at secondary link and primary (in another directory) gets removed just as we unlock, we could have the old primary moved in place of the secondary, leaving us to look into freed entry (and leaving our dentry with ->d_fsdata pointing to a freed entry). Cc: stable@vger.kernel.org # 2.4.4+ Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>