Age | Commit message (Collapse) | Author |
|
commit 6cc7c266e5b47d3cd2b5bb7fd3aac4e6bb2dd1d2 upstream.
If the template field 'd' is chosen and the digest to be added to the
measurement entry was not calculated with SHA1 or MD5, it is
recalculated with SHA1, by using the passed file descriptor. However, this
cannot be done for boot_aggregate, because there is no file descriptor.
This patch adds a call to ima_calc_boot_aggregate() in
ima_eventdigest_init(), so that the digest can be recalculated also for the
boot_aggregate entry.
Cc: stable@vger.kernel.org # 3.13.x
Fixes: 3ce1217d6cd5d ("ima: define template fields library and new helpers")
Reported-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 067a436b1b0aafa593344fddd711a755a58afb3b upstream.
This patch prevents the following oops:
[ 10.771813] BUG: kernel NULL pointer dereference, address: 0000000000000
[...]
[ 10.779790] RIP: 0010:ima_match_policy+0xf7/0xb80
[...]
[ 10.798576] Call Trace:
[ 10.798993] ? ima_lsm_policy_change+0x2b0/0x2b0
[ 10.799753] ? inode_init_owner+0x1a0/0x1a0
[ 10.800484] ? _raw_spin_lock+0x7a/0xd0
[ 10.801592] ima_must_appraise.part.0+0xb6/0xf0
[ 10.802313] ? ima_fix_xattr.isra.0+0xd0/0xd0
[ 10.803167] ima_must_appraise+0x4f/0x70
[ 10.804004] ima_post_path_mknod+0x2e/0x80
[ 10.804800] do_mknodat+0x396/0x3c0
It occurs when there is a failure during IMA initialization, and
ima_init_policy() is not called. IMA hooks still call ima_match_policy()
but ima_rules is NULL. This patch prevents the crash by directly assigning
the ima_default_policy pointer to ima_rules when ima_rules is defined. This
wouldn't alter the existing behavior, as ima_rules is always set at the end
of ima_init_policy().
Cc: stable@vger.kernel.org # 3.7.x
Fixes: 07f6a79415d7d ("ima: add appraise action keywords and default rules")
Reported-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e144d6b265415ddbdc54b3f17f4f95133effa5a8 upstream.
Evaluate error in init_ima() before register_blocking_lsm_notifier() and
return if not zero.
Cc: stable@vger.kernel.org # 5.3.x
Fixes: b16942455193 ("ima: use the lsm policy update notifier")
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6f1a1d103b48b1533a9c804e7a069e2c8e937ce7 upstream.
boot_aggregate is the first entry of IMA measurement list. Its purpose is
to link pre-boot measurements to IMA measurements. As IMA was designed to
work with a TPM 1.2, the SHA1 PCR bank was always selected even if a
TPM 2.0 with support for stronger hash algorithms is available.
This patch first tries to find a PCR bank with the IMA default hash
algorithm. If it does not find it, it selects the SHA256 PCR bank for
TPM 2.0 and SHA1 for TPM 1.2. Ultimately, it selects SHA1 also for TPM 2.0
if the SHA256 PCR bank is not found.
If none of the PCR banks above can be found, boot_aggregate file digest is
filled with zeros, as for TPM bypass, making it impossible to perform a
remote attestation of the system.
Cc: stable@vger.kernel.org # 5.1.x
Fixes: 879b589210a9 ("tpm: retrieve digest size of unknown algorithms with PCR read")
Reported-by: Jerry Snitselaar <jsnitsel@redhat.com>
Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1129d31b55d509f15e72dc68e4b5c3a4d7b4da8d upstream.
Function hash_long() accepts unsigned long, while currently only one byte
is passed from ima_hash_key(), which calculates a key for ima_htable.
Given that hashing the digest does not give clear benefits compared to
using the digest itself, remove hash_long() and return the modulus
calculated on the first two bytes of the digest with the number of slots.
Also reduce the depth of the hash table by doubling the number of slots.
Cc: stable@vger.kernel.org
Fixes: 3323eec921ef ("integrity: IMA as an integrity service provider")
Co-developed-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Acked-by: David.Laight@aculab.com (big endian system concerns)
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit da97f2d56bbd880b4138916a7ef96f9881a551b2 upstream.
Now that deferred pages are initialized with interrupts enabled we can
replace touch_nmi_watchdog() with cond_resched(), as it was before
3a2d7fa8a3d5.
For now, we cannot do the same in deferred_grow_zone() as it is still
initializes pages with interrupts disabled.
This change fixes RCU problem described in
https://lkml.kernel.org/r/20200401104156.11564-2-david@redhat.com
[ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
[ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
[ 60.475000] Sending NMI from CPU 0 to CPUs 1:
[ 1.760091] NMI backtrace for cpu 1
[ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
[ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
[ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
[ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab <b8> 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
[ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
[ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
[ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
[ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
[ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
[ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
[ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
[ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
[ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1.760091] Call Trace:
[ 1.760091] deferred_init_pages+0x8f/0xbf
[ 1.760091] deferred_init_memmap+0x184/0x29d
[ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
[ 1.760091] kthread+0x112/0x130
[ 1.760091] ? kthread_flush_work_fn+0x10/0x10
[ 1.760091] ret_from_fork+0x35/0x40
[ 89.123011] node 0 initialised, 1055935372 pages in 88650ms
Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Reported-by: Yiqian Wei <yiwei@redhat.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> [4.17+]
Link: http://lkml.kernel.org/r/20200403140952.17177-4-pasha.tatashin@soleen.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
deferred init
commit 117003c32771df617acf66e140fbdbdeb0ac71f5 upstream.
Patch series "initialize deferred pages with interrupts enabled", v4.
Keep interrupts enabled during deferred page initialization in order to
make code more modular and allow jiffies to update.
Original approach, and discussion can be found here:
http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com
This patch (of 3):
deferred_init_memmap() disables interrupts the entire time, so it calls
touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it
will run with interrupts enabled, at which point cond_resched() should be
used instead.
deferred_grow_zone() makes the same watchdog calls through code shared
with deferred init but will continue to run with interrupts disabled, so
it can't call cond_resched().
Pull the watchdog calls up to these two places to allow the first to be
changed later, independently of the second. The frequency reduces from
twice per pageblock (init and free) to once per max order block.
Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: James Morris <jmorris@namei.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Yiqian Wei <yiwei@redhat.com>
Cc: <stable@vger.kernel.org> [4.17+]
Link: http://lkml.kernel.org/r/20200403140952.17177-2-pasha.tatashin@soleen.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a202bf71f08b3ef15356db30535e30b03cf23aec upstream.
CPU_LOONGSON2EF need software to maintain cache consistency,
so modify the 'cpu_needs_post_dma_flush' function to return true
when the cpu type is CPU_LOONGSON2EF.
Cc: stable@vger.kernel.org
Signed-off-by: Lichao Liu <liulichao@loongson.cn>
Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3d060856adfc59afb9d029c233141334cfaba418 upstream.
Initializing struct pages is a long task and keeping interrupts disabled
for the duration of this operation introduces a number of problems.
1. jiffies are not updated for long period of time, and thus incorrect time
is reported. See proposed solution and discussion here:
lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
2. It prevents farther improving deferred page initialization by allowing
intra-node multi-threading.
We are keeping interrupts disabled to solve a rather theoretical problem
that was never observed in real world (See 3a2d7fa8a3d5).
Let's keep interrupts enabled. In case we ever encounter a scenario where
an interrupt thread wants to allocate large amount of memory this early in
boot we can deal with that by growing zone (see deferred_grow_zone()) by
the needed amount before starting deferred_init_memmap() threads.
Before:
[ 1.232459] node 0 initialised, 12058412 pages in 1ms
After:
[ 1.632580] node 0 initialised, 12051227 pages in 436ms
Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Yiqian Wei <yiwei@redhat.com>
Cc: <stable@vger.kernel.org> [4.17+]
Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c444eb564fb16645c172d550359cb3d75fe8a040 upstream.
Write protect anon page faults require an accurate mapcount to decide
if to break the COW or not. This is implemented in the THP path with
reuse_swap_page() ->
page_trans_huge_map_swapcount()/page_trans_huge_mapcount().
If the COW triggers while the other processes sharing the page are
under a huge pmd split, to do an accurate reading, we must ensure the
mapcount isn't computed while it's being transferred from the head
page to the tail pages.
reuse_swap_cache() already runs serialized by the page lock, so it's
enough to add the page lock around __split_huge_pmd_locked too, in
order to add the missing serialization.
Note: the commit in "Fixes" is just to facilitate the backporting,
because the code before such commit didn't try to do an accurate THP
mapcount calculation and it instead used the page_count() to decide if
to COW or not. Both the page_count and the pin_count are THP-wide
refcounts, so they're inaccurate if used in
reuse_swap_page(). Reverting such commit (besides the unrelated fix to
the local anon_vma assignment) would have also opened the window for
memory corruption side effects to certain workloads as documented in
such commit header.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Jann Horn <jannh@google.com>
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
PPC32.
commit 4e3319c23a66dabfd6c35f4d2633d64d99b68096 upstream.
Setting init mem to NX shall depend on sinittext being mapped by
block, not on stext being mapped by block.
Setting text and rodata to RO shall depend on stext being mapped by
block, not on sinittext being mapped by block.
Fixes: 63b2bc619565 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7d565fb8f51b18a3d98445a830b2f6548cb2da2a.1589866984.git.christophe.leroy@csgroup.eu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2166e5edce9ac1edf3b113d6091ef72fcac2d6c4 upstream.
We always preallocate a data extent for writing a free space cache, which
causes writeback to always try the nocow path first, since the free space
inode has the prealloc bit set in its flags.
However if the block group that contains the data extent for the space
cache has been turned to RO mode due to a running scrub or balance for
example, we have to fallback to the cow path. In that case once a new data
extent is allocated we end up calling btrfs_add_reserved_bytes(), which
decrements the counter named bytes_may_use from the data space_info object
with the expection that this counter was previously incremented with the
same amount (the size of the data extent).
However when we started writeout of the space cache at cache_save_setup(),
we incremented the value of the bytes_may_use counter through a call to
btrfs_check_data_free_space() and then decremented it through a call to
btrfs_prealloc_file_range_trans() immediately after. So when starting the
writeback if we fallback to cow mode we have to increment the counter
bytes_may_use of the data space_info again to compensate for the extent
allocation done by the cow path.
When this issue happens we are incorrectly decrementing the bytes_may_use
counter and when its current value is smaller then the amount we try to
subtract we end up with the following warning:
------------[ cut here ]------------
WARNING: CPU: 3 PID: 657 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
CPU: 3 PID: 657 Comm: kworker/u8:7 Tainted: G W 5.6.0-rc7-btrfs-next-58 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: writeback wb_workfn (flush-btrfs-1591)
RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
Code: ff ff 48 (...)
RSP: 0000:ffffa41608f13660 EFLAGS: 00010287
RAX: 0000000000001000 RBX: ffff9615b93ae400 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9615b96ab410
RBP: fffffffffffee000 R08: 0000000000000001 R09: 0000000000000000
R10: ffff961585e62a40 R11: 0000000000000000 R12: ffff9615b96ab400
R13: ffff9615a1a2a000 R14: 0000000000012000 R15: ffff9615b93ae400
FS: 0000000000000000(0000) GS:ffff9615bb200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055cbbc2ae178 CR3: 0000000115794006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
find_free_extent+0x4a0/0x16c0 [btrfs]
btrfs_reserve_extent+0x91/0x180 [btrfs]
cow_file_range+0x12d/0x490 [btrfs]
btrfs_run_delalloc_range+0x9f/0x6d0 [btrfs]
? find_lock_delalloc_range+0x221/0x250 [btrfs]
writepage_delalloc+0xe8/0x150 [btrfs]
__extent_writepage+0xe8/0x4c0 [btrfs]
extent_write_cache_pages+0x237/0x530 [btrfs]
extent_writepages+0x44/0xa0 [btrfs]
do_writepages+0x23/0x80
__writeback_single_inode+0x59/0x700
writeback_sb_inodes+0x267/0x5f0
__writeback_inodes_wb+0x87/0xe0
wb_writeback+0x382/0x590
? wb_workfn+0x4a2/0x6c0
wb_workfn+0x4a2/0x6c0
process_one_work+0x26d/0x6a0
worker_thread+0x4f/0x3e0
? process_one_work+0x6a0/0x6a0
kthread+0x103/0x140
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x3a/0x50
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace bd7c03622e0b0a52 ]---
------------[ cut here ]------------
So fix this by incrementing the bytes_may_use counter of the data
space_info when we fallback to the cow path. If the cow path is successful
the counter is decremented after extent allocation (by
btrfs_add_reserved_bytes()), if it fails it ends up being decremented as
well when clearing the delalloc range (extent_clear_unlock_delalloc()).
This could be triggered sporadically by the test case btrfs/061 from
fstests.
Fixes: 82d5902d9c681b ("Btrfs: Support reading/writing on disk free ino cache")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 467dc47ea99c56e966e99d09dae54869850abeeb upstream.
When doing a buffered write we always try to reserve data space for it,
even when the file has the NOCOW bit set or the write falls into a file
range covered by a prealloc extent. This is done both because it is
expensive to check if we can do a nocow write (checking if an extent is
shared through reflinks or if there's a hole in the range for example),
and because when writeback starts we might actually need to fallback to
COW mode (for example the block group containing the target extents was
turned into RO mode due to a scrub or balance).
When we are unable to reserve data space we check if we can do a nocow
write, and if we can, we proceed with dirtying the pages and setting up
the range for delalloc. In this case the bytes_may_use counter of the
data space_info object is not incremented, unlike in the case where we
are able to reserve data space (done through btrfs_check_data_free_space()
which calls btrfs_alloc_data_chunk_ondemand()).
Later when running delalloc we attempt to start writeback in nocow mode
but we might revert back to cow mode, for example because in the meanwhile
a block group was turned into RO mode by a scrub or relocation. The cow
path after successfully allocating an extent ends up calling
btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of
the data space_info object to have been incremented before - but we did
not do it when the buffered write started, since there was not enough
available data space. So btrfs_add_reserved_bytes() ends up decrementing
the bytes_may_use counter anyway, and when the counter's current value
is smaller then the size of the allocated extent we get a stack trace
like the following:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Workqueue: writeback wb_workfn (flush-btrfs-1754)
RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
Code: ff ff 48 (...)
RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287
RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410
RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000
R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400
R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800
FS: 0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
find_free_extent+0x4a0/0x16c0 [btrfs]
btrfs_reserve_extent+0x91/0x180 [btrfs]
cow_file_range+0x12d/0x490 [btrfs]
run_delalloc_nocow+0x341/0xa40 [btrfs]
btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
? find_lock_delalloc_range+0x221/0x250 [btrfs]
writepage_delalloc+0xe8/0x150 [btrfs]
__extent_writepage+0xe8/0x4c0 [btrfs]
extent_write_cache_pages+0x237/0x530 [btrfs]
? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs]
extent_writepages+0x44/0xa0 [btrfs]
do_writepages+0x23/0x80
__writeback_single_inode+0x59/0x700
writeback_sb_inodes+0x267/0x5f0
__writeback_inodes_wb+0x87/0xe0
wb_writeback+0x382/0x590
? wb_workfn+0x4a2/0x6c0
wb_workfn+0x4a2/0x6c0
process_one_work+0x26d/0x6a0
worker_thread+0x4f/0x3e0
? process_one_work+0x6a0/0x6a0
kthread+0x103/0x140
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x3a/0x50
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace f9f6ef8ec4cd8ec9 ]---
So to fix this, when falling back into cow mode check if space was not
reserved, by testing for the bit EXTENT_NORESERVE in the respective file
range, and if not, increment the bytes_may_use counter for the data
space_info object. Also clear the EXTENT_NORESERVE bit from the range, so
that if the cow path fails it decrements the bytes_may_use counter when
clearing the delalloc range (through the btrfs_clear_delalloc_extent()
callback).
Fixes: 7ee9e4405f264e ("Btrfs: check if we can nocow if we don't have data space")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e2c8e92d1140754073ad3799eb6620c76bab2078 upstream.
If an error happens while running dellaloc in COW mode for a range, we can
end up calling extent_clear_unlock_delalloc() for a range that goes beyond
our range's end offset by 1 byte, which affects 1 extra page. This results
in clearing bits and doing page operations (such as a page unlock) outside
our target range.
Fix that by calling extent_clear_unlock_delalloc() with an inclusive end
offset, instead of an exclusive end offset, at cow_file_range().
Fixes: a315e68f6e8b30 ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e289f03ea79bbc6574b78ac25682555423a91cbb upstream.
When we have extents shared amongst different inodes in the same subvolume,
if we fsync them in parallel we can end up with checksum items in the log
tree that represent ranges which overlap.
For example, consider we have inodes A and B, both sharing an extent that
covers the logical range from X to X + 64KiB:
1) Task A starts an fsync on inode A;
2) Task B starts an fsync on inode B;
3) Task A calls btrfs_csum_file_blocks(), and the first search in the
log tree, through btrfs_lookup_csum(), returns -EFBIG because it
finds an existing checksum item that covers the range from X - 64KiB
to X;
4) Task A checks that the checksum item has not reached the maximum
possible size (MAX_CSUM_ITEMS) and then releases the search path
before it does another path search for insertion (through a direct
call to btrfs_search_slot());
5) As soon as task A releases the path and before it does the search
for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG
too, because there is an existing checksum item that has an end
offset that matches the start offset (X) of the checksum range we want
to log;
6) Task B releases the path;
7) Task A does the path search for insertion (through btrfs_search_slot())
and then verifies that the checksum item that ends at offset X still
exists and extends its size to insert the checksums for the range from
X to X + 64KiB;
8) Task A releases the path and returns from btrfs_csum_file_blocks(),
having inserted the checksums into an existing checksum item that got
its size extended. At this point we have one checksum item in the log
tree that covers the logical range from X - 64KiB to X + 64KiB;
9) Task B now does a search for insertion using btrfs_search_slot() too,
but it finds that the previous checksum item no longer ends at the
offset X, it now ends at an of offset X + 64KiB, so it leaves that item
untouched.
Then it releases the path and calls btrfs_insert_empty_item()
that inserts a checksum item with a key offset corresponding to X and
a size for inserting a single checksum (4 bytes in case of crc32c).
Subsequent iterations end up extending this new checksum item so that
it contains the checksums for the range from X to X + 64KiB.
So after task B returns from btrfs_csum_file_blocks() we end up with
two checksum items in the log tree that have overlapping ranges, one
for the range from X - 64KiB to X + 64KiB, and another for the range
from X to X + 64KiB.
Having checksum items that represent ranges which overlap, regardless of
being in the log tree or in the chekcsums tree, can lead to problems where
checksums for a file range end up not being found. This type of problem
has happened a few times in the past and the following commits fixed them
and explain in detail why having checksum items with overlapping ranges is
problematic:
27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums"
b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync"
40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree"
Since this specific instance of the problem can only happen when logging
inodes, because it is the only case where concurrent attempts to insert
checksums for the same range can happen, fix the issue by using an extent
io tree as a range lock to serialize checksum insertion during inode
logging.
This issue could often be reproduced by the test case generic/457 from
fstests. When it happens it produces the following trace:
BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item
BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610
BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884
item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4
item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4
item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4
item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4
item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4
item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4
(...)
BTRFS error (device dm-0): block=30625792 write time tree block corruption detected
------------[ cut here ]------------
WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs]
Modules linked in: btrfs dm_thin_pool ...
CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs]
Code: c7 c7 ...
RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001
RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0
R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000
FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btree_submit_bio_hook+0x67/0xc0 [btrfs]
submit_one_bio+0x31/0x50 [btrfs]
btree_write_cache_pages+0x2db/0x4b0 [btrfs]
? __filemap_fdatawrite_range+0xb1/0x110
do_writepages+0x23/0x80
__filemap_fdatawrite_range+0xd2/0x110
btrfs_write_marked_extents+0x15e/0x180 [btrfs]
btrfs_sync_log+0x206/0x10a0 [btrfs]
? kmem_cache_free+0x315/0x3b0
? btrfs_log_inode+0x1e8/0xf90 [btrfs]
? __mutex_unlock_slowpath+0x45/0x2a0
? lockref_put_or_lock+0x9/0x30
? dput+0x2d/0x580
? dput+0xb5/0x580
? btrfs_sync_file+0x464/0x4d0 [btrfs]
btrfs_sync_file+0x464/0x4d0 [btrfs]
do_fsync+0x38/0x60
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fb41953a6d0
Code: 48 3d ...
RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0
RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003
RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009
R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060
R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last enabled at (0): [<ffffffffa96bdedf>] copy_process+0x74f/0x2020
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace d543fc76f5ad7fd8 ]---
In that trace the tree checker detected the overlapping checksum items at
the time when we triggered writeback for the log tree when syncing the
log.
Another trace that can happen is due to BUG_ON() when deleting checksum
items while logging an inode:
BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584)
BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610
BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473
item 0 key (257 1 0) itemoff 16123 itemsize 160
inode generation 7 size 262144 mode 100600
item 1 key (257 12 256) itemoff 16103 itemsize 20
item 2 key (257 108 0) itemoff 16050 itemsize 53
extent data disk bytenr 13631488 nr 4096
extent data offset 0 nr 131072 ram 131072
(...)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:3153!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs]
Code: 0f b6 ...
RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001
RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001
R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08
R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace
FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_del_csums+0x2f4/0x540 [btrfs]
copy_items+0x4b5/0x560 [btrfs]
btrfs_log_inode+0x910/0xf90 [btrfs]
btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
? dget_parent+0x5/0x370
btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
btrfs_sync_file+0x42b/0x4d0 [btrfs]
__x64_sys_msync+0x199/0x200
do_syscall_64+0x5c/0x280
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fe586c65760
Code: 00 f7 ...
RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760
RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000
RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61
R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1
R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420
Modules linked in: dm_log_writes ...
---[ end trace c92a7f447a8515f5 ]---
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6d3113a193e3385c72240096fe397618ecab6e43 upstream.
In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
stripe or chunk, we submit orig_bio without cloning it. In this case, we
don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
decrement pending_bios to -1, and we never complete orig_bio. Fix it by
initializing pending_bios to 1 instead of incrementing later.
Fixing this exposes another bug: we put orig_bio prematurely and then
put it again from end_io. Fix it by not putting orig_bio.
After this change, pending_bios is really more of a reference count, but
I'll leave that cleanup separate to keep the fix small.
Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 51415b6c1b117e223bc083e30af675cb5c5498f3 upstream.
[BUG]
When balance is canceled, there is a pretty high chance that unmounting
the fs can lead to lead the NULL pointer dereference:
BTRFS warning (device dm-3): page private not zero on page 223158272
...
BTRFS warning (device dm-3): page private not zero on page 223162368
BTRFS error (device dm-3): leaked root 18446744073709551608-304 refcount 1
BUG: kernel NULL pointer dereference, address: 0000000000000168
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 2 PID: 5793 Comm: umount Tainted: G O 5.7.0-rc5-custom+ #53
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__lock_acquire+0x5dc/0x24c0
Call Trace:
lock_acquire+0xab/0x390
_raw_spin_lock+0x39/0x80
btrfs_release_extent_buffer_pages+0xd7/0x200 [btrfs]
release_extent_buffer+0xb2/0x170 [btrfs]
free_extent_buffer+0x66/0xb0 [btrfs]
btrfs_put_root+0x8e/0x130 [btrfs]
btrfs_check_leaked_roots.cold+0x5/0x5d [btrfs]
btrfs_free_fs_info+0xe5/0x120 [btrfs]
btrfs_kill_super+0x1f/0x30 [btrfs]
deactivate_locked_super+0x3b/0x80
deactivate_super+0x3e/0x50
cleanup_mnt+0x109/0x160
__cleanup_mnt+0x12/0x20
task_work_run+0x67/0xa0
exit_to_usermode_loop+0xc5/0xd0
syscall_return_slowpath+0x205/0x360
do_syscall_64+0x6e/0xb0
entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7fd028ef740b
[CAUSE]
When balance is canceled, all reloc roots are marked as orphan, and
orphan reloc roots are going to be cleaned up.
However for orphan reloc roots and merged reloc roots, their lifespan
are quite different:
Merged reloc roots | Orphan reloc roots by cancel
--------------------------------------------------------------------
create_reloc_root() | create_reloc_root()
|- refs == 1 | |- refs == 1
|
btrfs_grab_root(reloc_root); | btrfs_grab_root(reloc_root);
|- refs == 2 | |- refs == 2
|
root->reloc_root = reloc_root; | root->reloc_root = reloc_root;
>>> No difference so far <<<
|
prepare_to_merge() | prepare_to_merge()
|- btrfs_set_root_refs(item, 1);| |- if (!err) (err == -EINTR)
|
merge_reloc_roots() | merge_reloc_roots()
|- merge_reloc_root() | |- Doing nothing to put reloc root
|- insert_dirty_subvol() | |- refs == 2
|- __del_reloc_root() |
|- btrfs_put_root() |
|- refs == 1 |
>>> Now orphan reloc roots still have refs 2 <<<
|
clean_dirty_subvols() | clean_dirty_subvols()
|- btrfs_drop_snapshot() | |- btrfS_drop_snapshot()
|- reloc_root get freed | |- reloc_root still has refs 2
| related ebs get freed, but
| reloc_root still recorded in
| allocated_roots
btrfs_check_leaked_roots() | btrfs_check_leaked_roots()
|- No leaked roots | |- Leaked reloc_roots detected
| |- btrfs_put_root()
| |- free_extent_buffer(root->node);
| |- eb already freed, caused NULL
| pointer dereference
[FIX]
The fix is to clear fs_root->reloc_root and put it at
merge_reloc_roots() time, so that we won't leak reloc roots.
Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.1+
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9c343784c4328781129bcf9e671645f69fe4b38a upstream.
Nikolay noticed a bunch of test failures with my global rsv steal
patches. At first he thought they were introduced by them, but they've
been failing for a while with 64k nodes.
The problem is with 64k nodes we have a global reserve that calculates
out to 13MiB on a freshly made file system, which only has 8MiB of
metadata space. Because of changes I previously made we no longer
account for the global reserve in the overcommit logic, which means we
correctly allow overcommit to happen even though we are already
overcommitted.
However in some corner cases, for example btrfs/170, we will allocate
the entire file system up with data chunks before we have enough space
pressure to allocate a metadata chunk. Then once the fs is full we
ENOSPC out because we cannot overcommit and the global reserve is taking
up all of the available space.
The most ideal way to deal with this is to change our space reservation
stuff to take into account the height of the tree's that we're
modifying, so that our global reserve calculation does not end up so
obscenely large.
However that is a huge undertaking. Instead fix this by forcing a chunk
allocation if the global reserve is larger than the total metadata
space. This gives us essentially the same behavior that happened
before, we get a chunk allocated and these tests can pass.
This is meant to be a stop-gap measure until we can tackle the "tree
height only" project.
Fixes: 0096420adb03 ("btrfs: do not account global reserve in can_overcommit")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 89efda52e6b6930f80f5adda9c3c9edfb1397191 upstream.
Whenever a chown is executed, all capabilities of the file being touched
are lost. When doing incremental send with a file with capabilities,
there is a situation where the capability can be lost on the receiving
side. The sequence of actions bellow shows the problem:
$ mount /dev/sda fs1
$ mount /dev/sdb fs2
$ touch fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_init
$ btrfs send fs1/snap_init | btrfs receive fs2
$ chgrp adm fs1/foo.bar
$ setcap cap_sys_nice+ep fs1/foo.bar
$ btrfs subvolume snapshot -r fs1 fs1/snap_complete
$ btrfs subvolume snapshot -r fs1 fs1/snap_incremental
$ btrfs send fs1/snap_complete | btrfs receive fs2
$ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2
At this point, only a chown was emitted by "btrfs send" since only the
group was changed. This makes the cap_sys_nice capability to be dropped
from fs2/snap_incremental/foo.bar
To fix that, only emit capabilities after chown is emitted. The current
code first checks for xattrs that are new/changed, emits them, and later
emit the chown. Now, __process_new_xattr skips capabilities, letting
only finish_inode_if_needed to emit them, if they exist, for the inode
being processed.
This behavior was being worked around in "btrfs receive" side by caching
the capability and only applying it after chown. Now, xattrs are only
emmited _after_ chown, making that workaround not needed anymore.
Link: https://github.com/kdave/btrfs-progs/issues/202
CC: stable@vger.kernel.org # 4.4+
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2473d24f2b77da0ffabcbb916793e58e7f57440b upstream.
When scrub is verifying the extents of a block group for a device, it is
possible that the corresponding block group gets removed and its logical
address and device extents get used for a new block group allocation.
When this happens scrub incorrectly reports that errors were detected
and, if the the new block group has a different profile then the old one,
deleted block group, we can crash due to a null pointer dereference.
Possibly other unexpected and weird consequences can happen as well.
Consider the following sequence of actions that leads to the null pointer
dereference crash when scrub is running in parallel with balance:
1) Balance sets block group X to read-only mode and starts relocating it.
Block group X is a metadata block group, has a raid1 profile (two
device extents, each one in a different device) and a logical address
of 19424870400;
2) Scrub is running and finds device extent E, which belongs to block
group X. It enters scrub_stripe() to find all extents allocated to
block group X, the search is done using the extent tree;
3) Balance finishes relocating block group X and removes block group X;
4) Balance starts relocating another block group and when trying to
commit the current transaction as part of the preparation step
(prepare_to_relocate()), it blocks because scrub is running;
5) The scrub task finds the metadata extent at the logical address
19425001472 and marks the pages of the extent to be read by a bio
(struct scrub_bio). The extent item's flags, which have the bit
BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
scrub_page). It is these flags in the scrub pages that tells the
bio's end io function (scrub_bio_end_io_worker) which type of extent
it is dealing with. At this point we end up with 4 pages in a bio
which is ready for submission (the metadata extent has a size of
16Kb, so that gives 4 pages on x86);
6) At the next iteration of scrub_stripe(), scrub checks that there is a
pause request from the relocation task trying to commit a transaction,
therefore it submits the pending bio and pauses, waiting for the
transaction commit to complete before resuming;
7) The relocation task commits the transaction. The device extent E, that
was used by our block group X, is now available for allocation, since
the commit root for the device tree was swapped by the transaction
commit;
8) Another task doing a direct IO write allocates a new data block group Y
which ends using device extent E. This new block group Y also ends up
getting the same logical address that block group X had: 19424870400.
This happens because block group X was the block group with the highest
logical address and, when allocating Y, find_next_chunk() returns the
end offset of the current last block group to be used as the logical
address for the new block group, which is
18351128576 + 1073741824 = 19424870400
So our new block group Y has the same logical address and device extent
that block group X had. However Y is a data block group, while X was
a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
9) After allocating block group Y, the direct IO submits a bio to write
to device extent E;
10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
extent E, which now correspond to the data written by the task that
did a direct IO write. Then at the end io function associated with
the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
which calls scrub_checksum(). This later function checks the flags
of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
is set in the flags, so it assumes it has a metadata extent and
then calls scrub_checksum_tree_block(). That functions returns an
error, since interpreting data as a metadata extent causes the
checksum verification to fail.
So this makes scrub_checksum() call scrub_handle_errored_block(),
which determines 'failed_mirror_index' to be 1, since the device
extent E was allocated as the second mirror of block group X.
It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
'sblocks_for_recheck', and all the memory is initialized to zeroes by
kcalloc().
After that it calls scrub_setup_recheck_block(), which is responsible
for filling each of those structures. However, when that function
calls btrfs_map_sblock() against the logical address of the metadata
extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
the current block group Y. However block group Y has a raid0 profile
and not a raid1 profile like X had, so the following call returns 1:
scrub_nr_raid_mirrors(bbio)
And as a result scrub_setup_recheck_block() only initializes the
first (index 0) scrub_block structure in 'sblocks_for_recheck'.
Then scrub_recheck_block() is called by scrub_handle_errored_block()
with the second (index 1) scrub_block structure as the argument,
because 'failed_mirror_index' was previously set to 1.
This scrub_block was not initialized by scrub_setup_recheck_block(),
so it has zero pages, its 'page_count' member is 0 and its 'pagev'
page array has all members pointing to NULL.
Finally when scrub_recheck_block() calls scrub_recheck_block_checksum()
we have a NULL pointer dereference when accessing the flags of the first
page, as pavev[0] is NULL:
static void scrub_recheck_block_checksum(struct scrub_block *sblock)
{
(...)
if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA)
scrub_checksum_data(sblock);
(...)
}
Producing a stack trace like the following:
[542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028
[542998.010238] #PF: supervisor read access in kernel mode
[542998.010878] #PF: error_code(0x0000) - not-present page
[542998.011516] PGD 0 P4D 0
[542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1
[542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
[542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs]
[542998.018474] Code: 4c 89 e6 ...
[542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202
[542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000
[542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120
[542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001
[542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800
[542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0
[542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000
[542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0
[542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[542998.032380] Call Trace:
[542998.032752] scrub_recheck_block+0x162/0x400 [btrfs]
[542998.033500] ? __alloc_pages_nodemask+0x31e/0x460
[542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs]
[542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs]
[542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs]
[542998.036735] process_one_work+0x26d/0x6a0
[542998.037275] worker_thread+0x4f/0x3e0
[542998.037740] ? process_one_work+0x6a0/0x6a0
[542998.038378] kthread+0x103/0x140
[542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70
[542998.039419] ret_from_fork+0x3a/0x50
[542998.039875] Modules linked in: dm_snapshot dm_thin_pool ...
[542998.047288] CR2: 0000000000000028
[542998.047724] ---[ end trace bde186e176c7f96a ]---
This issue has been around for a long time, possibly since scrub exists.
The last time I ran into it was over 2 years ago. After recently fixing
fstests to pass the "--full-balance" command line option to btrfs-progs
when doing balance, several tests started to more heavily exercise balance
with fsstress, scrub and other operations in parallel, and therefore
started to hit this issue again (with btrfs/061 for example).
Fix this by having scrub increment the 'trimming' counter of the block
group, which pins the block group in such a way that it guarantees neither
its logical address nor device extents can be reused by future block group
allocations until we decrement the 'trimming' counter. Also make sure that
on each iteration of scrub_stripe() we stop scrubbing the block group if
it was removed already.
A later patch in the series will rename the block group's 'trimming'
counter and its helpers to a more generic name, since now it is not used
exclusively for pinning while trimming anymore.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 998a0671961f66e9fad4990ed75f80ba3088c2f1 upstream.
btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to
the bdev with greatest device::generation number. For a typical-missing
device the generation number is zero so fs_devices::latest_bdev will
never point to it.
But if the missing device is due to alienation [1], then
device::generation is not zero and if it is greater or equal to the rest
of device generations in the list, then fs_devices::latest_bdev ends up
pointing to the missing device and reports the error like [2].
[1] We maintain devices of a fsid (as in fs_device::fsid) in the
fs_devices::devices list, a device is considered as an alien device
if its fsid does not match with the fs_device::fsid
Consider a working filesystem with raid1:
$ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb
$ mount /dev/sda /mnt-raid1
$ umount /mnt-raid1
While mnt-raid1 was unmounted the user force-adds one of its devices to
another btrfs filesystem:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt-single
$ btrfs dev add -f /dev/sda /mnt-single
Now the original mnt-raid1 fails to mount in degraded mode, because
fs_devices::latest_bdev is pointing to the alien device.
$ mount -o degraded /dev/sdb /mnt-raid1
[2]
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing
kernel: BTRFS error (device sdb): failed to read devices
kernel: BTRFS error (device sdb): open_ctree failed
Fix the root cause by checking if the device is not missing before it
can be considered for the fs_devices::latest_bdev.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7f551d969037cc128eca60688d9c5a300d84e665 upstream.
When an old device has new fsid through 'btrfs device add -f <dev>' our
fs_devices list has an alien device in one of the fs_devices lists.
By having an alien device in fs_devices, we have two issues so far
1. missing device does not not show as missing in the userland
2. degraded mount will fail
Both issues are caused by the fact that there's an alien device in the
fs_devices list. (Alien means that it does not belong to the filesystem,
identified by fsid, or does not contain btrfs filesystem at all, eg. due
to overwrite).
A device can be scanned/added through the control device ioctls
SCAN_DEV, DEVICES_READY or by ADD_DEV.
And device coming through the control device is checked against the all
other devices in the lists, but this was not the case for ADD_DEV.
This patch fixes both issues above by removing the alien device.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 47227d27e2fcb01a9e8f5958d8997cf47a820afc ]
The memcmp KASAN self-test fails on a kernel with both KASAN and
FORTIFY_SOURCE.
When FORTIFY_SOURCE is on, a number of functions are replaced with
fortified versions, which attempt to check the sizes of the operands.
However, these functions often directly invoke __builtin_foo() once they
have performed the fortify check. Using __builtins may bypass KASAN
checks if the compiler decides to inline it's own implementation as
sequence of instructions, rather than emit a function call that goes out
to a KASAN-instrumented implementation.
Why is only memcmp affected?
============================
Of the string and string-like functions that kasan_test tests, only memcmp
is replaced by an inline sequence of instructions in my testing on x86
with gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2).
I believe this is due to compiler heuristics. For example, if I annotate
kmalloc calls with the alloc_size annotation (and disable some fortify
compile-time checking!), the compiler will replace every memset except the
one in kmalloc_uaf_memset with inline instructions. (I have some WIP
patches to add this annotation.)
Does this affect other functions in string.h?
=============================================
Yes. Anything that uses __builtin_* rather than __real_* could be
affected. This looks like:
- strncpy
- strcat
- strlen
- strlcpy maybe, under some circumstances?
- strncat under some circumstances
- memset
- memcpy
- memmove
- memcmp (as noted)
- memchr
- strcpy
Whether a function call is emitted always depends on the compiler. Most
bugs should get caught by FORTIFY_SOURCE, but the missed memcmp test shows
that this is not always the case.
Isn't FORTIFY_SOURCE disabled with KASAN?
========================================-
The string headers on all arches supporting KASAN disable fortify with
kasan, but only when address sanitisation is _also_ disabled. For example
from x86:
#if defined(CONFIG_KASAN) && !defined(__SANITIZE_ADDRESS__)
/*
* For files that are not instrumented (e.g. mm/slub.c) we
* should use not instrumented version of mem* functions.
*/
#define memcpy(dst, src, len) __memcpy(dst, src, len)
#define memmove(dst, src, len) __memmove(dst, src, len)
#define memset(s, c, n) __memset(s, c, n)
#ifndef __NO_FORTIFY
#define __NO_FORTIFY /* FORTIFY_SOURCE uses __builtin_memcpy, etc. */
#endif
#endif
This comes from commit 6974f0c4555e ("include/linux/string.h: add the
option of fortified string.h functions"), and doesn't work when KASAN is
enabled and the file is supposed to be sanitised - as with test_kasan.c
I'm pretty sure this is not wrong, but not as expansive it should be:
* we shouldn't use __builtin_memcpy etc in files where we don't have
instrumentation - it could devolve into a function call to memcpy,
which will be instrumented. Rather, we should use __memcpy which
by convention is not instrumented.
* we also shouldn't be using __builtin_memcpy when we have a KASAN
instrumented file, because it could be replaced with inline asm
that will not be instrumented.
What is correct behaviour?
==========================
Firstly, there is some overlap between fortification and KASAN: both
provide some level of _runtime_ checking. Only fortify provides
compile-time checking.
KASAN and fortify can pick up different things at runtime:
- Some fortify functions, notably the string functions, could easily be
modified to consider sub-object sizes (e.g. members within a struct),
and I have some WIP patches to do this. KASAN cannot detect these
because it cannot insert poision between members of a struct.
- KASAN can detect many over-reads/over-writes when the sizes of both
operands are unknown, which fortify cannot.
So there are a couple of options:
1) Flip the test: disable fortify in santised files and enable it in
unsanitised files. This at least stops us missing KASAN checking, but
we lose the fortify checking.
2) Make the fortify code always call out to real versions. Do this only
for KASAN, for fear of losing the inlining opportunities we get from
__builtin_*.
(We can't use kasan_check_{read,write}: because the fortify functions are
_extern inline_, you can't include _static_ inline functions without a
compiler warning. kasan_check_{read,write} are static inline so we can't
use them even when they would otherwise be suitable.)
Take approach 2 and call out to real versions when KASAN is enabled.
Use __underlying_foo to distinguish from __real_foo: __real_foo always
refers to the kernel's implementation of foo, __underlying_foo could be
either the kernel implementation or the __builtin_foo implementation.
This is sometimes enough to make the memcmp test succeed with
FORTIFY_SOURCE enabled. It is at least enough to get the function call
into the module. One more fix is needed to make it reliable: see the next
patch.
Fixes: 6974f0c4555e ("include/linux/string.h: add the option of fortified string.h functions")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: David Gow <davidgow@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Link: http://lkml.kernel.org/r/20200423154503.5103-3-dja@axtens.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit adb72ae1915db28f934e9e02c18bfcea2f3ed3b7 ]
Patch series "Fix some incompatibilites between KASAN and FORTIFY_SOURCE", v4.
3 KASAN self-tests fail on a kernel with both KASAN and FORTIFY_SOURCE:
memchr, memcmp and strlen.
When FORTIFY_SOURCE is on, a number of functions are replaced with
fortified versions, which attempt to check the sizes of the operands.
However, these functions often directly invoke __builtin_foo() once they
have performed the fortify check. The compiler can detect that the
results of these functions are not used, and knows that they have no other
side effects, and so can eliminate them as dead code.
Why are only memchr, memcmp and strlen affected?
================================================
Of string and string-like functions, kasan_test tests:
* strchr -> not affected, no fortified version
* strrchr -> likewise
* strcmp -> likewise
* strncmp -> likewise
* strnlen -> not affected, the fortify source implementation calls the
underlying strnlen implementation which is instrumented, not
a builtin
* strlen -> affected, the fortify souce implementation calls a __builtin
version which the compiler can determine is dead.
* memchr -> likewise
* memcmp -> likewise
* memset -> not affected, the compiler knows that memset writes to its
first argument and therefore is not dead.
Why does this not affect the functions normally?
================================================
In string.h, these functions are not marked as __pure, so the compiler
cannot know that they do not have side effects. If relevant functions are
marked as __pure in string.h, we see the following warnings and the
functions are elided:
lib/test_kasan.c: In function `kasan_memchr':
lib/test_kasan.c:606:2: warning: statement with no effect [-Wunused-value]
memchr(ptr, '1', size + 1);
^~~~~~~~~~~~~~~~~~~~~~~~~~
lib/test_kasan.c: In function `kasan_memcmp':
lib/test_kasan.c:622:2: warning: statement with no effect [-Wunused-value]
memcmp(ptr, arr, size+1);
^~~~~~~~~~~~~~~~~~~~~~~~
lib/test_kasan.c: In function `kasan_strings':
lib/test_kasan.c:645:2: warning: statement with no effect [-Wunused-value]
strchr(ptr, '1');
^~~~~~~~~~~~~~~~
...
This annotation would make sense to add and could be added at any point,
so the behaviour of test_kasan.c should change.
The fix
=======
Make all the functions that are pure write their results to a global,
which makes them live. The strlen and memchr tests now pass.
The memcmp test still fails to trigger, which is addressed in the next
patch.
[dja@axtens.net: drop patch 3]
Link: http://lkml.kernel.org/r/20200424145521.8203-2-dja@axtens.net
Fixes: 0c96350a2d2f ("lib/test_kasan.c: add tests for several string/memory API functions")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: David Gow <davidgow@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Link: http://lkml.kernel.org/r/20200423154503.5103-1-dja@axtens.net
Link: http://lkml.kernel.org/r/20200423154503.5103-2-dja@axtens.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit effe5be17706167ee968fa28afe40dec9c6f71db ]
Certain kernel functions (e.g. get_vtimer/set_vtimer) cause kernel
panic when the stack is not 8-byte aligned. Currently JITed BPF programs
may trigger this by allocating stack frames with non-rounded sizes and
then being interrupted. Fix by using rounded fp->aux->stack_depth.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200602174339.2501066-1-iii@linux.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 836e66c218f355ec01ba57671c85abf32961dcea ]
Lorenz recently reported:
In our TC classifier cls_redirect [0], we use the following sequence of
helper calls to decapsulate a GUE (basically IP + UDP + custom header)
encapsulated packet:
bpf_skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
bpf_redirect(skb->ifindex, BPF_F_INGRESS)
It seems like some checksums of the inner headers are not validated in
this case. For example, a TCP SYN packet with invalid TCP checksum is
still accepted by the network stack and elicits a SYN ACK. [...]
That is, we receive the following packet from the driver:
| ETH | IP | UDP | GUE | IP | TCP |
skb->ip_summed == CHECKSUM_UNNECESSARY
ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx checksum offloading.
On this packet we run skb_adjust_room_mac(-encap_len), and get the following:
| ETH | IP | TCP |
skb->ip_summed == CHECKSUM_UNNECESSARY
Note that ip_summed is still CHECKSUM_UNNECESSARY. After bpf_redirect()'ing
into the ingress, we end up in tcp_v4_rcv(). There, skb_checksum_init() is
turned into a no-op due to CHECKSUM_UNNECESSARY.
The bpf_skb_adjust_room() helper is not aware of protocol specifics. Internally,
it handles the CHECKSUM_COMPLETE case via skb_postpull_rcsum(), but that does
not cover CHECKSUM_UNNECESSARY. In this case skb->csum_level of the original
skb prior to bpf_skb_adjust_room() call was 0, that is, covering UDP. Right now
there is no way to adjust the skb->csum_level. NICs that have checksum offload
disabled (CHECKSUM_NONE) or that support CHECKSUM_COMPLETE are not affected.
Use a safe default for CHECKSUM_UNNECESSARY by resetting to CHECKSUM_NONE and
add a flag to the helper called BPF_F_ADJ_ROOM_NO_CSUM_RESET that allows users
from opting out. Opting out is useful for the case where we don't remove/add
full protocol headers, or for the case where a user wants to adjust the csum
level manually e.g. through bpf_csum_level() helper that is added in subsequent
patch.
The bpf_skb_proto_{4_to_6,6_to_4}() for NAT64/46 translation from the BPF
bpf_skb_change_proto() helper uses bpf_skb_net_hdr_{push,pop}() pair internally
as well but doesn't change layers, only transitions between v4 to v6 and vice
versa, therefore no adoption is required there.
[0] https://lore.kernel.org/bpf/20200424185556.7358-1-lmb@cloudflare.com/
Fixes: 2be7e212d541 ("bpf: add bpf_skb_adjust_room helper")
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Reported-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/bpf/CACAyw9-uU_52esMd1JjuA80fRPHJv5vsSg8GnfW3t_qDU4aVKQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/11a90472e7cce83e76ddbfce81fdfce7bfc68808.1591108731.git.daniel@iogearbox.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b8215dce7dfd817ca38807f55165bf502146cd68 ]
test_flow_dissector leaves a TAP device after it's finished, potentially
interfering with other tests that will run after it. Fix it by closing the
TAP descriptor on cleanup.
Fixes: 0905beec9f52 ("selftests/bpf: run flow dissector tests in skb-less mode")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-11-jakub@cloudflare.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e91de6afa81c10e9f855c5695eb9a53168d96b73 ]
KTLS uses a stream parser to collect TLS messages and send them to
the upper layer tls receive handler. This ensures the tls receiver
has a full TLS header to parse when it is run. However, when a
socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
is enabled we end up with two stream parsers running on the same
socket.
The result is both try to run on the same socket. First the KTLS
stream parser runs and calls read_sock() which will tcp_read_sock
which in turn calls tcp_rcv_skb(). This dequeues the skb from the
sk_receive_queue. When this is done KTLS code then data_ready()
callback which because we stacked KTLS on top of the bpf stream
verdict program has been replaced with sk_psock_start_strp(). This
will in turn kick the stream parser again and eventually do the
same thing KTLS did above calling into tcp_rcv_skb() and dequeuing
a skb from the sk_receive_queue.
At this point the data stream is broke. Part of the stream was
handled by the KTLS side some other bytes may have been handled
by the BPF side. Generally this results in either missing data
or more likely a "Bad Message" complaint from the kTLS receive
handler as the BPF program steals some bytes meant to be in a
TLS header and/or the TLS header length is no longer correct.
We've already broke the idealized model where we can stack ULPs
in any order with generic callbacks on the TX side to handle this.
So in this patch we do the same thing but for RX side. We add
a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
program is running and add a tls_sw_has_ctx_rx() helper so BPF
side can learn there is a TLS ULP on the socket.
Then on BPF side we omit calling our stream parser to avoid
breaking the data stream for the KTLS receiver. Then on the
KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
receiver is done with the packet but before it posts the
msg to userspace. This gives us symmetry between the TX and
RX halfs and IMO makes it usable again. On the TX side we
process packets in this order BPF -> TLS -> TCP and on
the receive side in the reverse order TCP -> TLS -> BPF.
Discovered while testing OpenSSL 3.0 Alpha2.0 release.
Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ca2f5f21dbbd5e3a00cd3e97f728aa2ca0b2e011 ]
We will need this block of code called from tls context shortly
lets refactor the redirect logic so its easy to use. This also
cleans up the switch stmt so we have fewer fallthrough cases.
No logic changes are intended.
Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1ea0f9120c8ce105ca181b070561df5cbd6bc049 ]
The map_lookup_and_delete_elem() function should check for both FMODE_CAN_WRITE
and FMODE_CAN_READ permissions because it returns a map element to user space.
Fixes: bd513cd08f10 ("bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200527185700.14658-5-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 601b05ca6edb0422bf6ce313fbfd55ec7bbbc0fd ]
In case the cpu_bufs are sparsely allocated they are not all
free'ed. These changes will fix this.
Fixes: fb84b8224655 ("libbpf: add perf buffer API")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/159056888305.330763.9684536967379110349.stgit@ebuild
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7b91f1565fbfbe5a162d91f8a1f6c5580c2fc1d0 ]
On the ASUS laptop UX325JA/UX425JA, most of the media keys are not
working due to the ASUS WMI driver fails to be loaded. The ACPI error
as follows leads to the failure of asus_wmi_evaluate_method.
ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [IIA3] at bit offset/length 96/32 exceeds size of target Buffer (96 bits) (20200326/dsopcode-203)
No Local Variables are initialized for Method [WMNB]
ACPI Error: Aborting method \_SB.ATKD.WMNB due to previous error (AE_AML_BUFFER_LIMIT) (20200326/psparse-531)
The DSDT for the WMNB part shows that 5 DWORD required for local
variables and the 3rd variable IIA3 hit the buffer limit.
Method (WMNB, 3, Serialized)
{ ..
CreateDWordField (Arg2, Zero, IIA0)
CreateDWordField (Arg2, 0x04, IIA1)
CreateDWordField (Arg2, 0x08, IIA2)
CreateDWordField (Arg2, 0x0C, IIA3)
CreateDWordField (Arg2, 0x10, IIA4)
Local0 = (Arg1 & 0xFFFFFFFF)
If ((Local0 == 0x54494E49))
..
}
The limitation is determined by the input acpi_buffer size passed
to the wmi_evaluate_method. Since the struct bios_args is the data
structure used as input buffer by default for all ASUS WMI calls,
the size needs to be expanded to fix the problem.
Signed-off-by: Chris Chiu <chiu@endlessm.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
chasis-type
[ Upstream commit cfae58ed681c5fe0185db843013ecc71cd265ebf ]
The HP Stream x360 11-p000nd no longer report SW_TABLET_MODE state / events
with recent kernels. This model reports a chassis-type of 10 / "Notebook"
which is not on the recently introduced chassis-type whitelist
Commit de9647efeaa9 ("platform/x86: intel-vbtn: Only activate tablet mode
switch on 2-in-1's") added a chassis-type whitelist and only listed 31 /
"Convertible" as being capable of generating valid SW_TABLET_MOD events.
Commit 1fac39fd0316 ("platform/x86: intel-vbtn: Also handle tablet-mode
switch on "Detachable" and "Portable" chassis-types") extended the
whitelist with chassis-types 8 / "Portable" and 32 / "Detachable".
And now we need to exten the whitelist again with 10 / "Notebook"...
The issue original fixed by the whitelist is really a ACPI DSDT bug on
the Dell XPS 9360 where it has a VGBS which reports it is in tablet mode
even though it is not a 2-in-1 at all, but a regular laptop.
So since this is a workaround for a DSDT issue on that specific model,
instead of extending the whitelist over and over again, lets switch to
a blacklist and only blacklist the chassis-type of the model for which
the chassis-type check was added.
Note this also fixes the current version of the code no longer checking
if dmi_get_system_info(DMI_CHASSIS_TYPE) returns NULL.
Fixes: 1fac39fd0316 ("platform/x86: intel-vbtn: Also handle tablet-mode switch on "Detachable" and "Portable" chassis-types")
Cc: Mario Limonciello <mario.limonciello@dell.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Mario Limonciello <Mario.limonciello@dell.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8fe63eb757ac6e661a384cc760792080bdc738dc ]
HEBC method reports capabilities of 5 button array but HP Spectre X2 (2015)
does not have this control method (the same was for Wacom MobileStudio Pro).
Expand previous DMI quirk by Alex Hung to also enable 5 button array
for this system.
Signed-off-by: Nickolai Kozachenko <daemongloom@gmail.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 765dd7a1827c687b782e6ab3dd6daf4d13a4780f ]
Currently the driver prevents a user from doing
modprobe ice
ethtool -L eth0 combined 5
ip link set eth0 up
The ethtool command fails, because the driver is checking to see if the
interface is down before allowing the get_channels to proceed (even for
a set_channels).
Remove this check and allow the user to configure the interface
before bringing it up, which is a much better usability case.
Fixes: 87324e747fde ("ice: Implement ethtool ops for channels")
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5cdc45ed3948042f0d73c6fec5ee9b59e637d0d2 ]
First of all, unsigned long can overflow u32 value on 64-bit machine.
Second, simple_strtoul() doesn't check for overflow in the input.
Convert simple_strtoul() to kstrtou32() to eliminate above issues.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 7b53d59859bc932b37895d2d37388e7fa29af7a5 ]
Overflowed requests in io_uring_cancel_files() should be shed only of
inflight and overflowed refs. All other left references are owned by
someone else.
If refcount_sub_and_test() fails, it will go further and put put extra
ref, don't do that. Also, don't need to do io_wq_cancel_work()
for overflowed reqs, they will be let go shortly anyway.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 263b81dc6c932c8bc550d5e7bfc178d2b3fc491e ]
ColdFire is a big-endian cpu with a big-endian dspi hw module,
so, it uses native access, but memcpy breaks the endianness.
So, if i understand properly, by native copy we would mean
be(cpu)->be(dspi) or le(cpu)->le(dspi) accesses, so my fix
shouldn't break anything, but i couldn't test it on LS family,
so every test is really appreciated.
Fixes: 53fadb4d90c7 ("spi: spi-fsl-dspi: Simplify bytes_per_word gymnastics")
Signed-off-by: Angelo Dureghello <angelo.dureghello@timesys.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20200529195756.184677-1-angelo.dureghello@timesys.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c343bf1ba5efcbf2266a1fe3baefec9cc82f867f ]
kobject_init_and_add() takes reference even when it fails.
If this function returns an error, kobject_put() must be called to
properly clean up the memory associated with the object.
Previous commit "b8eb718348b8" fixed a similar problem.
Signed-off-by: Qiushi Wu <wu000273@umn.edu>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f0410bbf7d0fb80149e3b17d11d31f5b5197873e ]
DW APB SSI DMA-part of the driver may need to perform the requested
SPI-transfer synchronously. In that case the dma_transfer() callback
will return 0 as a marker of the SPI transfer being finished so the
SPI core doesn't need to wait and may proceed with the SPI message
trasnfers pumping procedure. This will be needed to fix the problem
when DMA transactions are finished, but there is still data left in
the SPI Tx/Rx FIFOs being sent/received. But for now make dma_transfer
to return 1 as the normal dw_spi_transfer_one() method.
Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru>
Cc: Georgy Vlasov <Georgy.Vlasov@baikalelectronics.ru>
Cc: Ramil Zaripov <Ramil.Zaripov@baikalelectronics.ru>
Cc: Alexey Malahov <Alexey.Malahov@baikalelectronics.ru>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: linux-mips@vger.kernel.org
Cc: devicetree@vger.kernel.org
Link: https://lore.kernel.org/r/20200529131205.31838-3-Sergey.Semin@baikalelectronics.ru
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1194be8c949b8190b2882ad8335a5d98aa50c735 ]
According the RM, the bit[6~0] of register ESDHC_TUNING_CTRL is
TUNING_START_TAP, bit[7] of this register is to disable the command
CRC check for standard tuning. So fix it here.
Fixes: d87fc9663688 ("mmc: sdhci-esdhc-imx: support setting tuning start point")
Signed-off-by: Haibo Chen <haibo.chen@nxp.com>
Link: https://lore.kernel.org/r/1590488522-9292-1-git-send-email-haibo.chen@nxp.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f327236df2afc8c3c711e7e070f122c26974f4da ]
When mvm is initialized we alloc aux station with aux queue.
We later free the station memory when driver is stopped, but we
never free the queue's memory, which casues a leak.
Add a proper de-initialization of the station.
Signed-off-by: Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200529092401.0121c5be55e9.Id7516fbb3482131d0c9dfb51ff20b226617ddb49@changeid
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3b70683fc4d68f5d915d9dc7e5ba72c732c7315c ]
ubsan report this warning, fix it by adding a unsigned suffix.
UBSAN: signed-integer-overflow in
drivers/net/ethernet/intel/ixgbe/ixgbe_common.c:2246:26
65535 * 65537 cannot be represented in type 'int'
CPU: 21 PID: 7 Comm: kworker/u256:0 Not tainted 5.7.0-rc3-debug+ #39
Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 03/27/2020
Workqueue: ixgbe ixgbe_service_task [ixgbe]
Call trace:
dump_backtrace+0x0/0x3f0
show_stack+0x28/0x38
dump_stack+0x154/0x1e4
ubsan_epilogue+0x18/0x60
handle_overflow+0xf8/0x148
__ubsan_handle_mul_overflow+0x34/0x48
ixgbe_fc_enable_generic+0x4d0/0x590 [ixgbe]
ixgbe_service_task+0xc20/0x1f78 [ixgbe]
process_one_work+0x8f0/0xf18
worker_thread+0x430/0x6d0
kthread+0x218/0x238
ret_from_fork+0x10/0x18
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit bc3a024101ca497bea4c69be4054c32a5c349f1d ]
If ice_init_interrupt_scheme fails, ice_probe will jump to clearing up
the interrupts. This can lead to some static analysis tools such as the
compiler sanitizers complaining about double free problems.
Since ice_init_interrupt_scheme already unrolls internally on failure,
there is no need to call ice_clear_interrupt_scheme when it fails. Add
a new unroll label and use that instead.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e93577ecde8f3cbd12a2eaa0522d5c85e0dbdd53 ]
Some controller as the ColdFire eshdc may require an endianness
byte swap, because DMA read endianness is not configurable.
Facilitate using the bounce buffer for this by adding
->copy_to_bounce_buffer().
Signed-off-by: Angelo Dureghello <angelo.dureghello@timesys.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Link: https://lore.kernel.org/r/20200518191742.1251440-2-angelo.dureghello@timesys.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 966244ccd2919e28f25555a77f204cd1c109cad8 ]
Using a fixed 1s timeout for all commands (and data transfers) is a bit
problematic.
For some commands it means waiting longer than needed for the timer to
expire, which may not a big issue, but still. For other commands, like for
an erase (CMD38) that uses a R1B response, may require longer timeouts than
1s. In these cases, we may end up treating the command as it failed, while
it just needed some more time to complete successfully.
Fix the problem by respecting the cmd->busy_timeout, which is provided by
the mmc core.
Cc: Bruce Chang <brucechang@via.com.tw>
Cc: Harald Welte <HaraldWelte@viatech.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20200414161413.3036-17-ulf.hansson@linaro.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f37ac1ae3ca93d0995553ad9604a25eadfe9406d ]
For commands that doesn't involve to prepare a data transfer, owl-mmc is
using a fixed 30s response timeout. This is a bit problematic.
For some commands it means waiting longer than needed for the completion to
expire, which may not a big issue, but still. For other commands, like for
an erase (CMD38) that uses a R1B response, may require longer timeouts than
30s. In these cases, we may end up treating the command as it failed, while
it just needed some more time to complete successfully.
Fix the problem by respecting the cmd->busy_timeout, which is provided by
the mmc core.
Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20200414161413.3036-8-ulf.hansson@linaro.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a389087ee9f195fcf2f31cd771e9ec5f02c16650 ]
Using a fixed 1s timeout for all commands is a bit problematic.
For some commands it means waiting longer than needed for the timeout to
expire, which may not a big issue, but still. For other commands, like for
an erase (CMD38) that uses a R1B response, may require longer timeouts than
1s. In these cases, we may end up treating the command as it failed, while
it just needed some more time to complete successfully.
Fix the problem by respecting the cmd->busy_timeout, which is provided by
the mmc core.
Cc: Rui Miguel Silva <rmfrfs@gmail.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Alex Elder <elder@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: greybus-dev@lists.linaro.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rui Miguel Silva <rmfrfs@gmail.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20200414161413.3036-20-ulf.hansson@linaro.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d863cb03fb2aac07f017b2a1d923cdbc35021280 ]
sdhci-msm can support auto cmd12.
So enable SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 quirk.
Signed-off-by: Veerabhadrarao Badiganti <vbadigan@codeaurora.org>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Link: https://lore.kernel.org/r/1587363626-20413-3-git-send-email-vbadigan@codeaurora.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3e09a81e166c0a5544832459be17561a6b231ac7 ]
Instead of reimplementing the logic in mmc_regulator_set_vqmmc(), use the
mmc code function directly.
This also allows us to fix a related issue on STM32MP1, when a voltage
switch of 1.8V is done for the eMMC, but the current level is already set
to 1.8V. More precisely, in this scenario the call to the
->post_sig_volt_switch() hangs, indefinitely waiting for the voltage switch
to complete. Fix this problem by checking if mmc_regulator_set_vqmmc()
returned 1 and then skip invoking the callback.
Signed-off-by: Marek Vasut <marex@denx.de>
Link: https://lore.kernel.org/r/20200416163649.336967-3-marex@denx.de
[Ulf: Updated the commit message]
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|