summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-05-17bpf: fix truncated jump targets on heavy expansionsDaniel Borkmann
Recently during testing, I ran into the following panic: [ 207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP [ 207.901637] Modules linked in: binfmt_misc [...] [ 207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G W 4.17.0-rc3+ #7 [ 207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017 [ 207.982428] pstate: 60400005 (nZCv daif +PAN -UAO) [ 207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0 [ 207.992603] lr : 0xffff000000bdb754 [ 207.996080] sp : ffff000013703ca0 [ 207.999384] x29: ffff000013703ca0 x28: 0000000000000001 [ 208.004688] x27: 0000000000000001 x26: 0000000000000000 [ 208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00 [ 208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000 [ 208.020599] x21: fffffffffeff2a6f x20: 000000000000000a [ 208.025903] x19: ffff000009578000 x18: 0000000000000a03 [ 208.031206] x17: 0000000000000000 x16: 0000000000000000 [ 208.036510] x15: 0000ffff9de83000 x14: 0000000000000000 [ 208.041813] x13: 0000000000000000 x12: 0000000000000000 [ 208.047116] x11: 0000000000000001 x10: ffff0000089e7f18 [ 208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000 [ 208.057723] x7 : 000000000000000a x6 : 00280c6160000000 [ 208.063026] x5 : 0000000000000018 x4 : 0000000000007db6 [ 208.068329] x3 : 000000000008647a x2 : 19868179b1484500 [ 208.073632] x1 : 0000000000000000 x0 : ffff000009578c08 [ 208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974) [ 208.086235] Call trace: [ 208.088672] bpf_skb_load_helper_8_no_cache+0x34/0xc0 [ 208.093713] 0xffff000000bdb754 [ 208.096845] bpf_test_run+0x78/0xf8 [ 208.100324] bpf_prog_test_run_skb+0x148/0x230 [ 208.104758] sys_bpf+0x314/0x1198 [ 208.108064] el0_svc_naked+0x30/0x34 [ 208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680) [ 208.117717] ---[ end trace 263cb8a59b5bf29f ]--- The program itself which caused this had a long jump over the whole instruction sequence where all of the inner instructions required heavy expansions into multiple BPF instructions. Additionally, I also had BPF hardening enabled which requires once more rewrites of all constant values in order to blind them. Each time we rewrite insns, bpf_adj_branches() would need to potentially adjust branch targets which cross the patchlet boundary to accommodate for the additional delta. Eventually that lead to the case where the target offset could not fit into insn->off's upper 0x7fff limit anymore where then offset wraps around becoming negative (in s16 universe), or vice versa depending on the jump direction. Therefore it becomes necessary to detect and reject any such occasions in a generic way for native eBPF and cBPF to eBPF migrations. For the latter we can simply check bounds in the bpf_convert_filter()'s BPF_EMIT_JMP helper macro and bail out once we surpass limits. The bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case of subsequent hardening) is a bit more complex in that we need to detect such truncations before hitting the bpf_prog_realloc(). Thus the latter is split into an extra pass to probe problematic offsets on the original program in order to fail early. With that in place and carefully tested I no longer hit the panic and the rewrites are rejected properly. The above example panic I've seen on bpf-next, though the issue itself is generic in that a guard against this issue in bpf seems more appropriate in this case. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-05-17Merge tag 'hwmon-for-linus-v4.17-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull hwmon fixes from Guenter Roeck: "Two k10temp fixes: - fix race condition when accessing System Management Network registers - fix reading critical temperatures on F15h M60h and M70h Also add PCI ID's for the AMD Raven Ridge root bridge" * tag 'hwmon-for-linus-v4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: hwmon: (k10temp) Use API function to access System Management Network x86/amd_nb: Add support for Raven Ridge CPUs hwmon: (k10temp) Fix reading critical temperature register
2018-05-18bpf: parse and verdict prog attach may race with bpf map updateJohn Fastabend
In the sockmap design BPF programs (SK_SKB_STREAM_PARSER, SK_SKB_STREAM_VERDICT and SK_MSG_VERDICT) are attached to the sockmap map type and when a sock is added to the map the programs are used by the socket. However, sockmap updates from both userspace and BPF programs can happen concurrently with the attach and detach of these programs. To resolve this we use the bpf_prog_inc_not_zero and a READ_ONCE() primitive to ensure the program pointer is not refeched and possibly NULL'd before the refcnt increment. This happens inside a RCU critical section so although the pointer reference in the map object may be NULL (by a concurrent detach operation) the reference from READ_ONCE will not be free'd until after grace period. This ensures the object returned by READ_ONCE() is valid through the RCU criticl section and safe to use as long as we "know" it may be free'd shortly. Daniel spotted a case in the sock update API where instead of using the READ_ONCE() program reference we used the pointer from the original map, stab->bpf_{verdict|parse|txmsg}. The problem with this is the logic checks the object returned from the READ_ONCE() is not NULL and then tries to reference the object again but using the above map pointer, which may have already been NULL'd by a parallel detach operation. If this happened bpf_porg_inc_not_zero could dereference a NULL pointer. Fix this by using variable returned by READ_ONCE() that is checked for NULL. Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-18bpf: sockmap update rollback on error can incorrectly dec prog refcntJohn Fastabend
If the user were to only attach one of the parse or verdict programs then it is possible a subsequent sockmap update could incorrectly decrement the refcnt on the program. This happens because in the rollback logic, after an error, we have to decrement the program reference count when its been incremented. However, we only increment the program reference count if the user has both a verdict and a parse program. The reason for this is because, at least at the moment, both are required for any one to be meaningful. The problem fixed here is in the rollback path we decrement the program refcnt even if only one existing. But we never incremented the refcnt in the first place creating an imbalance. This patch fixes the error path to handle this case. Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-17net: test tailroom before appending to linear skbWillem de Bruijn
Device features may change during transmission. In particular with corking, a device may toggle scatter-gather in between allocating and writing to an skb. Do not unconditionally assume that !NETIF_F_SG at write time implies that the same held at alloc time and thus the skb has sufficient tailroom. This issue predates git history. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17Merge branch 'ip6_gre-Fixes-in-headroom-handling'David S. Miller
Petr Machata says: ==================== net: ip6_gre: Fixes in headroom handling This series mends some problems in headroom management in ip6_gre module. The current code base has the following three closely-related problems: - ip6gretap tunnels neglect to ensure there's enough writable headroom before pushing GRE headers. - ip6erspan does this, but assumes that dev->needed_headroom is primed. But that doesn't happen until ip6_tnl_xmit() is called later. Thus for the first packet, ip6erspan actually behaves like ip6gretap above. - ip6erspan shares some of the code with ip6gretap, including calculations of needed header length. While there is custom ERSPAN-specific code for calculating the headroom, the computed values are overwritten by the ip6gretap code. The first two issues lead to a kernel panic in situations where a packet is mirrored from a veth device to the device in question. They are fixed, respectively, in patches #1 and #2, which include the full panic trace and a reproducer. The rest of the patchset deals with the last issue. In patches #3 to #6, several functions are split up into reusable parts. Finally in patch #7 these blocks are used to compose ERSPAN-specific callbacks where necessary to fix the hlen calculation. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: ip6_gre: Fix ip6erspan hlen calculationPetr Machata
Even though ip6erspan_tap_init() sets up hlen and tun_hlen according to what ERSPAN needs, it goes ahead to call ip6gre_tnl_link_config() which overwrites these settings with GRE-specific ones. Similarly for changelink callbacks, which are handled by ip6gre_changelink() calls ip6gre_tnl_change() calls ip6gre_tnl_link_config() as well. The difference ends up being 12 vs. 20 bytes, and this is generally not a problem, because a 12-byte request likely ends up allocating more and the extra 8 bytes are thus available. However correct it is not. So replace the newlink and changelink callbacks with an ERSPAN-specific ones, reusing the newly-introduced _common() functions. Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: ip6_gre: Split up ip6gre_changelink()Petr Machata
Extract from ip6gre_changelink() a reusable function ip6gre_changelink_common(). This will allow introduction of ERSPAN-specific _changelink() function with not a lot of code duplication. Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: ip6_gre: Split up ip6gre_newlink()Petr Machata
Extract from ip6gre_newlink() a reusable function ip6gre_newlink_common(). The ip6gre_tnl_link_config() call needs to be made customizable for ERSPAN, thus reorder it with calls to ip6_tnl_change_mtu() and dev_hold(), and extract the whole tail to the caller, ip6gre_newlink(). Thus enable an ERSPAN-specific _newlink() function without a lot of duplicity. Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: ip6_gre: Split up ip6gre_tnl_change()Petr Machata
Split a reusable function ip6gre_tnl_copy_tnl_parm() from ip6gre_tnl_change(). This will allow ERSPAN-specific code to reuse the common parts while customizing the behavior for ERSPAN. Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: ip6_gre: Split up ip6gre_tnl_link_config()Petr Machata
The function ip6gre_tnl_link_config() is used for setting up configuration of both ip6gretap and ip6erspan tunnels. Split the function into the common part and the route-lookup part. The latter then takes the calculated header length as an argument. This split will allow the patches down the line to sneak in a custom header length computation for the ERSPAN tunnel. Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: ip6_gre: Fix headroom request in ip6erspan_tunnel_xmit()Petr Machata
dev->needed_headroom is not primed until ip6_tnl_xmit(), so it starts out zero. Thus the call to skb_cow_head() fails to actually make sure there's enough headroom to push the ERSPAN headers to. That can lead to the panic cited below. (Reproducer below that). Fix by requesting either needed_headroom if already primed, or just the bare minimum needed for the header otherwise. [ 190.703567] kernel BUG at net/core/skbuff.c:104! [ 190.708384] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI [ 190.714007] Modules linked in: act_mirred cls_matchall ip6_gre ip6_tunnel tunnel6 gre sch_ingress vrf veth x86_pkg_temp_thermal mlx_platform nfsd e1000e leds_mlxcpld [ 190.728975] CPU: 1 PID: 959 Comm: kworker/1:2 Not tainted 4.17.0-rc4-net_master-custom-139 #10 [ 190.737647] Hardware name: Mellanox Technologies Ltd. "MSN2410-CB2F"/"SA000874", BIOS 4.6.5 03/08/2016 [ 190.747006] Workqueue: ipv6_addrconf addrconf_dad_work [ 190.752222] RIP: 0010:skb_panic+0xc3/0x100 [ 190.756358] RSP: 0018:ffff8801d54072f0 EFLAGS: 00010282 [ 190.761629] RAX: 0000000000000085 RBX: ffff8801c1a8ecc0 RCX: 0000000000000000 [ 190.768830] RDX: 0000000000000085 RSI: dffffc0000000000 RDI: ffffed003aa80e54 [ 190.776025] RBP: ffff8801bd1ec5a0 R08: ffffed003aabce19 R09: ffffed003aabce19 [ 190.783226] R10: 0000000000000001 R11: ffffed003aabce18 R12: ffff8801bf695dbe [ 190.790418] R13: 0000000000000084 R14: 00000000000006c0 R15: ffff8801bf695dc8 [ 190.797621] FS: 0000000000000000(0000) GS:ffff8801d5400000(0000) knlGS:0000000000000000 [ 190.805786] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 190.811582] CR2: 000055fa929aced0 CR3: 0000000003228004 CR4: 00000000001606e0 [ 190.818790] Call Trace: [ 190.821264] <IRQ> [ 190.823314] ? ip6erspan_tunnel_xmit+0x5e4/0x1982 [ip6_gre] [ 190.828940] ? ip6erspan_tunnel_xmit+0x5e4/0x1982 [ip6_gre] [ 190.834562] skb_push+0x78/0x90 [ 190.837749] ip6erspan_tunnel_xmit+0x5e4/0x1982 [ip6_gre] [ 190.843219] ? ip6gre_tunnel_ioctl+0xd90/0xd90 [ip6_gre] [ 190.848577] ? debug_check_no_locks_freed+0x210/0x210 [ 190.853679] ? debug_check_no_locks_freed+0x210/0x210 [ 190.858783] ? print_irqtrace_events+0x120/0x120 [ 190.863451] ? sched_clock_cpu+0x18/0x210 [ 190.867496] ? cyc2ns_read_end+0x10/0x10 [ 190.871474] ? skb_network_protocol+0x76/0x200 [ 190.875977] dev_hard_start_xmit+0x137/0x770 [ 190.880317] ? do_raw_spin_trylock+0x6d/0xa0 [ 190.884624] sch_direct_xmit+0x2ef/0x5d0 [ 190.888589] ? pfifo_fast_dequeue+0x3fa/0x670 [ 190.892994] ? pfifo_fast_change_tx_queue_len+0x810/0x810 [ 190.898455] ? __lock_is_held+0xa0/0x160 [ 190.902422] __qdisc_run+0x39e/0xfc0 [ 190.906041] ? _raw_spin_unlock+0x29/0x40 [ 190.910090] ? pfifo_fast_enqueue+0x24b/0x3e0 [ 190.914501] ? sch_direct_xmit+0x5d0/0x5d0 [ 190.918658] ? pfifo_fast_dequeue+0x670/0x670 [ 190.923047] ? __dev_queue_xmit+0x172/0x1770 [ 190.927365] ? preempt_count_sub+0xf/0xd0 [ 190.931421] __dev_queue_xmit+0x410/0x1770 [ 190.935553] ? ___slab_alloc+0x605/0x930 [ 190.939524] ? print_irqtrace_events+0x120/0x120 [ 190.944186] ? memcpy+0x34/0x50 [ 190.947364] ? netdev_pick_tx+0x1c0/0x1c0 [ 190.951428] ? __skb_clone+0x2fd/0x3d0 [ 190.955218] ? __copy_skb_header+0x270/0x270 [ 190.959537] ? rcu_read_lock_sched_held+0x93/0xa0 [ 190.964282] ? kmem_cache_alloc+0x344/0x4d0 [ 190.968520] ? cyc2ns_read_end+0x10/0x10 [ 190.972495] ? skb_clone+0x123/0x230 [ 190.976112] ? skb_split+0x820/0x820 [ 190.979747] ? tcf_mirred+0x554/0x930 [act_mirred] [ 190.984582] tcf_mirred+0x554/0x930 [act_mirred] [ 190.989252] ? tcf_mirred_act_wants_ingress.part.2+0x10/0x10 [act_mirred] [ 190.996109] ? __lock_acquire+0x706/0x26e0 [ 191.000239] ? sched_clock_cpu+0x18/0x210 [ 191.004294] tcf_action_exec+0xcf/0x2a0 [ 191.008179] tcf_classify+0xfa/0x340 [ 191.011794] __netif_receive_skb_core+0x8e1/0x1c60 [ 191.016630] ? debug_check_no_locks_freed+0x210/0x210 [ 191.021732] ? nf_ingress+0x500/0x500 [ 191.025458] ? process_backlog+0x347/0x4b0 [ 191.029619] ? print_irqtrace_events+0x120/0x120 [ 191.034302] ? lock_acquire+0xd8/0x320 [ 191.038089] ? process_backlog+0x1b6/0x4b0 [ 191.042246] ? process_backlog+0xc2/0x4b0 [ 191.046303] process_backlog+0xc2/0x4b0 [ 191.050189] net_rx_action+0x5cc/0x980 [ 191.053991] ? napi_complete_done+0x2c0/0x2c0 [ 191.058386] ? mark_lock+0x13d/0xb40 [ 191.062001] ? clockevents_program_event+0x6b/0x1d0 [ 191.066922] ? print_irqtrace_events+0x120/0x120 [ 191.071593] ? __lock_is_held+0xa0/0x160 [ 191.075566] __do_softirq+0x1d4/0x9d2 [ 191.079282] ? ip6_finish_output2+0x524/0x1460 [ 191.083771] do_softirq_own_stack+0x2a/0x40 [ 191.087994] </IRQ> [ 191.090130] do_softirq.part.13+0x38/0x40 [ 191.094178] __local_bh_enable_ip+0x135/0x190 [ 191.098591] ip6_finish_output2+0x54d/0x1460 [ 191.102916] ? ip6_forward_finish+0x2f0/0x2f0 [ 191.107314] ? ip6_mtu+0x3c/0x2c0 [ 191.110674] ? ip6_finish_output+0x2f8/0x650 [ 191.114992] ? ip6_output+0x12a/0x500 [ 191.118696] ip6_output+0x12a/0x500 [ 191.122223] ? ip6_route_dev_notify+0x5b0/0x5b0 [ 191.126807] ? ip6_finish_output+0x650/0x650 [ 191.131120] ? ip6_fragment+0x1a60/0x1a60 [ 191.135182] ? icmp6_dst_alloc+0x26e/0x470 [ 191.139317] mld_sendpack+0x672/0x830 [ 191.143021] ? igmp6_mcf_seq_next+0x2f0/0x2f0 [ 191.147429] ? __local_bh_enable_ip+0x77/0x190 [ 191.151913] ipv6_mc_dad_complete+0x47/0x90 [ 191.156144] addrconf_dad_completed+0x561/0x720 [ 191.160731] ? addrconf_rs_timer+0x3a0/0x3a0 [ 191.165036] ? mark_held_locks+0xc9/0x140 [ 191.169095] ? __local_bh_enable_ip+0x77/0x190 [ 191.173570] ? addrconf_dad_work+0x50d/0xa20 [ 191.177886] ? addrconf_dad_work+0x529/0xa20 [ 191.182194] addrconf_dad_work+0x529/0xa20 [ 191.186342] ? addrconf_dad_completed+0x720/0x720 [ 191.191088] ? __lock_is_held+0xa0/0x160 [ 191.195059] ? process_one_work+0x45d/0xe20 [ 191.199302] ? process_one_work+0x51e/0xe20 [ 191.203531] ? rcu_read_lock_sched_held+0x93/0xa0 [ 191.208279] process_one_work+0x51e/0xe20 [ 191.212340] ? pwq_dec_nr_in_flight+0x200/0x200 [ 191.216912] ? get_lock_stats+0x4b/0xf0 [ 191.220788] ? preempt_count_sub+0xf/0xd0 [ 191.224844] ? worker_thread+0x219/0x860 [ 191.228823] ? do_raw_spin_trylock+0x6d/0xa0 [ 191.233142] worker_thread+0xeb/0x860 [ 191.236848] ? process_one_work+0xe20/0xe20 [ 191.241095] kthread+0x206/0x300 [ 191.244352] ? process_one_work+0xe20/0xe20 [ 191.248587] ? kthread_stop+0x570/0x570 [ 191.252459] ret_from_fork+0x3a/0x50 [ 191.256082] Code: 14 3e ff 8b 4b 78 55 4d 89 f9 41 56 41 55 48 c7 c7 a0 cf db 82 41 54 44 8b 44 24 2c 48 8b 54 24 30 48 8b 74 24 20 e8 16 94 13 ff <0f> 0b 48 c7 c7 60 8e 1f 85 48 83 c4 20 e8 55 ef a6 ff 89 74 24 [ 191.275327] RIP: skb_panic+0xc3/0x100 RSP: ffff8801d54072f0 [ 191.281024] ---[ end trace 7ea51094e099e006 ]--- [ 191.285724] Kernel panic - not syncing: Fatal exception in interrupt [ 191.292168] Kernel Offset: disabled [ 191.295697] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Reproducer: ip link add h1 type veth peer name swp1 ip link add h3 type veth peer name swp3 ip link set dev h1 up ip address add 192.0.2.1/28 dev h1 ip link add dev vh3 type vrf table 20 ip link set dev h3 master vh3 ip link set dev vh3 up ip link set dev h3 up ip link set dev swp3 up ip address add dev swp3 2001:db8:2::1/64 ip link set dev swp1 up tc qdisc add dev swp1 clsact ip link add name gt6 type ip6erspan \ local 2001:db8:2::1 remote 2001:db8:2::2 oseq okey 123 ip link set dev gt6 up sleep 1 tc filter add dev swp1 ingress pref 1000 matchall skip_hw \ action mirred egress mirror dev gt6 ping -I h1 192.0.2.2 Fixes: e41c7c68ea77 ("ip6erspan: make sure enough headroom at xmit.") Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: ip6_gre: Request headroom in __gre6_xmit()Petr Machata
__gre6_xmit() pushes GRE headers before handing over to ip6_tnl_xmit() for generic IP-in-IP processing. However it doesn't make sure that there is enough headroom to push the header to. That can lead to the panic cited below. (Reproducer below that). Fix by requesting either needed_headroom if already primed, or just the bare minimum needed for the header otherwise. [ 158.576725] kernel BUG at net/core/skbuff.c:104! [ 158.581510] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI [ 158.587174] Modules linked in: act_mirred cls_matchall ip6_gre ip6_tunnel tunnel6 gre sch_ingress vrf veth x86_pkg_temp_thermal mlx_platform nfsd e1000e leds_mlxcpld [ 158.602268] CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 4.17.0-rc4-net_master-custom-139 #10 [ 158.610938] Hardware name: Mellanox Technologies Ltd. "MSN2410-CB2F"/"SA000874", BIOS 4.6.5 03/08/2016 [ 158.620426] RIP: 0010:skb_panic+0xc3/0x100 [ 158.624586] RSP: 0018:ffff8801d3f27110 EFLAGS: 00010286 [ 158.629882] RAX: 0000000000000082 RBX: ffff8801c02cc040 RCX: 0000000000000000 [ 158.637127] RDX: 0000000000000082 RSI: dffffc0000000000 RDI: ffffed003a7e4e18 [ 158.644366] RBP: ffff8801bfec8020 R08: ffffed003aabce19 R09: ffffed003aabce19 [ 158.651574] R10: 000000000000000b R11: ffffed003aabce18 R12: ffff8801c364de66 [ 158.658786] R13: 000000000000002c R14: 00000000000000c0 R15: ffff8801c364de68 [ 158.666007] FS: 0000000000000000(0000) GS:ffff8801d5400000(0000) knlGS:0000000000000000 [ 158.674212] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 158.680036] CR2: 00007f4b3702dcd0 CR3: 0000000003228002 CR4: 00000000001606e0 [ 158.687228] Call Trace: [ 158.689752] ? __gre6_xmit+0x246/0xd80 [ip6_gre] [ 158.694475] ? __gre6_xmit+0x246/0xd80 [ip6_gre] [ 158.699141] skb_push+0x78/0x90 [ 158.702344] __gre6_xmit+0x246/0xd80 [ip6_gre] [ 158.706872] ip6gre_tunnel_xmit+0x3bc/0x610 [ip6_gre] [ 158.711992] ? __gre6_xmit+0xd80/0xd80 [ip6_gre] [ 158.716668] ? debug_check_no_locks_freed+0x210/0x210 [ 158.721761] ? print_irqtrace_events+0x120/0x120 [ 158.726461] ? sched_clock_cpu+0x18/0x210 [ 158.730572] ? sched_clock_cpu+0x18/0x210 [ 158.734692] ? cyc2ns_read_end+0x10/0x10 [ 158.738705] ? skb_network_protocol+0x76/0x200 [ 158.743216] ? netif_skb_features+0x1b2/0x550 [ 158.747648] dev_hard_start_xmit+0x137/0x770 [ 158.752010] sch_direct_xmit+0x2ef/0x5d0 [ 158.755992] ? pfifo_fast_dequeue+0x3fa/0x670 [ 158.760460] ? pfifo_fast_change_tx_queue_len+0x810/0x810 [ 158.765975] ? __lock_is_held+0xa0/0x160 [ 158.770002] __qdisc_run+0x39e/0xfc0 [ 158.773673] ? _raw_spin_unlock+0x29/0x40 [ 158.777781] ? pfifo_fast_enqueue+0x24b/0x3e0 [ 158.782191] ? sch_direct_xmit+0x5d0/0x5d0 [ 158.786372] ? pfifo_fast_dequeue+0x670/0x670 [ 158.790818] ? __dev_queue_xmit+0x172/0x1770 [ 158.795195] ? preempt_count_sub+0xf/0xd0 [ 158.799313] __dev_queue_xmit+0x410/0x1770 [ 158.803512] ? ___slab_alloc+0x605/0x930 [ 158.807525] ? ___slab_alloc+0x605/0x930 [ 158.811540] ? memcpy+0x34/0x50 [ 158.814768] ? netdev_pick_tx+0x1c0/0x1c0 [ 158.818895] ? __skb_clone+0x2fd/0x3d0 [ 158.822712] ? __copy_skb_header+0x270/0x270 [ 158.827079] ? rcu_read_lock_sched_held+0x93/0xa0 [ 158.831903] ? kmem_cache_alloc+0x344/0x4d0 [ 158.836199] ? skb_clone+0x123/0x230 [ 158.839869] ? skb_split+0x820/0x820 [ 158.843521] ? tcf_mirred+0x554/0x930 [act_mirred] [ 158.848407] tcf_mirred+0x554/0x930 [act_mirred] [ 158.853104] ? tcf_mirred_act_wants_ingress.part.2+0x10/0x10 [act_mirred] [ 158.860005] ? __lock_acquire+0x706/0x26e0 [ 158.864162] ? mark_lock+0x13d/0xb40 [ 158.867832] tcf_action_exec+0xcf/0x2a0 [ 158.871736] tcf_classify+0xfa/0x340 [ 158.875402] __netif_receive_skb_core+0x8e1/0x1c60 [ 158.880334] ? nf_ingress+0x500/0x500 [ 158.884059] ? process_backlog+0x347/0x4b0 [ 158.888241] ? lock_acquire+0xd8/0x320 [ 158.892050] ? process_backlog+0x1b6/0x4b0 [ 158.896228] ? process_backlog+0xc2/0x4b0 [ 158.900291] process_backlog+0xc2/0x4b0 [ 158.904210] net_rx_action+0x5cc/0x980 [ 158.908047] ? napi_complete_done+0x2c0/0x2c0 [ 158.912525] ? rcu_read_unlock+0x80/0x80 [ 158.916534] ? __lock_is_held+0x34/0x160 [ 158.920541] __do_softirq+0x1d4/0x9d2 [ 158.924308] ? trace_event_raw_event_irq_handler_exit+0x140/0x140 [ 158.930515] run_ksoftirqd+0x1d/0x40 [ 158.934152] smpboot_thread_fn+0x32b/0x690 [ 158.938299] ? sort_range+0x20/0x20 [ 158.941842] ? preempt_count_sub+0xf/0xd0 [ 158.945940] ? schedule+0x5b/0x140 [ 158.949412] kthread+0x206/0x300 [ 158.952689] ? sort_range+0x20/0x20 [ 158.956249] ? kthread_stop+0x570/0x570 [ 158.960164] ret_from_fork+0x3a/0x50 [ 158.963823] Code: 14 3e ff 8b 4b 78 55 4d 89 f9 41 56 41 55 48 c7 c7 a0 cf db 82 41 54 44 8b 44 24 2c 48 8b 54 24 30 48 8b 74 24 20 e8 16 94 13 ff <0f> 0b 48 c7 c7 60 8e 1f 85 48 83 c4 20 e8 55 ef a6 ff 89 74 24 [ 158.983235] RIP: skb_panic+0xc3/0x100 RSP: ffff8801d3f27110 [ 158.988935] ---[ end trace 5af56ee845aa6cc8 ]--- [ 158.993641] Kernel panic - not syncing: Fatal exception in interrupt [ 159.000176] Kernel Offset: disabled [ 159.003767] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Reproducer: ip link add h1 type veth peer name swp1 ip link add h3 type veth peer name swp3 ip link set dev h1 up ip address add 192.0.2.1/28 dev h1 ip link add dev vh3 type vrf table 20 ip link set dev h3 master vh3 ip link set dev vh3 up ip link set dev h3 up ip link set dev swp3 up ip address add dev swp3 2001:db8:2::1/64 ip link set dev swp1 up tc qdisc add dev swp1 clsact ip link add name gt6 type ip6gretap \ local 2001:db8:2::1 remote 2001:db8:2::2 ip link set dev gt6 up sleep 1 tc filter add dev swp1 ingress pref 1000 matchall skip_hw \ action mirred egress mirror dev gt6 ping -I h1 192.0.2.2 Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Petr Machata <petrm@mellanox.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17selftests/bpf: check return value of fopen in test_verifier.cJesper Dangaard Brouer
Commit 0a6748740368 ("selftests/bpf: Only run tests if !bpf_disabled") forgot to check return value of fopen. This caused some confusion, when running test_verifier (from tools/testing/selftests/bpf/) on an older kernel (< v4.4) as it will simply seqfault. This fix avoids the segfault and prints an error, but allow program to continue. Given the sysctl was introduced in 1be7f75d1668 ("bpf: enable non-root eBPF programs"), we know that the running kernel cannot support unpriv, thus continue with unpriv_disabled = true. Fixes: 0a6748740368 ("selftests/bpf: Only run tests if !bpf_disabled") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-17erspan: fix invalid erspan version.William Tu
ERSPAN only support version 1 and 2. When packets send to an erspan device which does not have proper version number set, drop the packet. In real case, we observe multicast packets sent to the erspan pernet device, erspan0, which does not have erspan version configured. Reported-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17x86/apic/x2apic: Initialize cluster ID properlyThomas Gleixner
Rick bisected a regression on large systems which use the x2apic cluster mode for interrupt delivery to the commit wich reworked the cluster management. The problem is caused by a missing initialization of the clusterid field in the shared cluster data structures. So all structures end up with cluster ID 0 which only allows sharing between all CPUs which belong to cluster 0. All other CPUs with a cluster ID > 0 cannot share the data structure because they cannot find existing data with their cluster ID. This causes malfunction with IPIs because IPIs are sent to the wrong cluster and the caller waits for ever that the target CPU handles the IPI. Add the missing initialization when a upcoming CPU is the first in a cluster so that the later booting CPUs can find the data and share it for proper operation. Fixes: 023a611748fd ("x86/apic/x2apic: Simplify cluster management") Reported-by: Rick Warner <rick@microway.com> Bisected-by: Rick Warner <rick@microway.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Rick Warner <rick@microway.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1805171418210.1947@nanos.tec.linutronix.de
2018-05-17Merge branch 'ibmvnic-Fix-bugs-and-memory-leaks'David S. Miller
Thomas Falcon says: ==================== ibmvnic: Fix bugs and memory leaks This is a small patch series fixing up some bugs and memory leaks in the ibmvnic driver. The first fix frees up previously allocated memory that should be freed in case of an error. The second fixes a reset case that was failing due to TX/RX queue IRQ's being erroneously disabled without being enabled again. The final patch fixes incorrect reallocated of statistics buffers during a device reset, resulting in loss of statistics information and a memory leak. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17ibmvnic: Fix statistics buffers memory leakThomas Falcon
Move initialization of statistics buffers from ibmvnic_init function into ibmvnic_probe. In the current state, ibmvnic_init will be called again during a device reset, resulting in the allocation of new buffers without freeing the old ones. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17ibmvnic: Fix non-fatal firmware error resetThomas Falcon
It is not necessary to disable interrupt lines here during a reset to handle a non-fatal firmware error. Move that call within the code block that handles the other cases that do require interrupts to be disabled and re-enabled. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17ibmvnic: Free coherent DMA memory if FW map failedThomas Falcon
If the firmware map fails for whatever reason, remember to free up the memory after. Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net/ipv4: Initialize proto and ports in flow structDavid Ahern
Updating the FIB tracepoint for the recent change to allow rules using the protocol and ports exposed a few places where the entries in the flow struct are not initialized. For __fib_validate_source add the call to fib4_rules_early_flow_dissect since it is invoked for the input path. For netfilter, add the memset on the flow struct to avoid future problems like this. In ip_route_input_slow need to set the fields if the skb dissection does not happen. Fixes: bfff4862653b ("net: fib_rules: support for match on ip_proto, sport and dport") Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tls: don't use stack memory in a scatterlistMatt Mullins
scatterlist code expects virt_to_page() to work, which fails with CONFIG_VMAP_STACK=y. Fixes: c46234ebb4d1e ("tls: RX path for ktls") Signed-off-by: Matt Mullins <mmullins@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: - ARM/ARM64 locking fixes - x86 fixes: PCID, UMIP, locking - improved support for recent Windows version that have a 2048 Hz APIC timer - rename KVM_HINTS_DEDICATED CPUID bit to KVM_HINTS_REALTIME - better behaved selftests * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIME KVM: arm/arm64: VGIC/ITS save/restore: protect kvm_read_guest() calls KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock KVM: arm/arm64: VGIC/ITS: Promote irq_lock() in update_affinity KVM: arm/arm64: Properly protect VGIC locks from IRQs KVM: X86: Lower the default timer frequency limit to 200us KVM: vmx: update sec exec controls for UMIP iff emulating UMIP kvm: x86: Suppress CR3_PCID_INVD bit only when PCIDs are enabled KVM: selftests: exit with 0 status code when tests cannot be run KVM: hyperv: idr_find needs RCU protection x86: Delay skip of emulated hypercall instruction KVM: Extend MAX_IRQ_ROUTES to 4096 for all archs
2018-05-17Merge tag 'sound-4.17-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "We have a core fix in the compat code for covering a potential race (double references), but it's a very minor change. The rest are all small device-specific quirks, as well as a correction of the new UAC3 support code" * tag 'sound-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: usb-audio: Use Class Specific EP for UAC3 devices. ALSA: hda/realtek - Clevo P950ER ALC1220 Fixup ALSA: usb: mixer: volume quirk for CM102-A+/102S+ ALSA: hda: Add Lenovo C50 All in one to the power_save blacklist ALSA: control: fix a redundant-copy issue
2018-05-17kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIMEMichael S. Tsirkin
KVM_HINTS_DEDICATED seems to be somewhat confusing: Guest doesn't really care whether it's the only task running on a host CPU as long as it's not preempted. And there are more reasons for Guest to be preempted than host CPU sharing, for example, with memory overcommit it can get preempted on a memory access, post copy migration can cause preemption, etc. Let's call it KVM_HINTS_REALTIME which seems to better match what guests expect. Also, the flag most be set on all vCPUs - current guests assume this. Note so in the documentation. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-05-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Martin Schwidefsky: - a fix for the vfio ccw translation code - update an incorrect email address in the MAINTAINERS file - fix a division by zero oops in the cpum_sf code found by trinity - two fixes for the error handling of the qdio code - several spectre related patches to convert all left-over indirect branches in the kernel to expoline branches - update defconfigs to avoid warnings due to the netfilter Kconfig changes - avoid several compiler warnings in the kexec_file code for s390 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/qdio: don't release memory in qdio_setup_irq() s390/qdio: fix access to uninitialized qdio_q fields s390/cpum_sf: ensure sample frequency of perf event attributes is non-zero s390: use expoline thunks in the BPF JIT s390: extend expoline to BC instructions s390: remove indirect branch from do_softirq_own_stack s390: move spectre sysfs attribute code s390/kernel: use expoline for indirect branches s390/ftrace: use expoline for indirect branches s390/lib: use expoline for indirect branches s390/crc32-vx: use expoline for indirect branches s390: move expoline assembler macros to a header vfio: ccw: fix cleanup if cp_prefetch fails s390/kexec_file: add declaration of purgatory related globals s390: update defconfigs MAINTAINERS: update s390 zcrypt maintainers email address
2018-05-17Merge tag 'selinux-pr-20180516' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull SELinux fixes from Paul Moore: "A small pull request to fix a few regressions in the SELinux/SCTP code with applications that call bind() with AF_UNSPEC/INADDR_ANY. The individual commit descriptions have more information, but the commits themselves should be self explanatory" * tag 'selinux-pr-20180516' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinux: correctly handle sa_family cases in selinux_sctp_bind_connect() selinux: fix address family in bind() and connect() to match address/port selinux: add AF_UNSPEC and INADDR_ANY checks to selinux_socket_bind()
2018-05-17proc: do not access cmdline nor environ from file-backed areasWilly Tarreau
proc_pid_cmdline_read() and environ_read() directly access the target process' VM to retrieve the command line and environment. If this process remaps these areas onto a file via mmap(), the requesting process may experience various issues such as extra delays if the underlying device is slow to respond. Let's simply refuse to access file-backed areas in these functions. For this we add a new FOLL_ANON gup flag that is passed to all calls to access_remote_vm(). The code already takes care of such failures (including unmapped areas). Accesses via /proc/pid/mem were not changed though. This was assigned CVE-2018-1120. Note for stable backports: the patch may apply to kernels prior to 4.11 but silently miss one location; it must be checked that no call to access_remote_vm() keeps zero as the last argument. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-17bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=nColy Li
Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()") returns the return value of debugfs_create_dir() to bcache_init(). When CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes bcache_init() failedi. This patch makes bch_debug_init() always returns 0 if CONFIG_DEBUG_FS=n, so bcache can continue to work for the kernels which don't have debugfs enanbled. Changelog: v4: Add Acked-by from Kent Overstreet. v3: Use IS_ENABLED(CONFIG_DEBUG_FS) to replace #ifdef DEBUG_FS. v2: Remove a warning information v1: Initial version. Fixes: Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()") Cc: stable@vger.kernel.org Signed-off-by: Coly Li <colyli@suse.de> Reported-by: Massimo B. <massimo.b@gmx.net> Reported-by: Kai Krakow <kai@kaishome.de> Tested-by: Kai Krakow <kai@kaishome.de> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-18powerpc/powernv: Fix NVRAM sleep in invalid context when crashingNicholas Piggin
Similarly to opal_event_shutdown, opal_nvram_write can be called in the crash path with irqs disabled. Special case the delay to avoid sleeping in invalid context. Fixes: 3b8070335f75 ("powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops") Cc: stable@vger.kernel.org # v3.2 Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-17MAINTAINERS: add entry for STM32 I2C driverPierre-Yves MORDRET
Add I2C/SMBUS Driver entry for STM32 family from ST Microelectronics. Signed-off-by: Pierre-Yves MORDRET <pierre-yves.mordret@st.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-05-17btrfs: fix crash when trying to resume balance without the resume flagAnand Jain
We set the BTRFS_BALANCE_RESUME flag in the btrfs_recover_balance() only, which isn't called during the remount. So when resuming from the paused balance we hit the bug: kernel: kernel BUG at fs/btrfs/volumes.c:3890! :: kernel: balance_kthread+0x51/0x60 [btrfs] kernel: kthread+0x111/0x130 :: kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8 Reproducer: On a mounted filesystem: btrfs balance start --full-balance /btrfs btrfs balance pause /btrfs mount -o remount,ro /dev/sdb /btrfs mount -o remount,rw /dev/sdb /btrfs To fix this set the BTRFS_BALANCE_RESUME flag in btrfs_resume_balance_async(). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-17btrfs: Fix delalloc inodes invalidation during transaction abortNikolay Borisov
When a transaction is aborted btrfs_cleanup_transaction is called to cleanup all the various in-flight bits and pieces which migth be active. One of those is delalloc inodes - inodes which have dirty pages which haven't been persisted yet. Currently the process of freeing such delalloc inodes in exceptional circumstances such as transaction abort boiled down to calling btrfs_invalidate_inodes whose sole job is to invalidate the dentries for all inodes related to a root. This is in fact wrong and insufficient since such delalloc inodes will likely have pending pages or ordered-extents and will be linked to the sb->s_inode_list. This means that unmounting a btrfs instance with an aborted transaction could potentially lead inodes/their pages visible to the system long after their superblock has been freed. This in turn leads to a "use-after-free" situation once page shrink is triggered. This situation could be simulated by running generic/019 which would cause such inodes to be left hanging, followed by generic/176 which causes memory pressure and page eviction which lead to touching the freed super block instance. This situation is additionally detected by the unmount code of VFS with the following message: "VFS: Busy inodes after unmount of Self-destruct in 5 seconds. Have a nice day..." Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree)); in free_fs_root for the same reason. This patch aims to rectify the sitaution by doing the following: 1. Change btrfs_destroy_delalloc_inodes so that it calls invalidate_inode_pages2 for every inode on the delalloc list, this ensures that all the pages of the inode are released. This function boils down to calling btrfs_releasepage. During test I observed cases where inodes on the delalloc list were having an i_count of 0, so this necessitates using igrab to be sure we are working on a non-freed inode. 2. Since calling btrfs_releasepage might queue delayed iputs move the call out to btrfs_cleanup_transaction in btrfs_error_commit_super before calling run_delayed_iputs for the last time. This is necessary to ensure that delayed iputs are run. Note: this patch is tagged for 4.14 stable but the fix applies to older versions too but needs to be backported manually due to conflicts. CC: stable@vger.kernel.org # 4.14.x: 2b8773313494: btrfs: Split btrfs_del_delalloc_inode into 2 functions CC: stable@vger.kernel.org # 4.14.x Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment to igrab ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-17btrfs: Split btrfs_del_delalloc_inode into 2 functionsNikolay Borisov
This is in preparation of fixing delalloc inodes leakage on transaction abort. Also export the new function. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-17btrfs: fix reading stale metadata blocks after degraded raid1 mountsLiu Bo
If a btree block, aka. extent buffer, is not available in the extent buffer cache, it'll be read out from the disk instead, i.e. btrfs_search_slot() read_block_for_search() # hold parent and its lock, go to read child btrfs_release_path() read_tree_block() # read child Unfortunately, the parent lock got released before reading child, so commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had used 0 as parent transid to read the child block. It forces read_tree_block() not to check if parent transid is different with the generation id of the child that it reads out from disk. A simple PoC is included in btrfs/124, 0. A two-disk raid1 btrfs, 1. Right after mkfs.btrfs, block A is allocated to be device tree's root. 2. Mount this filesystem and put it in use, after a while, device tree's root got COW but block A hasn't been allocated/overwritten yet. 3. Umount it and reload the btrfs module to remove both disks from the global @fs_devices list. 4. mount -odegraded dev1 and write some data, so now block A is allocated to be a leaf in checksum tree. Note that only dev1 has the latest metadata of this filesystem. 5. Umount it and mount it again normally (with both disks), since raid1 can pick up one disk by the writer task's pid, if btrfs_search_slot() needs to read block A, dev2 which does NOT have the latest metadata might be read for block A, then we got a stale block A. 6. As parent transid is not checked, block A is marked as uptodate and put into the extent buffer cache, so the future search won't bother to read disk again, which means it'll make changes on this stale one and make it dirty and flush it onto disk. To avoid the problem, parent transid needs to be passed to read_tree_block(). In order to get a valid parent transid, we need to hold the parent's lock until finishing reading child. This patch needs to be slightly adapted for stable kernels, the &first_key parameter added to read_tree_block() is from 4.16+ (581c1760415c4). The fix is to replace 0 by 'gen'. Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-17btrfs: property: Set incompat flag if lzo/zstd compression is setMisono Tomohiro
Incompat flag of LZO/ZSTD compression should be set at: 1. mount time (-o compress/compress-force) 2. when defrag is done 3. when property is set Currently 3. is missing and this commit adds this. This could lead to a filesystem that uses ZSTD but is not marked as such. If a kernel without a ZSTD support encounteres a ZSTD compressed extent, it will handle that but this could be confusing to the user. Typically the filesystem is mounted with the ZSTD option, but the discrepancy can arise when a filesystem is never mounted with ZSTD and then the property on some file is set (and some new extents are written). A simple mount with -o compress=zstd will fix that up on an unpatched kernel. Same goes for LZO, but this has been around for a very long time (2.6.37) so it's unlikely that a pre-LZO kernel would be used. Fixes: 5c1aab1dd544 ("btrfs: Add zstd support") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add user visible impact ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-17Btrfs: fix duplicate extents after fsync of file with prealloc extentsFilipe Manana
In commit 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay"), on fsync, we started to always log all prealloc extents beyond an inode's i_size in order to avoid losing them after a power failure. However under some cases this can lead to the log replay code to create duplicate extent items, with different lengths, in the extent tree. That happens because, as of that commit, we can now log extent items based on extent maps that are not on the "modified" list of extent maps of the inode's extent map tree. Logging extent items based on extent maps is used during the fast fsync path to save time and for this to work reliably it requires that the extent maps are not merged with other adjacent extent maps - having the extent maps in the list of modified extents gives such guarantee. Consider the following example, captured during a long run of fsstress, which illustrates this problem. We have inode 271, in the filesystem tree (root 5), for which all of the following operations and discussion apply to. A buffered write starts at offset 312391 with a length of 933471 bytes (end offset at 1245862). At this point we have, for this inode, the following extent maps with the their field values: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 376832, block_start 1106399232, block_len 376832, orig_block_len 376832 em C, start 417792, orig_start 417792, len 782336, block_start 18446744073709551613, block_len 0, orig_block_len 0 em D, start 1200128, orig_start 1200128, len 835584, block_start 1106776064, block_len 835584, orig_block_len 835584 em E, start 2035712, orig_start 2035712, len 245760, block_start 1107611648, block_len 245760, orig_block_len 245760 Extent map A corresponds to a hole and extent maps D and E correspond to preallocated extents. Extent map D ends where extent map E begins (1106776064 + 835584 = 1107611648), but these extent maps were not merged because they are in the inode's list of modified extent maps. An fsync against this inode is made, which triggers the fast path (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback of the data previously written using buffered IO, and when the respective ordered extent finishes, btrfs_drop_extents() is called against the (aligned) range 311296..1249279. This causes a split of extent map D at btrfs_drop_extent_cache(), replacing extent map D with a new extent map D', also added to the list of modified extents, with the following values: em D', start 1249280, orig_start of 1200128, block_start 1106825216 (= 1106776064 + 1249280 - 1200128), orig_block_len 835584, block_len 786432 (835584 - (1249280 - 1200128)) Then, during the fast fsync, btrfs_log_changed_extents() is called and extent maps D' and E are removed from the list of modified extents. The flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged clear_em_logging() is called on each of them, and that makes extent map E to be merged with extent map D' (try_merge_map()), resulting in D' being deleted and E adjusted to: em E, start 1249280, orig_start 1200128, len 1032192, block_start 1106825216, block_len 1032192, orig_block_len 245760 A direct IO write at offset 1847296 and length of 360448 bytes (end offset at 2207744) starts, and at that moment the following extent maps exist for our inode: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 270336, block_start 1106399232, block_len 270336, orig_block_len 376832 em C, start 311296, orig_start 311296, len 937984, block_start 1112842240, block_len 937984, orig_block_len 937984 em E (prealloc), start 1249280, orig_start 1200128, len 1032192, block_start 1106825216, block_len 1032192, orig_block_len 245760 The dio write results in drop_extent_cache() being called twice. The first time for a range that starts at offset 1847296 and ends at offset 2035711 (length of 188416), which results in a double split of extent map E, replacing it with two new extent maps: em F, start 1249280, orig_start 1200128, block_start 1106825216, block_len 598016, orig_block_len 598016 em G, start 2035712, orig_start 1200128, block_start 1107611648, block_len 245760, orig_block_len 1032192 It also creates a new extent map that represents a part of the requested IO (through create_io_em()): em H, start 1847296, len 188416, block_start 1107423232, block_len 188416 The second call to drop_extent_cache() has a range with a start offset of 2035712 and end offset of 2207743 (length of 172032). This leads to replacing extent map G with a new extent map I with the following values: em I, start 2207744, orig_start 1200128, block_start 1107783680, block_len 73728, orig_block_len 1032192 It also creates a new extent map that represents the second part of the requested IO (through create_io_em()): em J, start 2035712, len 172032, block_start 1107611648, block_len 172032 The dio write set the inode's i_size to 2207744 bytes. After the dio write the inode has the following extent maps: em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613, block_len 0, orig_block_len 0 em B, start 40960, orig_start 40960, len 270336, block_start 1106399232, block_len 270336, orig_block_len 376832 em C, start 311296, orig_start 311296, len 937984, block_start 1112842240, block_len 937984, orig_block_len 937984 em F, start 1249280, orig_start 1200128, len 598016, block_start 1106825216, block_len 598016, orig_block_len 598016 em H, start 1847296, orig_start 1200128, len 188416, block_start 1107423232, block_len 188416, orig_block_len 835584 em J, start 2035712, orig_start 2035712, len 172032, block_start 1107611648, block_len 172032, orig_block_len 245760 em I, start 2207744, orig_start 1200128, len 73728, block_start 1107783680, block_len 73728, orig_block_len 1032192 Now do some change to the file, like adding a xattr for example and then fsync it again. This triggers a fast fsync path, and as of commit 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay"), we use the extent map I to log a file extent item because it's a prealloc extent and it starts at an offset matching the inode's i_size. However when we log it, we create a file extent item with a value for the disk byte location that is wrong, as can be seen from the following output of "btrfs inspect-internal dump-tree": item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53 generation 22 type 2 (prealloc) prealloc data disk byte 1106776064 nr 1032192 prealloc data offset 1007616 nr 73728 Here the disk byte value corresponds to calculation based on some fields from the extent map I: 1106776064 = block_start (1107783680) - 1007616 (extent_offset) extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616 The disk byte value of 1106776064 clashes with disk byte values of the file extent items at offsets 1249280 and 1847296 in the fs tree: item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53 generation 20 type 2 (prealloc) prealloc data disk byte 1106776064 nr 835584 prealloc data offset 49152 nr 598016 item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53 generation 20 type 1 (regular) extent data disk byte 1106776064 nr 835584 extent data offset 647168 nr 188416 ram 835584 extent compression 0 (none) item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53 generation 20 type 1 (regular) extent data disk byte 1107611648 nr 245760 extent data offset 0 nr 172032 ram 245760 extent compression 0 (none) item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53 generation 20 type 2 (prealloc) prealloc data disk byte 1107611648 nr 245760 prealloc data offset 172032 nr 73728 Instead of the disk byte value of 1106776064, the value of 1107611648 should have been logged. Also the data offset value should have been 172032 and not 1007616. After a log replay we end up getting two extent items in the extent tree with different lengths, one of 835584, which is correct and existed before the log replay, and another one of 1032192 which is wrong and is based on the logged file extent item: item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53 refs 2 gen 15 flags DATA extent data backref root 5 objectid 271 offset 1200128 count 2 item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53 refs 1 gen 22 flags DATA extent data backref root 5 objectid 271 offset 1200128 count 1 Obviously this leads to many problems and a filesystem check reports many errors: (...) checking extents Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1 extent item 1106776064 has multiple extent items ref mismatch on [1106776064 835584] extent item 2, found 3 Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680 Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70 Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192 backpointer mismatch on [1106776064 835584] checking free space cache block group 1103101952 has wrong amount of free space failed to load free space cache for block group 1103101952 checking fs roots (...) So fix this by logging the prealloc extents beyond the inode's i_size based on searches in the subvolume tree instead of the extent maps. Fixes: 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay") CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-17dmaengine: qcom: bam_dma: check if the runtime pm enabledSrinivas Kandagatla
Disabling pm runtime at probe is not sufficient to get BAM working on remotely controller instances. pm_runtime_get_sync() would return -EACCES in such cases. So check if runtime pm is enabled before returning error from bam functions. Fixes: 5b4a68952a89 ("dmaengine: qcom: bam_dma: disable runtime pm on remote controlled") Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Signed-off-by: Vinod Koul <vkoul@kernel.org>
2018-05-17Merge branch 'vmwgfx-fixes-4.17' of ↵Dave Airlie
git://people.freedesktop.org/~thomash/linux into drm-fixes A single fix for a recent regression. * 'vmwgfx-fixes-4.17' of git://people.freedesktop.org/~thomash/linux: drm/vmwgfx: Set dmabuf_size when vmw_dmabuf_init is successful
2018-05-17Merge tag 'drm-misc-fixes-2018-05-16' of ↵Dave Airlie
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes - core: Fix regression in dev node offsets (Haneen) - vc4: Fix memory leak on driver close (Eric) - dumb-buffers: Prevent overflow in DIV_ROUND_UP() (Dan) Cc: Haneen Mohammed <hamohammed.sa@gmail.com> Cc: Eric Anholt <eric@anholt.net> Cc: Dan Carpenter <dan.carpenter@oracle.com> * tag 'drm-misc-fixes-2018-05-16' of git://anongit.freedesktop.org/drm/drm-misc: drm/dumb-buffers: Integer overflow in drm_mode_create_ioctl() drm/vc4: Fix leak of the file_priv that stored the perfmon. drm: Match sysfs name in link removal to link creation
2018-05-16Merge tag 'trace-v4.17-rc4-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Some of the ftrace internal events use a zero for a data size of a field event. This is increasingly important for the histogram trigger work that is being extended. While auditing trace events, I found that a couple of the xen events were used as just marking that a function was called, by creating a static array of size zero. This can play havoc with the tracing features if these events are used, because a zero size of a static array is denoted as a special nul terminated dynamic array (this is what the trace_marker code uses). But since the xen events have no size, they are not nul terminated, and unexpected results may occur. As trace events were never intended on being a marker to denote that a function was hit or not, especially since function tracing and kprobes can trivially do the same, the best course of action is to simply remove these events" * tag 'trace-v4.17-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing/x86/xen: Remove zero data size trace events trace_xen_mmu_flush_tlb{_all}
2018-05-16tuntap: fix use after free during releaseJason Wang
After commit b196d88aba8a ("tun: fix use after free for ptr_ring") we need clean up tx ring during release(). But unfortunately, it tries to do the cleanup blindly after socket were destroyed which will lead another use-after-free. Fix this by doing the cleanup before dropping the last reference of the socket in __tun_detach(). Reported-by: Andrei Vagin <avagin@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Fixes: b196d88aba8a ("tun: fix use after free for ptr_ring") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16Merge branch 'qed-LL2-fixes'David S. Miller
Michal Kalderon says: ==================== qed: LL2 fixes This series fixes some issues in ll2 related to synchronization and resource freeing ==================== Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16qed: Fix LL2 race during connection terminateMichal Kalderon
Stress on qedi/qedr load unload lead to list_del corruption. This is due to ll2 connection terminate freeing resources without verifying that no more ll2 processing will occur. This patch unregisters the ll2 status block before terminating the connection to assure this race does not occur. Fixes: 1d6cff4fca4366 ("qed: Add iSCSI out of order packet handling") Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16qed: Fix possibility of list corruption during rmmod flowsMichal Kalderon
The ll2 flows of flushing the txq/rxq need to be synchronized with the regular fp processing. Caused list corruption during load/unload stress tests. Fixes: 0a7fb11c23c0f ("qed: Add Light L2 support") Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16qed: LL2 flush isles when connection is closedMichal Kalderon
Driver should free all pending isles once it gets a FLUSH cqe from FW. Part of iSCSI out of order flow. Fixes: 1d6cff4fca4366 ("qed: Add iSCSI out of order packet handling") Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net/sched: fix refcnt leak in the error path of tcf_vlan_init()Davide Caratti
Similarly to what was done with commit a52956dfc503 ("net sched actions: fix refcnt leak in skbmod"), fix the error path of tcf_vlan_init() to avoid refcnt leaks when wrong value of TCA_VLAN_PUSH_VLAN_PROTOCOL is given. Fixes: 5026c9b1bafc ("net sched: vlan action fix late binding") CC: Roman Mashak <mrv@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net: 8390: ne: Fix accidentally removed RBTX4927 supportGeert Uytterhoeven
The configuration settings for RBTX4927 were accidentally removed, leading to a silently broken network interface. Re-add the missing settings to fix this. Fixes: 8eb97ff5a4ec941d ("net: 8390: remove m32r specific bits") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16Merge branch 'dsa-bcm_sf2-CFP-fixes'David S. Miller
Florian Fainelli says: ==================== net: dsa: bcm_sf2: CFP fixes This patch series fixes a number of usability issues with the SF2 Compact Field Processor code: - we would not be properly bound checking the location when we let the kernel automatically place rules with RX_CLS_LOC_ANY - when using IPv6 rules and user space specifies a location identifier we would be off by one in what the chain ID (within the Broadcom tag) indicates - it would be possible to delete one of the two slices of an IPv6 while leaving the other one programming leading to various problems ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net: dsa: bcm_sf2: Fix IPv6 rule half deletionFlorian Fainelli
It was possible to delete only one half of an IPv6, which would leave the second half still programmed and possibly in use. Instead of checking for the unused bitmap, we need to check the unique bitmap, and refuse any deletion that does not match that criteria. We also need to move that check from bcm_sf2_cfp_rule_del_one() into its caller: bcm_sf2_cfp_rule_del() otherwise we would not be able to delete second halves anymore that would not pass the first test. Fixes: ba0696c22e7c ("net: dsa: bcm_sf2: Add support for IPv6 CFP rules") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>