path: root/kernel
Age  Commit message  (Author)
2020-12-13  Merge tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 fixes from Thomas Gleixner:
 "A set of x86 and membarrier fixes:

   - Correct a few problems in the x86 and the generic membarrier implementation. Small corrections for assumptions about visibility which have turned out not to be true.

   - Make the PAT bits for memory encryption correct vs 4K and 2M/1G page table entries as they are at a different location.

   - Fix a concurrency issue in the local bandwidth readout of resource control leading to incorrect values.

   - Fix the ordering of allocating a vector for an interrupt. The order failed to respect the provided cpumask when the first attempt of allocating node local in the mask fails. It then tries the node instead of trying the full provided mask first. This leads to erroneous error messages and breaks the (user) supplied affinity request. Reorder it.

   - Make the INT3 padding detection in optprobe work correctly"

* tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kprobes: Fix optprobe to detect INT3 padding correctly
  x86/apic/vector: Fix ordering in vector assignment
  x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled
  x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
  membarrier: Execute SYNC_CORE on the calling thread
  membarrier: Explicitly sync remote cores when SYNC_CORE is requested
  membarrier: Add an actual barrier before rseq_preempt()
  x86/membarrier: Get rid of a dubious optimization
2020-12-12  kernel: remove checking for TIF_NOTIFY_SIGNAL  (Jens Axboe)
It's available everywhere now, no need to check or add dummy defines. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12  signal: kill JOBCTL_TASK_WORK  (Jens Axboe)
It's no longer used, get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12  task_work: remove legacy TWA_SIGNAL path  (Jens Axboe)
All archs now support TIF_NOTIFY_SIGNAL. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-11  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
xdp_return_frame_bulk() needs to pass a xdp_buff to __xdp_return(). strlcpy got converted to strscpy but here it makes no functional difference, so just keep the right code. Conflicts: net/netfilter/nf_tables_api.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11  tick/sched: Make jiffies update quick check more robust  (Thomas Gleixner)
The quick check in tick_do_update_jiffies64() whether jiffies need to be updated is not really correct under all circumstances and on all architectures, especially not on 32bit systems.

The quick check does:

    if (now < READ_ONCE(tick_next_period))
            return;

and the counterpart in the update is:

    WRITE_ONCE(tick_next_period, next_update_time);

This has two problems:

  1) On weakly ordered architectures there is no guarantee that the stores before the WRITE_ONCE() are visible, which means that other CPUs can operate on a stale jiffies value.

  2) On 32bit the store of tick_next_period, which is a u64, is split into two 32bit stores. If the first 32bit store advances tick_next_period far out and the second 32bit store is delayed (virt, NMI ...) then jiffies will become stale until the second 32bit store happens.

Address this by separating the handling for 32bit and 64bit.

On 64bit problem #1 is addressed by replacing READ_ONCE() / WRITE_ONCE() with smp_load_acquire() / smp_store_release().

On 32bit problem #2 is addressed by protecting the quick check with the jiffies sequence counter. The loads and stores can be plain because the sequence count mechanics provide the required barriers already.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/87czzpc02w.fsf@nanos.tec.linutronix.de
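Problem #2 above can be illustrated with a small model. This is only a sketch in Python, not kernel code: the helper names (split_u64, torn_store, quick_check_skips_update) and the sample values are made up for illustration, and it assumes the high 32-bit half happens to be stored first.

```python
MASK32 = 0xFFFFFFFF


def split_u64(value):
    """Return the (low, high) 32-bit halves of a 64-bit value."""
    return value & MASK32, (value >> 32) & MASK32


def torn_store(old, new, high_first=True):
    """Value another CPU could observe after only one 32-bit half of `new` landed."""
    old_lo, old_hi = split_u64(old)
    new_lo, new_hi = split_u64(new)
    if high_first:
        return (new_hi << 32) | old_lo
    return (old_hi << 32) | new_lo


def quick_check_skips_update(now, tick_next_period):
    """Model of the lockless quick check: True means 'jiffies need no update'."""
    return now < tick_next_period


# Hypothetical values chosen so that the update crosses a 32-bit boundary.
old_period = 0x00000000FFFFFFF0
new_period = 0x0000000100000010
now = new_period + 5  # an update is due once the full store is visible

intermediate = torn_store(old_period, new_period, high_first=True)

# With the complete store visible the check lets the update proceed ...
print(quick_check_skips_update(now, new_period))
# ... but the half-written value looks far in the future, so the update is skipped.
print(quick_check_skips_update(now, intermediate))
```

The half-written value is what makes jiffies go stale until the second store arrives, which is why the 32bit path is described as being protected by the jiffies sequence counter rather than relying on a single load.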
2020-12-11  bpf: Fix enum names for bpf_this_cpu_ptr() and bpf_per_cpu_ptr() helpers  (Andrii Nakryiko)
Remove the bpf_ prefix, which causes these helpers to be reported in the verifier dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(), respectively. Let's fix it as long as it is still possible before UAPI freezes on these helpers. Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-11  elfcore: fix building with clang  (Arnd Bergmann)
kernel/elfcore.c only contains weak symbols, which triggers a bug with clang in combination with recordmcount: Cannot find symbol for section 2: .text. kernel/elfcore.o: failed Move the empty stubs into linux/elfcore.h as inline functions. As only two architectures use these, just use the architecture specific Kconfig symbols to key off the declaration. Link: https://lkml.kernel.org/r/20201204165742.3815221-2-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Barret Rhoden <brho@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-11  x86,swiotlb: Adjust SWIOTLB bounce buffer size for SEV guests  (Ashish Kalra)
For SEV, all DMA to and from guest has to use shared (un-encrypted) pages. SEV uses SWIOTLB to make this happen without requiring changes to device drivers. However, depending on the workload being run, the default 64MB of it might not be enough and it may run out of buffers to use for DMA, resulting in I/O errors and/or performance degradation for high I/O workloads. Adjust the default size of SWIOTLB for SEV guests using a percentage of the total memory available to guest for the SWIOTLB buffers. Adds a new sev_setup_arch() function which is invoked from setup_arch() and it calls into a new swiotlb generic code function swiotlb_adjust_size() to do the SWIOTLB buffer adjustment. v5 fixed build errors and warnings as Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Co-developed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2020-12-11  cpufreq: schedutil: Simplify sugov_update_next_freq()  (Rafael J. Wysocki)
Rearrange a conditional to make it more straightforward. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-11  genirq/affinity: Add irq_update_affinity_desc()  (John Garry)
Add a function to allow the affinity of an interrupt to be switched to managed, such that interrupts allocated for platform devices may be managed.

This new interface has certain limitations, and attempts to use it in the following circumstances will fail:
  - The kernel is configured for generic IRQ reservation mode (config GENERIC_IRQ_RESERVATION_MODE). The reason is that it could conflict with managed vs. non-managed interrupt accounting.
  - The interrupt is already started, which should not be the case during init.
  - The interrupt is already configured as managed, which means double init.

Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/1606905417-183214-2-git-send-email-john.garry@huawei.com
2020-12-11  Revert "genirq: Add fasteoi IPI flow"  (Valentin Schneider)
handle_percpu_devid_fasteoi_ipi() has no more users, and handle_percpu_devid_irq() can do all that it was supposed to do. Get rid of it. This reverts commit c5e5ec033c4ab25c53f1fd217849e75deb0bf7bf. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20201109094121.29975-6-valentin.schneider@arm.com
2020-12-11  ntp: Consolidate the RTC update implementation  (Thomas Gleixner)
The code for the legacy RTC and the RTC class based update are pretty much the same. Consolidate the common parts into one function and just invoke the actual setter functions. For RTC class based devices the update code checks whether the offset is valid for the device, which is usually not the case for the first invocation. If it's not the same it stores the correct offset and lets the caller try again. That's not much different from the previous approach where the first invocation had a pretty low probability to actually hit the allowed window. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201206220542.355743355@linutronix.de
2020-12-11  ntp: Make the RTC sync offset less obscure  (Thomas Gleixner)
The current RTC set_offset_nsec value is not really intuitive to understand.

    tsched       twrite(t2.tv_sec - 1)       t2 (seconds increment)

The offset is calculated from twrite based on the assumption that t2 - twrite == 1s. That means for the MC146818 RTC the offset needs to be negative so that the write happens 500ms before t2.

It's easier to understand when the whole calculation is based on t2. That avoids negative offsets and the meaning is obvious:

    t2 - twrite:     The time defined by the chip when seconds increment after the write.
    twrite - tsched: The time for the transport to the point where the chip is updated.

    ==> set_offset_nsec = t2 - tsched
        ttransport      = twrite - tsched
        tRTCinc         = t2 - twrite
    ==> set_offset_nsec = ttransport + tRTCinc

tRTCinc is a chip property and can be obtained from the data sheet.

ttransport depends on how the RTC is connected. It is close to 0 for directly accessible RTCs. For RTCs behind a slow bus, e.g. i2c, it's the time required to send the update over the bus. This can be estimated or even calibrated, but that's a different problem.

Adjust the implementation and update comments accordingly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201206220542.263204937@linutronix.de
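The relation set_offset_nsec = ttransport + tRTCinc can be worked through numerically. The following Python sketch is illustrative only: the function names are invented here, the 500ms value for the MC146818 is taken from the message above, and the i2c figures (5ms transport, 1s increment delay) are hypothetical numbers, not data-sheet values.

```python
NSEC_PER_MSEC = 1_000_000
NSEC_PER_SEC = 1_000_000_000


def set_offset_nsec(ttransport_ns, trtcinc_ns):
    """Offset relative to t2: transport time plus the chip's increment delay."""
    return ttransport_ns + trtcinc_ns


def schedule_time(t2_ns, offset_ns):
    """tsched: when the update has to be started so the chip ticks over at t2."""
    return t2_ns - offset_ns


# Directly accessible MC146818: ttransport is close to 0, tRTCinc is 500ms.
mc146818_offset = set_offset_nsec(0, 500 * NSEC_PER_MSEC)

# Hypothetical RTC behind i2c: assumed 5ms transport, assumed 1s increment delay.
i2c_transport = 5 * NSEC_PER_MSEC
i2c_rtcinc = NSEC_PER_SEC
i2c_offset = set_offset_nsec(i2c_transport, i2c_rtcinc)

t2 = 100 * NSEC_PER_SEC                 # the second boundary being targeted
tsched = schedule_time(t2, i2c_offset)  # start of the update
twrite = tsched + i2c_transport         # moment the chip is actually written

print(mc146818_offset, i2c_offset, tsched, twrite)
```

Both offsets come out positive, which is the point of basing the calculation on t2 instead of twrite.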
2020-12-11  ntp, rtc: Move rtc_set_ntp_time() to ntp code  (Thomas Gleixner)
rtc_set_ntp_time() is not really RTC functionality as the code is just a user of RTC. Move it into the NTP code which allows further cleanups. Requested-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201206220542.166871172@linutronix.de
2020-12-11  ntp: Make the RTC synchronization more reliable  (Thomas Gleixner)
Miroslav reported that the periodic RTC synchronization in the NTP code fails more often than not to hit the specified update window.

The reason is that the code uses delayed_work to schedule the update, which needs to be in thread context as the underlying RTC might be connected via a slow bus, e.g. I2C. In the update function it verifies whether the current time is correct vs. the requirements of the underlying RTC.

But delayed_work is using the timer wheel for scheduling, which is inaccurate by design. Depending on the distance to the expiry the wheel gets less granular to allow batching and to avoid the cascading of the original timer wheel. See 500462a9de65 ("timers: Switch to a non-cascading wheel") and the code for further details.

The code already deals with this by splitting the 660 seconds period into a long 659 seconds timer and then retrying with a smaller delta.

But looking at the actual granularities of the timer wheel (which depend on the HZ configuration) the 659 seconds timer ends up in an outer wheel level and is affected by a worst case granularity of:

    HZ      Granularity
    1000    32s
     250    16s
     100    40s

So the initial timer can be already off by max 12.5%, which is not a big issue as the period of the sync is defined as ~11 minutes.

The fine grained second attempt schedules to the desired update point with a timer expiring less than a second from now. Depending on the actual delta and the HZ setting even the second attempt can end up in outer wheel levels which have a large enough granularity to make the correctness check fail.

As this is a fundamental property of the timer wheel there is no way to make this more accurate short of iterating in one-jiffy steps towards the update point.

Switch it to an hrtimer instead which schedules the actual update work. The hrtimer will expire precisely (max 1 jiffy delay when high resolution timers are not available). The actual scheduling delay of the work is the same as before.

The update is triggered from do_adjtimex() which is a bit racy but not much more racy than it was before:

    if (ntp_synced())
            queue_delayed_work(system_power_efficient_wq, &sync_work, 0);

which is racy when the work is currently executed and has not managed to reschedule itself.

This becomes now:

    if (ntp_synced() && !hrtimer_is_queued(&sync_hrtimer))
            queue_work(system_power_efficient_wq, &sync_work);

which is racy when the hrtimer has expired and the work is currently executed and has not yet managed to rearm the hrtimer.

Not a big problem as it just schedules work for nothing.

The new implementation has a safeguard in place to catch the case where the hrtimer is queued on entry to the work function and avoids an extra update attempt of the RTC that way.

Reported-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Miroslav Lichvar <mlichvar@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201206220542.062910520@linutronix.de
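The granularity figures quoted for the 659 seconds timer can be approximately reproduced with a small model of the non-cascading wheel. This is a sketch under stated assumptions: the constants (8x coarser per level, 64 buckets per level, 9 levels) are recalled from kernel/time/timer.c of that era and are not given in the commit message, and the function names are invented for this example.

```python
# Assumed wheel geometry; not stated in the commit message.
LVL_CLK_SHIFT = 3          # each level is 8 times coarser than the previous one
LVL_BITS = 6               # 64 buckets per level
LVL_SIZE = 1 << LVL_BITS
LVL_DEPTH = 9


def level_start(level):
    """First delta in jiffies that is queued in `level` (for level >= 1)."""
    return (LVL_SIZE - 1) << ((level - 1) * LVL_CLK_SHIFT)


def wheel_level(delta_jiffies):
    """Wheel level a timer expiring `delta_jiffies` from now ends up in."""
    for level in range(1, LVL_DEPTH):
        if delta_jiffies < level_start(level):
            return level - 1
    return LVL_DEPTH - 1


def granularity_seconds(delay_seconds, hz):
    """Worst-case granularity in seconds for a timer `delay_seconds` out."""
    level = wheel_level(delay_seconds * hz)
    return (1 << (level * LVL_CLK_SHIFT)) / hz


for hz in (1000, 250, 100):
    print(hz, wheel_level(659 * hz), granularity_seconds(659, hz))
```

Under these assumptions the model lands on the same whole-second granularities as the table above; the fractional part comes from the power-of-two bucket sizes.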
2020-12-11  sched/fair: Trivial correction of the newidle_balance() comment  (Barry Song)
idle_balance() has been renamed to newidle_balance(). To differentiate with nohz_idle_balance, it seems refining the comment will be helpful for the readers of the code. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201202220641.22752-1-song.bao.hua@hisilicon.com
2020-12-11  sched/fair: Clear SMT siblings after determining the core is not idle  (Mel Gorman)
The clearing of SMT siblings from the SIS mask before checking for an idle core is a small but unnecessary cost. Defer the clearing of the siblings until the scan moves to the next potential target. The cost of this was not measured as it is borderline noise but it should be self-evident. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201130144020.GS3371@techsingularity.net
2020-12-11  sched: Fix kernel-doc markup  (Mauro Carvalho Chehab)
Kernel-doc requires that kernel-doc markup be immediately below the function prototype, as otherwise it will rename it. So, move the sys_sched_yield() markup to the right place. Also fix the cpu_util() markup: kernel-doc markups should use this format: identifier - description Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/50cd6f460aeb872ebe518a8e9cfffda2df8bdb0a.1606823973.git.mchehab+huawei@kernel.org
2020-12-10  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds)
Pull networking fixes from David Miller:

  1) IPsec compat fixes, from Dmitry Safonov.
  2) Fix memory leak in xfrm_user_policy(). Fix from Yu Kuai.
  3) Fix polling in xsk sockets by using sk_poll_wait() instead of datagram_poll() which keys off of sk_wmem_alloc and such which xsk sockets do not update. From Xuan Zhuo.
  4) Missing init of rekey_data in cfg80211, from Sara Sharon.
  5) Fix destroy of timer before init, from Davide Caratti.
  6) Missing CRYPTO_CRC32 selects in ethernet driver Kconfigs, from Arnd Bergmann.
  7) Missing error return in rtm_to_fib_config() switch case, from Zhang Changzhong.
  8) Fix some src/dest address handling in vrf and add a testcase. From Stephen Suryaputra.
  9) Fix multicast handling in Seville switches driven by mscc-ocelot driver. From Vladimir Oltean.
 10) Fix proto value passed to skb delivery demux in udp, from Xin Long.
 11) HW pkt counters not reported correctly in enetc driver, from Claudiu Manoil.
 12) Fix deadlock in bridge, from Joseph Huang.
 13) Missing of_node_put() in dpaa2 driver, from Christophe JAILLET.
 14) Fix pid fetching in bpftool when there are a lot of results, from Andrii Nakryiko.
 15) Fix long timeouts in nft_dynset, from Pablo Neira Ayuso.
 16) Various stmmac fixes, from Fugang Duan.
 17) Fix null deref in tipc, from Cengiz Can.
 18) When mss is big, choose more reasonable rcvq_space in tcp, from Eric Dumazet.
 19) Revert a geneve change that likely isn't necessary, from Jakub Kicinski.
 20) Avoid premature rx buffer reuse in various Intel drivers, from Björn Töpel.
 21) Retain ECT bits during TOS reflection in tcp, from Wei Wang.
 22) Fix TSO deferral wrt. cwnd limiting in tcp, from Neal Cardwell.
 23) MPLS_OPT_LSE_LABEL attribute is 32, not 8 bits, from Guillaume Nault.
 24) Fix propagation of 32-bit signed bounds in bpf verifier and add test cases, from Alexei Starovoitov.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  selftests: fix poll error in udpgro.sh
  selftests/bpf: Fix "dubious pointer arithmetic" test
  selftests/bpf: Fix array access with signed variable test
  selftests/bpf: Add test for signed 32-bit bound check bug
  bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.
  MAINTAINERS: Add entry for Marvell Prestera Ethernet Switch driver
  net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower
  net/mlx4_en: Handle TX error CQE
  net/mlx4_en: Avoid scheduling restart task if it is already running
  tcp: fix cwnd-limited bug for TSO deferral where we send nothing
  net: flow_offload: Fix memory leak for indirect flow block
  tcp: Retain ECT bits for tos reflection
  ethtool: fix stack overflow in ethnl_parse_bitset()
  e1000e: fix S0ix flow to allow S0i3.2 subset entry
  ice: avoid premature Rx buffer reuse
  ixgbe: avoid premature Rx buffer reuse
  i40e: avoid premature Rx buffer reuse
  igb: avoid transmit queue timeout in xdp path
  igb: use xdp_do_flush
  igb: skb add metasize for xdp
  ...
2020-12-10  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (David S. Miller)
Alexei Starovoitov says: ==================== pull-request: bpf 2020-12-10 The following pull-request contains BPF updates for your *net* tree. We've added 21 non-merge commits during the last 12 day(s) which contain a total of 21 files changed, 163 insertions(+), 88 deletions(-). The main changes are: 1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei. 2) Fix ring_buffer__poll() return value, from Andrii. 3) Fix race in lwt_bpf, from Cong. 4) Fix test_offload, from Toke. 5) Various xsk fixes. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Thanks a lot! Also thanks to reporters, reviewers and testers of commits in this pull-request: Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-10  bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.  (Alexei Starovoitov)
The 64-bit signed bounds should not affect 32-bit signed bounds unless the verifier knows that upper 32-bits are either all 1s or all 0s. For example the register with smin_value==1 doesn't mean that s32_min_value is also equal to 1, since smax_value could be larger than 32-bit subregister can hold. The verifier refines the smax/s32_max return value from certain helpers in do_refine_retval_range(). Teach the verifier to recognize that smin/s32_min value is also bounded. When both smin and smax bounds fit into 32-bit subregister the verifier can propagate those bounds. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
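The rule described above (propagate only when both 64-bit signed bounds fit into the 32-bit subregister) can be modelled in a few lines. This Python sketch is not the verifier's code; the function names are invented and the logic is a simplified reading of the commit message.

```python
S32_MIN = -(1 << 31)
S32_MAX = (1 << 31) - 1


def fits_s32(value):
    """True if a 64-bit signed value is representable as a signed 32-bit value."""
    return S32_MIN <= value <= S32_MAX


def low32_signed(value):
    """Lower 32 bits of a value, interpreted as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low > S32_MAX else low


def propagate_s32_bounds(smin, smax, s32_min=S32_MIN, s32_max=S32_MAX):
    """Refine the 32-bit bounds only when BOTH 64-bit bounds fit into s32."""
    if fits_s32(smin) and fits_s32(smax):
        return max(s32_min, smin), min(s32_max, smax)
    return s32_min, s32_max


# smin == 1 alone says nothing about the subregister: 1 << 32 lies in the
# 64-bit range [1, (1 << 32) + 5], yet its lower 32 bits are 0, which is < 1.
print(low32_signed(1 << 32))
print(propagate_s32_bounds(1, (1 << 32) + 5))   # bounds stay unrefined
print(propagate_s32_bounds(1, 100))             # both fit, so they propagate
```

The first case is the one the commit message warns about; refining s32_min to 1 there would let the model claim a bound that a real value violates.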
2020-12-10  exec: Transform exec_update_mutex into a rw_semaphore  (Eric W. Biederman)
Recently syzbot reported[0] that there is a deadlock amongst the users of exec_update_mutex. The problematic lock ordering found by lockdep was:

    perf_event_open  (exec_update_mutex -> ovl_i_mutex)
    chown            (ovl_i_mutex       -> sb_writes)
    sendfile         (sb_writes         -> p->lock)
      by reading from a proc file and writing to overlayfs
    proc_pid_syscall (p->lock           -> exec_update_mutex)

While looking at possible solutions it occurred to me that all of the users and possible users involved only wanted the state of the given process to remain the same. They are all readers. The only writer is exec.

There is no reason for readers to block on each other. So fix this deadlock by transforming exec_update_mutex into a rw_semaphore named exec_update_lock that only exec takes for writing.

Cc: Jann Horn <jannh@google.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christopher Yeoh <cyeoh@au1.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex") [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
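The four orderings reported above form a cycle, which is what makes the deadlock possible. The Python sketch below only illustrates that observation with a generic depth-first search over "held -> wanted" edges; it is not how lockdep is implemented, and find_cycle is a name invented for this example.

```python
def find_cycle(edges):
    """Return one cycle in the 'held -> wanted' lock graph, or None."""
    graph = {}
    for held, wanted in edges:
        graph.setdefault(held, []).append(wanted)
        graph.setdefault(wanted, [])

    state = {}   # node -> "visiting" while on the DFS stack, "done" afterwards
    path = []    # current DFS stack, used to report the cycle

    def visit(node):
        state[node] = "visiting"
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = "done"
        return None

    for node in graph:
        if node not in state:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Lock orderings as listed in the report (held lock, lock taken next).
reported = [
    ("exec_update_mutex", "ovl_i_mutex"),   # perf_event_open
    ("ovl_i_mutex", "sb_writes"),           # chown
    ("sb_writes", "p->lock"),               # sendfile
    ("p->lock", "exec_update_mutex"),       # proc_pid_syscall
]

print(find_cycle(reported))
print(find_cycle(reported[:3]))  # without the last edge there is no cycle
```

The sketch treats every lock as exclusive. The fix described above works on a different axis: once the users only take the lock for reading, they no longer block each other, so the edges that close the cycle stop being blocking dependencies among them.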
2020-12-10  bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu  (Eric W. Biederman)
When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that switching them to helper functions instead will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct, which breaks posix locks, and they can result in fget_light having to fall back to fget, reducing system performance.

Using task_lookup_next_fd_rcu simplifies task_file_seq_get_next by moving the checking for the maximum file descriptor into the generic code, and by removing the need for capturing and releasing a reference on files_struct. As the reference count of files_struct no longer needs to be maintained, bpf_iter_seq_task_file_info can have its files member removed and task_file_seq_get_next no longer needs its fstruct argument.

The curr_fd local variable does need to become unsigned to be used with fnext_task. As curr_fd is assigned from and assigned to a u32, making curr_fd an unsigned int won't cause problems and might prevent them.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> v1: https://lkml.kernel.org/r/20200817220425.9389-11-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-16-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10  kcmp: In get_file_raw_ptr use task_lookup_fd_rcu  (Eric W. Biederman)
Modify get_file_raw_ptr to use task_lookup_fd_rcu. The helper task_lookup_fd_rcu does the work of taking the task lock and verifying that task->files != NULL and then calls files_lookup_fd_rcu. So use the helper to make a simpler implementation of get_file_raw_ptr. Acked-by: Cyrill Gorcunov <gorcunov@gmail.com> Link: https://lkml.kernel.org/r/20201120231441.29911-13-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10  file: Replace fcheck_files with files_lookup_fd_rcu  (Eric W. Biederman)
This change renames fcheck_files to files_lookup_fd_rcu. All of the remaining callers take the rcu_read_lock before calling this function so the _rcu suffix is appropriate. This change also tightens up the debug check to verify that all callers hold the rcu_read_lock. All callers that used to call files_check with the files->file_lock held have now been changed to call files_lookup_fd_locked. This change of name has helped remind me of which locks and which guarantees are in place helping me to catch bugs later in the patchset. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-9-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10  bpf: In bpf_task_fd_query use fget_task  (Eric W. Biederman)
Use the helper fget_task to simplify bpf_task_fd_query. As well as simplifying the code this removes one unnecessary increment of struct files_struct. This unnecessary increment of files_struct.count can result in exec unnecessarily unsharing files_struct and breaking posix locks, and it can result in fget_light having to fallback to fget reducing performance. This simplification comes from the observation that none of the callers of get_files_struct actually need to call get_files_struct that was made when discussing[1] exec and posix file locks. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> v1: https://lkml.kernel.org/r/20200817220425.9389-5-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-5-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10  kcmp: In kcmp_epoll_target use fget_task  (Eric W. Biederman)
Use the helper fget_task and simplify the code. As well as simplifying the code this removes one unnecessary increment of struct files_struct. This unnecessary increment of files_struct.count can result in exec unnecessarily unsharing files_struct and breaking posix locks, and it can result in fget_light having to fallback to fget reducing performance. Suggested-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> v1: https://lkml.kernel.org/r/20200817220425.9389-4-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-4-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10  exec: Simplify unshare_files  (Eric W. Biederman)
Now that exec no longer needs to return the unshared files to their previous value there is no reason to return displaced. Instead when unshare_fd creates a copy of the file table, call put_files_struct before returning from unshare_files. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-2-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-2-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-09  Input: gtco - remove driver  (Dmitry Torokhov)
The driver has its own HID descriptor parsing code, that had and still has several issues discovered by syzbot and other tools. Ideally we should move the driver over to the HID subsystem, so that it uses proven parsing code. However the devices in question are EOL, and GTCO is not willing to extend resources for that, so let's simply remove the driver. Note that our HID support has greatly improved over the last 10 years, we may also consider reverting 6f8d9e26e7de ("hid-core.c: Adds all GTCO CalComp Digitizers and InterWrite School Products to blacklist") and see if GTCO devices actually work with normal HID drivers. Link: https://lore.kernel.org/r/X8wbBtO5KidME17K@google.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2020-12-09  driver core: Add fwnode_init()  (Saravana Kannan)
There are multiple locations in the kernel where a struct fwnode_handle is initialized. Add fwnode_init() so that we have one way of initializing a fwnode_handle. Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20201121020232.908850-8-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-09  Merge remote-tracking branch 'arm64/for-next/fixes' into for-next/core  (Catalin Marinas)
* arm64/for-next/fixes: (26 commits) arm64: mte: fix prctl(PR_GET_TAGGED_ADDR_CTRL) if TCF0=NONE arm64: mte: Fix typo in macro definition arm64: entry: fix EL1 debug transitions arm64: entry: fix NMI {user, kernel}->kernel transitions arm64: entry: fix non-NMI kernel<->kernel transitions arm64: ptrace: prepare for EL1 irq/rcu tracking arm64: entry: fix non-NMI user<->kernel transitions arm64: entry: move el1 irq/nmi logic to C arm64: entry: prepare ret_to_user for function call arm64: entry: move enter_from_user_mode to entry-common.c arm64: entry: mark entry code as noinstr arm64: mark idle code as noinstr arm64: syscall: exit userspace before unmasking exceptions arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect() arm64: pgtable: Fix pte_accessible() ACPI/IORT: Fix doc warnings in iort.c arm64/fpsimd: add <asm/insn.h> to <asm/kprobes.h> to fix fpsimd build arm64: cpu_errata: Apply Erratum 845719 to KRYO2XX Silver arm64: proton-pack: Add KRYO2XX silver CPUs to spectre-v2 safe-list arm64: kpti: Add KRYO2XX gold/silver CPU cores to kpti safelist ... # Conflicts: # arch/arm64/include/asm/exception.h # arch/arm64/kernel/sdei.c
2020-12-09  Merge remote-tracking branch 'arm64/for-next/scs' into for-next/core  (Catalin Marinas)
* arm64/for-next/scs: arm64: sdei: Push IS_ENABLED() checks down to callee functions arm64: scs: use vmapped IRQ and SDEI shadow stacks scs: switch to vmapped shadow stacks
2020-12-09  perf: Break deadlock involving exec_update_mutex  (peterz@infradead.org)
Syzbot reported a lock inversion involving perf. The sore point being perf holding exec_update_mutex() for a very long time, specifically across a whole bunch of filesystem ops in pmu::event_init() (uprobes) and anon_inode_getfile(). This then inverts against procfs code trying to take exec_update_mutex. Move the permission checks later, such that we need to hold the mutex over less code. Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-12-09locking/rwsem: Remove reader optimistic spinningWaiman Long
Reader optimistic spinning is helpful when the reader critical section is short and there aren't that many readers around. It also improves the chance that a reader can get the lock as writer optimistic spinning disproportionally favors writers much more than readers. Since commit d3681e269fff ("locking/rwsem: Wake up almost all readers in wait queue"), all the waiting readers are woken up so that they can all get the read lock and run in parallel. When the number of contending readers is large, allowing reader optimistic spinning will likely cause reader fragmentation where multiple smaller groups of readers can get the read lock in a sequential manner separated by writers. That reduces reader parallelism. One possible way to address that drawback is to limit the number of readers (preferably one) that can do optimistic spinning. These readers act as representatives of all the waiting readers in the wait queue as they will wake up all those waiting readers once they get the lock. Alternatively, as reader optimistic lock stealing has already enhanced fairness to readers, it may be easier to just remove reader optimistic spinning and simplifying the optimistic spinning code as a result. 
Performance measurements (locking throughput in kops/s) using a locking microbenchmark with a 50/50 reader/writer distribution and turbo-boost disabled were done on a 2-socket Cascade Lake system (48-core 96-thread) to see the impact of these changes:

  1) Vanilla - 5.10-rc3 kernel
  2) Before - 5.10-rc3 kernel with previous patches in this series
  3) limit-rspin - 5.10-rc3 kernel with limited reader spinning patch
  4) no-rspin - 5.10-rc3 kernel with reader spinning disabled

  # of threads  CS Load  Vanilla  Before  limit-rspin  no-rspin
  ------------  -------  -------  ------  -----------  --------
        2           1     5,185   5,662      5,214       5,077
        4           1     5,107   4,983      5,188       4,760
        8           1     4,782   4,564      4,720       4,628
       16           1     4,680   4,053      4,567       3,402
       32           1     4,299   1,115      1,118       1,098
       64           1     3,218     983      1,001         957
       96           1     1,938     944        957         930
        2          20     2,008   2,128      2,264       1,665
        4          20     1,390   1,033      1,046       1,101
        8          20     1,472   1,155      1,098       1,213
       16          20     1,332   1,077      1,089       1,122
       32          20       967     914        917         980
       64          20       787     874        891         858
       96          20       730     836        847         844
        2         100       372     356        360         355
        4         100       492     425        434         392
        8         100       533     537        529         538
       16         100       548     572        568         598
       32         100       499     520        527         537
       64         100       466     517        526         512
       96         100       406     497        506         509

The column "CS Load" represents the number of pause instructions issued in the locking critical section. A CS load of 1 is extremely short and is not likely in real situations. Loads of 20 (moderate) and 100 (long) are more realistic. It can be seen that the previous patches in this series have reduced performance in general, except in highly contended cases with moderate or long critical sections, where performance improves a bit. This change is mostly caused by the "Prevent potential lock starvation" patch, which reduces reader optimistic spinning and hence reduces reader fragmentation. The patch that further limits reader optimistic spinning doesn't seem to have too much impact on overall performance as shown in the benchmark data.
The patch that disables reader optimistic spinning shows reduced performance in lightly loaded cases, but comparable or slightly better performance with heavier contention. This patch just removes reader optimistic spinning for now. As readers are not going to do optimistic spinning anymore, we don't need to consider if the OSQ is empty or not when doing lock stealing. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Link: https://lkml.kernel.org/r/20201121041416.12285-6-longman@redhat.com
2020-12-09locking/rwsem: Enable reader optimistic lock stealingWaiman Long
If the optimistic spinning queue is empty and the rwsem does not have the handoff or write-lock bits set, it is actually not necessary to call rwsem_optimistic_spin() to spin on it. Instead, it can steal the lock directly as its reader bias is in the count already. If it is the first reader in this state, it will try to wake up other readers in the wait queue. With this patch applied, the following were the lock event counts after rebooting a 2-socket system and a "make -j96" kernel rebuild. rwsem_opt_rlock=4437 rwsem_rlock=29 rwsem_rlock_steal=19 So lock stealing represents about 0.4% of all the read locks acquired in the slow path. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Link: https://lkml.kernel.org/r/20201121041416.12285-4-longman@redhat.com
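The "about 0.4%" figure can be reproduced from the three event counts quoted above. The following is a minimal sketch, assuming the three counters together account for all slow-path read acquisitions; the function name `steal_share_percent` is made up for illustration and is not part of the kernel.

```c
#include <assert.h>

/*
 * Share, in percent, of slow-path read acquisitions that went through
 * lock stealing. Assumption: the three counters are disjoint and
 * together cover every read lock taken in the slow path.
 */
static double steal_share_percent(unsigned long opt_rlock,
                                  unsigned long rlock,
                                  unsigned long rlock_steal)
{
    unsigned long total = opt_rlock + rlock + rlock_steal;

    assert(total > 0); /* avoid dividing by zero */
    return 100.0 * (double)rlock_steal / (double)total;
}
```

With the counts from the commit message, `steal_share_percent(4437, 29, 19)` evaluates 19 / 4485, which is a little over 0.42%.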
2020-12-09locking/rwsem: Prevent potential lock starvationWaiman Long
The lock handoff bit was added in commit 4f23dbc1e657 ("locking/rwsem: Implement lock handoff to prevent lock starvation") to avoid lock starvation. However, allowing readers to do optimistic spinning does introduce an unlikely scenario where lock starvation can happen. The lock handoff bit may only be set when a waiter is being woken up. In the case of reader unlock, wakeup happens only when the reader count reaches 0. If there is a continuous stream of incoming readers acquiring the read lock via optimistic spinning, it is possible that the reader count may never reach 0 and so the handoff bit will never be asserted. One way to prevent this scenario from happening is to disallow optimistic spinning if the rwsem is currently owned by readers. If the previous or current owner is a writer, optimistic spinning will be allowed. If the previous owner is a reader but the reader count has reached 0 before, a wakeup should have been issued. So the handoff mechanism will kick in to prevent lock starvation. As a result, it should be OK to do optimistic spinning in this case. This patch may have some impact on reader performance as it reduces reader optimistic spinning, especially if the lock critical sections are short and the number of contending readers is small. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Link: https://lkml.kernel.org/r/20201121041416.12285-3-longman@redhat.com
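The spinning rule described in this commit message can be restated as a small predicate. This is only a sketch of the stated policy, not the kernel's code: the structure `owner_snapshot`, its fields, and the function `may_spin` are hypothetical names, and the real rwsem encodes this state in its owner and count words.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the lock state; not the kernel's layout. */
struct owner_snapshot {
    bool owner_is_reader;  /* last recorded owner took the lock for read */
    unsigned int readers;  /* readers holding the lock right now */
};

/* Decide whether an incoming task may spin optimistically. */
static bool may_spin(const struct owner_snapshot *s)
{
    if (!s->owner_is_reader)
        return true;         /* previous or current owner is a writer */

    /*
     * Reader-owned: spin only if the reader count already dropped to 0,
     * because that unlock issued a wakeup and handoff can take effect.
     */
    return s->readers == 0;
}
```

The point of the rule is that a lock still held by readers never gets a spinning newcomer, so the reader count can eventually reach 0 and trigger the wakeup that sets the handoff bit.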
2020-12-09locking/rwsem: Pass the current atomic count to rwsem_down_read_slowpath()Waiman Long
The atomic count value right after reader count increment can be useful to determine the rwsem state at trylock time. So the count value is passed down to rwsem_down_read_slowpath() to be used when appropriate. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Link: https://lkml.kernel.org/r/20201121041416.12285-2-longman@redhat.com
2020-12-09locking/rwsem: Fold __down_{read,write}*()Peter Zijlstra
There's a lot of needless duplication in __down_{read,write}*(), cure that with a helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201207090243.GE3040@hirez.programming.kicks-ass.net
2020-12-09locking/rwsem: Introduce rwsem_write_trylock()Peter Zijlstra
One copy of this logic is better than three. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201207090243.GE3040@hirez.programming.kicks-ass.net
2020-12-09locking/rwsem: Better collate rwsem_read_trylock()Peter Zijlstra
All users of rwsem_read_trylock() do rwsem_set_reader_owned(sem) on success, move it into rwsem_read_trylock() proper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201207090243.GE3040@hirez.programming.kicks-ass.net
2020-12-09Merge branch 'locking/rwsem'Peter Zijlstra
2020-12-09rwsem: Implement down_read_interruptibleEric W. Biederman
In preparation for converting exec_update_mutex to a rwsem so that multiple readers can execute in parallel and not deadlock, add down_read_interruptible. This is needed for perf_event_open to be converted (with no semantic changes) from working on a mutex to working on a rwsem. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/87k0tybqfy.fsf@x220.int.ebiederm.org
2020-12-09rwsem: Implement down_read_killable_nestedEric W. Biederman
In preparation for converting exec_update_mutex to a rwsem so that multiple readers can execute in parallel and not deadlock, add down_read_killable_nested. This is needed so that kcmp_lock can be converted from working on mutexes to working on rw_semaphores. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/87o8jabqh3.fsf@x220.int.ebiederm.org
2020-12-09printk: remove logbuf_lock writer-protection of ringbufferJohn Ogness
Since the ringbuffer is lockless, there is no need for it to be protected by @logbuf_lock. Remove @logbuf_lock writer-protection of the ringbuffer. The reader-protection is not removed because some variables, used by readers, are using @logbuf_lock for synchronization: @syslog_seq, @syslog_time, @syslog_partial, @console_seq, struct kmsg_dumper. For PRINTK_NMI_DIRECT_CONTEXT_MASK, @logbuf_lock usage is not removed because it may be used for dumper synchronization. Without @logbuf_lock synchronization of vprintk_store() it is no longer possible to use the single static buffer for temporarily sprint'ing the message. Instead, use vsnprintf() to determine the length and perform the real vscnprintf() using the area reserved from the ringbuffer. This leads to suboptimal packing of the message data, but will result in less wasted storage than multiple per-cpu buffers to support lockless temporary sprint'ing. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20201209004453.17720-3-john.ogness@linutronix.de
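The "measure first, then format into the reserved area" pattern from this commit message can be shown in userspace C. This is a sketch under stated assumptions: `malloc()` stands in for the ringbuffer reservation, plain `vsnprintf()` is used for both passes because userspace has no `vscnprintf()`, and the helper name `format_exact` is invented for the example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Format a message into an exactly sized area without a shared
 * temporary buffer. malloc() is a stand-in for reserving space in
 * the ringbuffer (an assumption made for this example).
 */
static char *format_exact(size_t *len_out, const char *fmt, ...)
{
    va_list args, args2;
    char *area;
    int len;

    va_start(args, fmt);
    va_copy(args2, args);  /* a va_list may only be consumed once */

    /* Pass 1: NULL buffer with size 0 only reports the needed length. */
    len = vsnprintf(NULL, 0, fmt, args);
    va_end(args);
    if (len < 0) {
        va_end(args2);
        return NULL;
    }

    area = malloc((size_t)len + 1);  /* +1 for the terminating NUL */
    if (area) {
        /* Pass 2: the real formatting, straight into the reserved area. */
        vsnprintf(area, (size_t)len + 1, fmt, args2);
        if (len_out)
            *len_out = (size_t)len;
    }
    va_end(args2);
    return area;
}
```

The cost of this approach is that the arguments are formatted twice. The benefit is that no lock and no per-CPU scratch buffer is needed to hold the intermediate text.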
2020-12-09printk: inline log_output(),log_store() in vprintk_store()John Ogness
In preparation for removing logbuf_lock, inline log_output() and log_store() into vprintk_store(). This will simplify dealing with the various code branches and fallbacks that are possible. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20201209004453.17720-2-john.ogness@linutronix.de
2020-12-09module: delay kobject uevent until after module init callJessica Yu
Apparently there has been a longstanding race between udev/systemd and the module loader. Currently, the module loader sends a uevent right after sysfs initialization, but before the module calls its init function. However, some udev rules expect that the module has initialized already upon receiving the uevent. This race has been triggered recently (see link in references) in some systemd mount unit files. For instance, the configfs module creates the /sys/kernel/config mount point in its init function, however the module loader issues the uevent before this happens. sys-kernel-config.mount expects to be able to mount /sys/kernel/config upon receipt of the module loading uevent, but if the configfs module has not called its init function yet, then this directory will not exist and the mount unit fails. A similar situation exists for sys-fs-fuse-connections.mount, as the fuse sysfs mount point is created during the fuse module's init function. If udev is faster than module initialization then the mount unit would fail in a similar fashion. To fix this race, delay the module KOBJ_ADD uevent until after the module has finished calling its init routine. References: https://github.com/systemd/systemd/issues/17586 Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-By: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com> Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-12-09membarrier: Execute SYNC_CORE on the calling threadAndy Lutomirski
membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented as syncing the core on all sibling threads but not necessarily the calling thread. This behavior is fundamentally buggy and cannot be used safely. Suppose a user program has two threads. Thread A is on CPU 0 and thread B is on CPU 1. Thread A modifies some text and calls membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE). Then thread B executes the modified code. At any point after membarrier() decides which CPUs to target, thread A could be preempted and replaced by thread B on CPU 0. This could even happen on exit from the membarrier() syscall. If this happens, thread B will end up running on CPU 0 without having synced. In principle, this could be fixed by arranging for the scheduler to issue sync_core_before_usermode() whenever switching between two threads in the same mm if there is any possibility of a concurrent membarrier() call, but this would have considerable overhead. Instead, make membarrier() sync the calling CPU as well. As an optimization, this avoids an extra smp_mb() in the default barrier-only mode and an extra rseq preempt on the caller. Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/250ded637696d490c69bef1877148db86066881c.1607058304.git.luto@kernel.org
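For readers unfamiliar with the userspace side, the scenario above corresponds roughly to the following calling pattern. This is a sketch that assumes a Linux system whose headers provide <linux/membarrier.h>; the wrapper names `membarrier_call`, `sync_core_register` and `sync_core_all_threads` are made up here, and the raw syscall is used rather than assuming any libc wrapper.

```c
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper over the raw syscall; returns 0, or -1 with errno set. */
static int membarrier_call(int cmd)
{
    return (int)syscall(__NR_membarrier, cmd, 0, 0);
}

/* One-time registration; the expedited command fails without it. */
static int sync_core_register(void)
{
    return membarrier_call(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE);
}

/*
 * Thread A calls this after modifying code and before any thread of
 * the process is allowed to execute the modified text.
 */
static int sync_core_all_threads(void)
{
    return membarrier_call(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE);
}
```

A process would call `sync_core_register()` once at startup and `sync_core_all_threads()` after each code modification. Both can fail, for example on kernels or architectures without SYNC_CORE support, so the return values should be checked.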
2020-12-09membarrier: Explicitly sync remote cores when SYNC_CORE is requestedAndy Lutomirski
membarrier() does not explicitly sync_core() remote CPUs; instead, it relies on the assumption that an IPI will result in a core sync. On x86, this may be true in practice, but it's not architecturally reliable. In particular, the SDM and APM do not appear to guarantee that interrupt delivery is serializing. While IRET does serialize, IPI return can schedule, thereby switching to another task in the same mm that was sleeping in a syscall. The new task could then SYSRET back to usermode without ever executing IRET. Make this more robust by explicitly calling sync_core_before_usermode() on remote cores. (This also helps people who search the kernel tree for instances of sync_core() and sync_core_before_usermode() -- one might be surprised that the core membarrier code doesn't currently show up in such a search.) Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/776b448d5f7bd6b12690707f5ed67bcda7f1d427.1607058304.git.luto@kernel.org
2020-12-09membarrier: Add an actual barrier before rseq_preempt()Andy Lutomirski
It seems that most RSEQ membarrier users will expect any stores done before the membarrier() syscall to be visible to the target task(s). While this is extremely likely to be true in practice, nothing actually guarantees it by a strict reading of the x86 manuals. Rather than providing this guarantee by accident and potentially causing a problem down the road, just add an explicit barrier. Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/d3e7197e034fa4852afcf370ca49c30496e58e40.1607058304.git.luto@kernel.org