summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-10-31perf unwind: Take pgoff into account when reporting elf to libdwflMilian Wolff
libdwfl parses an ELF file itself and creates mappings for the individual sections. perf on the other hand sees raw mmap events which represent individual sections. When we encounter an address pointing into a mapping with pgoff != 0, we must take that into account and report the file at the non-offset base address. This fixes unwinding with libdwfl in some cases. E.g. for a file like: ``` using namespace std; mutex g_mutex; double worker() { lock_guard<mutex> guard(g_mutex); uniform_real_distribution<double> uniform(-1E5, 1E5); default_random_engine engine; double s = 0; for (int i = 0; i < 1000; ++i) { s += norm(complex<double>(uniform(engine), uniform(engine))); } cout << s << endl; return s; } int main() { vector<std::future<double>> results; for (int i = 0; i < 10000; ++i) { results.push_back(async(launch::async, worker)); } return 0; } ``` Compile it with `g++ -g -O2 -lpthread cpp-locking.cpp -o cpp-locking`, then record it with `perf record --call-graph dwarf -e sched:sched_switch`. When you analyze it with `perf script` and libunwind, you should see: ``` cpp-locking 20038 [005] 54830.236589: sched:sched_switch: prev_comm=cpp-locking prev_pid=20038 prev_prio=120 prev_state=T ==> next_comm=swapper/5 next_pid=0 next_prio=120 ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb1670208 schedule+0x28 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb16737cc rwsem_down_read_failed+0xec (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb1665e04 call_rwsem_down_read_failed+0x14 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb1672a03 down_read+0x13 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb106bd85 __do_page_fault+0x445 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb18015f5 page_fault+0x45 (/lib/modules/4.14.78-1-lts/build/vmlinux) 7f38e4252591 new_heap+0x101 (/usr/lib/libc-2.28.so) 7f38e4252d0b arena_get2.part.4+0x2fb (/usr/lib/libc-2.28.so) 7f38e4255b1c tcache_init.part.6+0xec (/usr/lib/libc-2.28.so) 7f38e42569e5 __GI___libc_malloc+0x115 (inlined) 7f38e4241790 __GI__IO_file_doallocate+0x90 (inlined) 7f38e424fbbf __GI__IO_doallocbuf+0x4f (inlined) 7f38e424ee47 __GI__IO_file_overflow+0x197 (inlined) 7f38e424df36 _IO_new_file_xsputn+0x116 (inlined) 7f38e4242bfb __GI__IO_fwrite+0xdb (inlined) 7f38e463fa6d std::basic_streambuf<char, std::char_traits<char> >::sputn(char const*, long)+0x1cd (inlined) 7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> >::_M_put(char const*, long)+0x1cd (inlined) 7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::__write<char>(std::ostreambuf_iterator<char, std::char_traits<char> >, char const*, int)+0x1cd (inlined) 7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double>(std::ostreambuf_iterator<c> 7f38e464bd70 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::put(std::ostreambuf_iterator<char, std::char_traits<char> >, std::ios_base&, char, double) const+0x90 (inl> 7f38e464bd70 std::ostream& std::ostream::_M_insert<double>(double)+0x90 (/usr/lib/libstdc++.so.6.0.25) 563b9cb502f7 std::ostream::operator<<(double)+0xb7 (inlined) 563b9cb502f7 worker()+0xb7 (/ssd/milian/projects/kdab/rnd/hotspot/build/tests/test-clients/cpp-locking/cpp-locking) 563b9cb506fb double std::__invoke_impl<double, double (*)()>(std::__invoke_other, double (*&&)())+0x2b (inlined) 563b9cb506fb std::__invoke_result<double (*)()>::type std::__invoke<double (*)()>(double (*&&)())+0x2b (inlined) 563b9cb506fb decltype (__invoke((_S_declval<0ul>)())) std::thread::_Invoker<std::tuple<double (*)()> >::_M_invoke<0ul>(std::_Index_tuple<0ul>)+0x2b (inlined) 563b9cb506fb std::thread::_Invoker<std::tuple<double (*)()> >::operator()()+0x2b (inlined) 563b9cb506fb std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<double>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<double (*)()> >, dou> 563b9cb506fb std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_> 563b9cb507e8 std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>::operator()() const+0x28 (inlined) 563b9cb507e8 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)+0x28 (/ssd/milian/> 7f38e46d24fe __pthread_once_slow+0xbe (/usr/lib/libpthread-2.28.so) 563b9cb51149 __gthread_once+0xe9 (inlined) 563b9cb51149 void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)> 563b9cb51149 std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool)+0xe9 (inlined) 563b9cb51149 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<double (*)()> >&&)::{lambda()#1}::op> 563b9cb51149 void std::__invoke_impl<void, std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<double> 563b9cb51149 std::__invoke_result<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<double (*)()> >> 563b9cb51149 decltype (__invoke((_S_declval<0ul>)())) std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_> 563b9cb51149 std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread::_Invoker<std::tuple<dou> 563b9cb51149 std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<double (*)()> >, double>::_Async_state_impl(std::thread> 7f38e45f0062 execute_native_thread_routine+0x12 (/usr/lib/libstdc++.so.6.0.25) 7f38e46caa9c start_thread+0xfc (/usr/lib/libpthread-2.28.so) 7f38e42ccb22 __GI___clone+0x42 (inlined) ``` Before this patch, using libdwfl, you would see: ``` cpp-locking 20038 [005] 54830.236589: sched:sched_switch: prev_comm=cpp-locking prev_pid=20038 prev_prio=120 prev_state=T ==> next_comm=swapper/5 next_pid=0 next_prio=120 ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb1670208 schedule+0x28 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb16737cc rwsem_down_read_failed+0xec (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb1665e04 call_rwsem_down_read_failed+0x14 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb1672a03 down_read+0x13 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb106bd85 __do_page_fault+0x445 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb18015f5 page_fault+0x45 (/lib/modules/4.14.78-1-lts/build/vmlinux) 7f38e4252591 new_heap+0x101 (/usr/lib/libc-2.28.so) a041161e77950c5c [unknown] ([unknown]) ``` With this patch applied, we get a bit further in unwinding: ``` cpp-locking 20038 [005] 54830.236589: sched:sched_switch: prev_comm=cpp-locking prev_pid=20038 prev_prio=120 prev_state=T ==> next_comm=swapper/5 next_pid=0 next_prio=120 ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb166fec5 __sched_text_start+0x545 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb1670208 schedule+0x28 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb16737cc rwsem_down_read_failed+0xec (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb1665e04 call_rwsem_down_read_failed+0x14 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb1672a03 down_read+0x13 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb106bd85 __do_page_fault+0x445 (/lib/modules/4.14.78-1-lts/build/vmlinux) ffffffffb18015f5 page_fault+0x45 (/lib/modules/4.14.78-1-lts/build/vmlinux) 7f38e4252591 new_heap+0x101 (/usr/lib/libc-2.28.so) 7f38e4252d0b arena_get2.part.4+0x2fb (/usr/lib/libc-2.28.so) 7f38e4255b1c tcache_init.part.6+0xec (/usr/lib/libc-2.28.so) 7f38e42569e5 __GI___libc_malloc+0x115 (inlined) 7f38e4241790 __GI__IO_file_doallocate+0x90 (inlined) 7f38e424fbbf __GI__IO_doallocbuf+0x4f (inlined) 7f38e424ee47 __GI__IO_file_overflow+0x197 (inlined) 7f38e424df36 _IO_new_file_xsputn+0x116 (inlined) 7f38e4242bfb __GI__IO_fwrite+0xdb (inlined) 7f38e463fa6d std::basic_streambuf<char, std::char_traits<char> >::sputn(char const*, long)+0x1cd (inlined) 7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> >::_M_put(char const*, long)+0x1cd (inlined) 7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::__write<char>(std::ostreambuf_iterator<char, std::char_traits<char> >, char const*, int)+0x1cd (inlined) 7f38e463fa6d std::ostreambuf_iterator<char, std::char_traits<char> > std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double>(std::ostreambuf_iterator<c> 7f38e464bd70 std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::put(std::ostreambuf_iterator<char, std::char_traits<char> >, std::ios_base&, char, double) const+0x90 (inl> 7f38e464bd70 std::ostream& std::ostream::_M_insert<double>(double)+0x90 (/usr/lib/libstdc++.so.6.0.25) 563b9cb502f7 std::ostream::operator<<(double)+0xb7 (inlined) 563b9cb502f7 worker()+0xb7 (/ssd/milian/projects/kdab/rnd/hotspot/build/tests/test-clients/cpp-locking/cpp-locking) 6eab825c1ee3e4ff [unknown] ([unknown]) ``` Note that the backtrace is still stopping too early, when compared to the nice results obtained via libunwind. It's unclear so far what the reason for that is. Committer note: Further comment by Milian on the thread started on the Link: tag below: --- The remaining issue is due to a bug in elfutils: https://sourceware.org/ml/elfutils-devel/2018-q4/msg00089.html With both patches applied, libunwind and elfutils produce the same output for the above scenario. --- Signed-off-by: Milian Wolff <milian.wolff@kdab.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: http://lkml.kernel.org/r/20181029141644.3907-1-milian.wolff@kdab.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-31perf top: Do not use overwrite mode by defaultArnaldo Carvalho de Melo
Enabling --overwrite mode allows us to to use just the most recent records, which helps in high core count machines such as Knights Landing/Mill, but right now is being disabled by default as the pausing used in this technique is leading to loss of metadata events such as PERF_RECORD_MMAP which makes 'perf top' unable to resolve samples, leading to lots of unknown samples appearing on the UI. Enabling this may be useful if you are in such machines and profiling a workload that doesn't creates short lived threads and/or doesn't uses many executable mmap operations. Work is being planed to solve this situation, till then, this will remain disabled by default. Reported-by: David Miller <davem@davemloft.net> Acked-by: Kan Liang <kan.liang@intel.com> Link: https://lkml.kernel.org/r/4f84468f-37d9-cf1b-12c1-514ef74b6a48@linux.intel.com Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Fixes: ebebbf082357 ("perf top: Switch default mode to overwrite mode") Link: https://lkml.kernel.org/n/tip-ehvf77vi1si9409r7p4wx788@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-30perf top: Allow disabling the overwrite modeArnaldo Carvalho de Melo
In ebebbf082357 ("perf top: Switch default mode to overwrite mode") we forgot to leave a way to disable that new default, add a --overwrite option that can be disabled using --no-overwrite, since the code already in such a way that we can readily disable this mode. This is useful when investigating bugs with this mode like the recent report from David Miller where lots of unknown symbols appear due to disabling the events while processing them which disables all record types, not just PERF_RECORD_SAMPLE, which makes it impossible to resolve maps when we lose PERF_RECORD_MMAP records. This can be easily seen while building a kernel, when there are lots of short lived processes. Reported-by: David Miller <davem@davemloft.net> Acked-by: Kan Liang <kan.liang@intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wang Nan <wangnan0@huawei.com> Fixes: ebebbf082357 ("perf top: Switch default mode to overwrite mode") Link: https://lkml.kernel.org/n/tip-oqgsz2bq4kgrnnajrafcdhie@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-30perf trace: Beautify mount's first pathname argArnaldo Carvalho de Melo
The pathname beautifiers so far support just one augmented pathname per syscall, so do it just for mount's first arg, later this will get fixed. With: # perf probe -l probe:vfs_getname (on getname_flags:73@acme/git/linux/fs/namei.c with pathname) # Later this will get added to augmented_syscalls.c (eBPF): In one xterm: # perf trace -e mount,umount 2687.331 ( 3.544 ms): mount/8892 mount(dev_name: /mnt, dir_name: 0x561f9ac184a0, type: 0x561f9ac1b170, flags: BIND) = 0 3912.126 ( 8.807 ms): umount/8895 umount2(name: /mnt) = 0 ^C# In the other: $ sudo mount --bind /proc /mnt $ sudo umount /mnt Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Benjamin Peterson <benjamin@python.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-qsvhrm2es635cl4zicqjeth2@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-30perf trace: Beautify the umount's 'name' argumentArnaldo Carvalho de Melo
By using the SCA_FILENAME beautifier, that works when either the probe:vfs_getname probe is in place or with the eBPF program tools/perf/examples/bpf/augmented_syscalls.c: # perf probe -l probe:vfs_getname (on getname_flags:73@acme/git/linux/fs/namei.c with pathname) # perf trace -e umount 9630.332 ( 9.521 ms): umount/8082 umount2(name: /mnt) = 0 # The augmented syscalls one will be done in the next patch. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Benjamin Peterson <benjamin@python.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-hegbzlpd2nrn584l5jxn7sy2@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-30perf trace: Consider syscall aliases tooArnaldo Carvalho de Melo
When trying to trace the 'umount' syscall on x86_64 I noticed that it was failing: # trace -e umount umount /mnt event syntax error: 'umount' \___ parser error Run 'perf list' for a list of valid events Usage: perf trace [<options>] [<command>] or: perf trace [<options>] -- <command> [<options>] or: perf trace record [<options>] [<command>] or: perf trace record [<options>] -- <command> [<options>] -e, --event <event> event/syscall selector. use 'perf list' to list available events # This is because in the x86-64 we have it just as 'umount2': $ grep umount arch/x86/entry/syscalls/syscall_64.tbl 166 common umount2 __x64_sys_umount $ So if the syscall name fails, try fallbacking to looking at the aliases we have in the syscall_fmts table to then re-lookup, now: # trace -e umount umount -f /mnt umount: /mnt: not mounted. 1.759 ( 0.004 ms): umount/18365 umount2(name: 0x55fbfcbc4480, flags: 1) = -1 EINVAL Invalid argument # Time to beautify the flags arg :-) Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Benjamin Peterson <benjamin@python.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-ukweodgzbmjd25lfkgryeft1@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-30perf trace beauty: Beautify mount/umount's 'flags' argumentArnaldo Carvalho de Melo
# trace -e mount mount -o ro -t debugfs nodev /mnt 0.000 ( 1.040 ms): mount/27235 mount(dev_name: 0x5601cc8c64e0, dir_name: 0x5601cc8c6500, type: 0x5601cc8c6480, flags: RDONLY) = 0 # trace -e mount mount -o remount,relatime -t debugfs nodev /mnt 0.000 ( 2.946 ms): mount/27262 mount(dev_name: 0x55f4a73d64e0, dir_name: 0x55f4a73d6500, type: 0x55f4a73d6480, flags: REMOUNT|RELATIME) = 0 # trace -e mount mount -o remount,strictatime -t debugfs nodev /mnt 0.000 ( 2.934 ms): mount/27265 mount(dev_name: 0x5617f71d94e0, dir_name: 0x5617f71d9500, type: 0x5617f71d9480, flags: REMOUNT|STRICTATIME) = 0 # trace -e mount mount -o remount,suid,silent -t debugfs nodev /mnt 0.000 ( 0.049 ms): mount/27273 mount(dev_name: 0x55ad65df24e0, dir_name: 0x55ad65df2500, type: 0x55ad65df2480, flags: REMOUNT|SILENT) = 0 # trace -e mount mount -o remount,rw,sync,lazytime -t debugfs nodev /mnt 0.000 ( 2.684 ms): mount/27281 mount(dev_name: 0x561216055530, dir_name: 0x561216055550, type: 0x561216055510, flags: SYNCHRONOUS|REMOUNT|LAZYTIME) = 0 # trace -e mount mount -o remount,dirsync -t debugfs nodev /mnt 0.000 ( 3.512 ms): mount/27314 mount(dev_name: 0x55c4e7188480, dir_name: 0x55c4e7188530, type: 0x55c4e71884a0, flags: REMOUNT|DIRSYNC, data: 0x55c4e71884e0) = 0 # Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Benjamin Peterson <benjamin@python.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-i5ncao73c0bd02qprgrq6wb9@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-30perf trace beauty: Allow syscalls to mask an argument before considering itArnaldo Carvalho de Melo
Take mount's 'flags' arg, to cope with this semantic, as defined in do_mount in fs/namespace.c: /* * Pre-0.97 versions of mount() didn't have a flags word. When the * flags word was introduced its top half was required to have the * magic value 0xC0ED, and this remained so until 2.4.0-test9. * Therefore, if this magic number is present, it carries no * information and must be discarded. */ We need to mask this arg, and then see if it is zero, when we simply don't print the arg name and value. The next patch will use this for mount's 'flag' arg. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Benjamin Peterson <benjamin@python.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-btue14k5jemayuykfrwsnh85@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-30perf beauty: Introduce strarray__scnprintf_flags()Arnaldo Carvalho de Melo
Generalizing pkey_alloc__scnprintf_access_rights(), so that we can use it with other flags-like arguments, such as mount's mountflags argument. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Benjamin Peterson <benjamin@python.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-o3ymi3104m8moaz9865g09w9@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-30perf beauty: Switch from GPL v2.0 to LGPL v2.1Arnaldo Carvalho de Melo
The intention is to have this as a library, since it is not perf specific at all. I did the switch for the files where I'm the only contributor, with the exception of a few lines changed by Jiri Olsa. Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-a04q6chdyjknm1hr305ulx8h@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-30perf beauty: Add a generator for MS_ mount/umount's flag constantsArnaldo Carvalho de Melo
It'll use tools/include copy of linux/fs.h to generate a table to be used by tools, initially by the 'mount' and 'umount' beautifiers in 'perf trace', but that could also be used to translate from a string constant to the integer value to be used in a eBPF or tracefs tracepoint filter. When used without any args it produces: $ tools/perf/trace/beauty/mount_flags.sh static const char *mount_flags[] = { [1 ? (ilog2(1) + 1) : 0] = "RDONLY", [2 ? (ilog2(2) + 1) : 0] = "NOSUID", [4 ? (ilog2(4) + 1) : 0] = "NODEV", [8 ? (ilog2(8) + 1) : 0] = "NOEXEC", [16 ? (ilog2(16) + 1) : 0] = "SYNCHRONOUS", [32 ? (ilog2(32) + 1) : 0] = "REMOUNT", [64 ? (ilog2(64) + 1) : 0] = "MANDLOCK", [128 ? (ilog2(128) + 1) : 0] = "DIRSYNC", [1024 ? (ilog2(1024) + 1) : 0] = "NOATIME", [2048 ? (ilog2(2048) + 1) : 0] = "NODIRATIME", [4096 ? (ilog2(4096) + 1) : 0] = "BIND", [8192 ? (ilog2(8192) + 1) : 0] = "MOVE", [16384 ? (ilog2(16384) + 1) : 0] = "REC", [32768 ? (ilog2(32768) + 1) : 0] = "SILENT", [16 + 1] = "POSIXACL", [17 + 1] = "UNBINDABLE", [18 + 1] = "PRIVATE", [19 + 1] = "SLAVE", [20 + 1] = "SHARED", [21 + 1] = "RELATIME", [22 + 1] = "KERNMOUNT", [23 + 1] = "I_VERSION", [24 + 1] = "STRICTATIME", [25 + 1] = "LAZYTIME", [26 + 1] = "SUBMOUNT", [27 + 1] = "NOREMOTELOCK", [28 + 1] = "NOSEC", [29 + 1] = "BORN", [30 + 1] = "ACTIVE", [31 + 1] = "NOUSER", }; $ Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Benjamin Peterson <benjamin@python.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-mgutbbkmip9gfnmd28ikg7xt@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-30tools include uapi: Grab a copy of linux/fs.hArnaldo Carvalho de Melo
We'll use it to create tables for the 'flags' argument to the 'mount' and 'umount' syscalls. Add it to check_headers.sh so that when a new protocol gets added we get a notification during the build process. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Benjamin Peterson <benjamin@python.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-yacf9jvkwfwg2g95r2us3xb3@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-10-30perf/core: Clean up inconsisent indentationColin Ian King
Replace a bunch of spaces with tab, cleans up indentation Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-janitors@vger.kernel.org Link: http://lkml.kernel.org/r/20181029233211.21475-1-colin.king@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-29Merge branch 'linus' into perf/urgent, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-27i2c-hid: properly terminate i2c_hid_dmi_desc_override_table[] arrayLinus Torvalds
Commit 9ee3e06610fd ("HID: i2c-hid: override HID descriptors for certain devices") added a new dmi_system_id quirk table to override certain HID report descriptors for some systems that lack them. But the table wasn't properly terminated, causing the dmi matching to walk off into la-la-land, and starting to treat random data as dmi descriptor pointers, causing boot-time oopses if you were at all unlucky. Terminate the array. We really should have some way to just statically check that arrays that should be terminated by an empty entry actually are so. But the HID people really should have caught this themselves, rather than have me deal with an oops during the merge window. Tssk, tssk. Cc: Julian Sax <jsbc@gmx.de> Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: - a few misc things - ocfs2 updates - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits) hugetlbfs: dirty pages as they are added to pagecache mm: export add_swap_extent() mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page() mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t mm/gup: cache dev_pagemap while pinning pages Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" mm: return zero_resv_unavail optimization mm: zero remaining unavailable struct pages tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option tools/testing/selftests/vm/gup_benchmark.c: allow user specified file tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage mm/gup_benchmark.c: add additional pinning methods mm/gup_benchmark.c: time put_page() mm: don't raise MEMCG_OOM event due to failed high-order allocation mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock ...
2018-10-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "What better way to start off a weekend than with some networking bug fixes: 1) net namespace leak in dump filtering code of ipv4 and ipv6, fixed by David Ahern and Bjørn Mork. 2) Handle bad checksums from hardware when using CHECKSUM_COMPLETE properly in UDP, from Sean Tranchetti. 3) Remove TCA_OPTIONS from policy validation, it turns out we don't consistently use nested attributes for this across all packet schedulers. From David Ahern. 4) Fix SKB corruption in cadence driver, from Tristram Ha. 5) Fix broken WoL handling in r8169 driver, from Heiner Kallweit. 6) Fix OOPS in pneigh_dump_table(), from Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (28 commits) net/neigh: fix NULL deref in pneigh_dump_table() net: allow traceroute with a specified interface in a vrf bridge: do not add port to router list when receives query with source 0.0.0.0 net/smc: fix smc_buf_unuse to use the lgr pointer ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called net/{ipv4,ipv6}: Do not put target net if input nsid is invalid lan743x: Remove SPI dependency from Microchip group. drivers: net: remove <net/busy_poll.h> inclusion when not needed net: phy: genphy_10g_driver: Avoid NULL pointer dereference r8169: fix broken Wake-on-LAN from S5 (poweroff) octeontx2-af: Use GFP_ATOMIC under spin lock net: ethernet: cadence: fix socket buffer corruption problem net/ipv6: Allow onlink routes to have a device mismatch if it is the default route net: sched: Remove TCA_OPTIONS from policy ice: Poll for link status change ice: Allocate VF interrupts and set queue map ice: Introduce ice_dev_onetime_setup net: hns3: Fix for warning uninitialized symbol hw_err_lst3 octeontx2-af: Copy the right amount of memory net: udp: fix handling of CHECKSUM_COMPLETE packets ...
2018-10-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds
Pull sparc fixes from David Miller: "Some more sparc fixups, mostly aimed at getting the allmodconfig build up and clean again" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc64: Rework xchg() definition to avoid warnings. sparc64: Export __node_distance. sparc64: Make corrupted user stacks more debuggable.
2018-10-26hugetlbfs: dirty pages as they are added to pagecacheMike Kravetz
Some test systems were experiencing negative huge page reserve counts and incorrect file block counts. This was traced to /proc/sys/vm/drop_caches removing clean pages from hugetlbfs file pagecaches. When non-hugetlbfs explicit code removes the pages, the appropriate accounting is not performed. This can be recreated as follows: fallocate -l 2M /dev/hugepages/foo echo 1 > /proc/sys/vm/drop_caches fallocate -l 2M /dev/hugepages/foo grep -i huge /proc/meminfo AnonHugePages: 0 kB ShmemHugePages: 0 kB HugePages_Total: 2048 HugePages_Free: 2047 HugePages_Rsvd: 18446744073709551615 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 4194304 kB ls -lsh /dev/hugepages/foo 4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo To address this issue, dirty pages as they are added to pagecache. This can easily be reproduced with fallocate as shown above. Read faulted pages will eventually end up being marked dirty. But there is a window where they are clean and could be impacted by code such as drop_caches. So, just dirty them all as they are added to the pagecache. Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com Fixes: 6bda666a03f0 ("hugepages: fold find_or_alloc_pages into huge_no_page()") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Mihcla Hocko <mhocko@suse.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: export add_swap_extent()Omar Sandoval
Btrfs currently does not support swap files because swap's use of bmap does not work with copy-on-write and multiple devices. See 35054394c4b3 ("Btrfs: stop providing a bmap operation to avoid swapfile corruptions"). However, the swap code has a mechanism for the filesystem to manually add swap extents using add_swap_extent() from the ->swap_activate() aop. iomap has done this since 67482129cdab ("iomap: add a swapfile activation function"). Btrfs will do the same in a later patch, so export add_swap_extent(). Link: http://lkml.kernel.org/r/bb1208575e02829aae51b538709476964f97b1ea.1536704650.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Sterba <dsterba@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: split SWP_FILE into SWP_ACTIVATED and SWP_FSOmar Sandoval
The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go through the filesystem, and to make swapoff() call ->swap_deactivate(). For Btrfs, we want the latter but not the former, so split this flag into two. This makes us always call ->swap_deactivate() if ->swap_activate() succeeded, not just if it didn't add any swap extents itself. This also resolves the issue of the very misleading name of SWP_FILE, which is only used for swap files over NFS. Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26tools/testing/selftests/vm/map_fixed_noreplace.c: add test for ↵Michael Ellerman
MAP_FIXED_NOREPLACE Add a test for MAP_FIXED_NOREPLACE, based on some code originally by Jann Horn. This would have caught the overlap bug reported by Daniel Micay. I originally suggested to Michal that we create MAP_FIXED_NOREPLACE, but instead of writing a selftest I spent my time bike-shedding whether it should be called MAP_FIXED_SAFE/NOCLOBBER/WEAK/NEW .. mea culpa. Link: http://lkml.kernel.org/r/20181013133929.28653-1-mpe@ellerman.id.au Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jann Horn <jannh@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Joel Stanley <joel@jms.id.au> Cc: Jason Evans <jasone@google.com> Cc: David Goldblatt <davidtgoldblatt@gmail.com> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()Andrea Arcangeli
There should be no cache left by the time we overwrite the old transhuge pmd with the new one. It's already too late to flush through the virtual address because we already copied the page data to the new physical address. So flush the cache before the data copy. Also delete the "end" variable to shutoff a "unused variable" warning on x86 where flush_cache_range() is a noop. Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()Andrea Arcangeli
change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB right away. do_huge_pmd_numa_page() flushes the TLB before calling migrate_misplaced_transhuge_page(). By the time do_huge_pmd_numa_page() runs some CPU could still access the page through the TLB. change_huge_pmd() before arming the numa/protnone transhuge pmd calls mmu_notifier_invalidate_range_start(). So there's no need of mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end() sequence in migrate_misplaced_transhuge_page() too, because by the time migrate_misplaced_transhuge_page() runs, the pmd mapping has already been invalidated in the secondary MMUs. It has to or if a secondary MMU can still write to the page, the migrate_page_copy() would lose data. However an explicit mmu_notifier_invalidate_range() is needed before migrate_misplaced_transhuge_page() starts copying the data of the transhuge page or the below can happen for MMU notifier users sharing the primary MMU pagetables and only implementing ->invalidate_range: CPU0 CPU1 GPU sharing linux pagetables using only ->invalidate_range ----------- ------------ --------- GPU secondary MMU writes to the page mapped by the transhuge pmd change_pmd_range() mmu..._range_start() ->invalidate_range_start() noop change_huge_pmd() set_pmd_at(numa/protnone) pmd_unlock() do_huge_pmd_numa_page() CPU TLB flush globally (1) CPU cannot write to page migrate_misplaced_transhuge_page() GPU writes to the page... migrate_page_copy() ...GPU stops writing to the page CPU TLB flush (2) mmu..._range_end() (3) ->invalidate_range_stop() noop ->invalidate_range() GPU secondary MMU is invalidated and cannot write to the page anymore (too late) Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives too late, we also need a mmu_notifier_invalidate_range() before calling migrate_misplaced_transhuge_page(), because the ->invalidate_range() in (3) also arrives too late. This requirement is the result of the lazy optimization in change_huge_pmd() that releases the pmd_lock without first flushing the TLB and without first calling mmu_notifier_invalidate_range(). Even converting the removed mmu_notifier_invalidate_range_only_end() into a mmu_notifier_invalidate_range_end() would not have been enough to fix this, because it run after migrate_page_copy(). After the hugepage data copy is done migrate_misplaced_transhuge_page() can proceed and call set_pmd_at without having to flush the TLB nor any secondary MMUs because the secondary MMU invalidate, just like the CPU TLB flush, has to happen before the migrate_page_copy() is called or it would be a bug in the first place (and it was for drivers using ->invalidate_range()). KVM is unaffected because it doesn't implement ->invalidate_range(). The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and uses the generic migrate_pages which transitions the pte from numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs and all mmu notifiers there before copying the page. Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race conditionAndrea Arcangeli
Patch series "migrate_misplaced_transhuge_page race conditions". Aaron found a new instance of the THP MADV_DONTNEED race against pmdp_clear_flush* variants, that was apparently left unfixed. While looking into the race found by Aaron, I may have found two more issues in migrate_misplaced_transhuge_page. These race conditions would not cause kernel instability, but they'd corrupt userland data or leave data non zero after MADV_DONTNEED. I did only minor testing, and I don't expect to be able to reproduce this (especially the lack of ->invalidate_range before migrate_page_copy, requires the latest iommu hardware or infiniband to reproduce). The last patch is noop for x86 and it needs further review from maintainers of archs that implement flush_cache_range() (not in CC yet). To avoid confusion, it's not the first patch that introduces the bug fixed in the second patch, even before removing the pmdp_huge_clear_flush_notify, that _notify suffix was called after migrate_page_copy already run. This patch (of 3): This is a corollary of ced108037c2aa ("thp: fix MADV_DONTNEED vs. numa balancing race"), 58ceeb6bec8 ("thp: fix MADV_DONTNEED vs. MADV_FREE race") and 5b7abeae3af8c ("thp: fix MADV_DONTNEED vs clear soft dirty race). When the above three fixes where posted Dave asked https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com but apparently this was missed. The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced in a54a407fbf7 ("mm: Close races between THP migration and PMD numa clearing"). The important part of such commit is only the part where the page lock is not released until the first do_huge_pmd_numa_page() finished disarming the pagenuma/protnone. The addition of pmdp_clear_flush() wasn't beneficial to such commit and there's no commentary about such an addition either. I guess the pmdp_clear_flush() in such commit was added just in case for safety, but it ended up introducing the MADV_DONTNEED race condition found by Aaron. At that point in time nobody thought of such kind of MADV_DONTNEED race conditions yet (they were fixed later) so the code may have looked more robust by adding the pmdp_clear_flush(). This specific race condition won't destabilize the kernel, but it can confuse userland because after MADV_DONTNEED the memory won't be zeroed out. This also optimizes the code and removes a superfluous TLB flush. [akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)] Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Aaron Tomlin <atomlin@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_tClark Williams
The static lock quarantine_lock is used in quarantine.c to protect the quarantine queue datastructures. It is taken inside quarantine queue manipulation routines (quarantine_put(), quarantine_reduce() and quarantine_remove_cache()), with IRQs disabled. This is not a problem on a stock kernel but is problematic on an RT kernel where spin locks are sleeping spinlocks, which can sleep and can not be acquired with disabled interrupts. Convert the quarantine_lock to a raw spinlock_t. The usage of quarantine_lock is confined to quarantine.c and the work performed while the lock is held is used for debug purpose. [bigeasy@linutronix.de: slightly altered the commit message] Link: http://lkml.kernel.org/r/20181010214945.5owshc3mlrh74z4b@linutronix.de Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/gup: cache dev_pagemap while pinning pagesKeith Busch
Getting pages from ZONE_DEVICE memory needs to check the backing device's live-ness, which is tracked in the device's dev_pagemap metadata. This metadata is stored in a radix tree and looking it up adds measurable software overhead. This patch avoids repeating this relatively costly operation when dev_pagemap is used by caching the last dev_pagemap while getting user pages. The gup_benchmark kernel self test reports this reduces time to get user pages to as low as 1/3 of the previous time. Link: http://lkml.kernel.org/r/20181012173040.15669-1-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"Masayoshi Mizuma
commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved") breaks movable_node kernel option because it changed the memory gap range to reserved memblock. So, the node is marked as Normal zone even if the SRAT has Hot pluggable affinity. ===================================================================== kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable ... kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0 ... kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug ... kernel: Movable zone start for each node kernel: Node 3: 0x00001c0000000000 kernel: Early memory node ranges ... ===================================================================== The original issue is fixed by the former patches, so let's revert commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"). Link: http://lkml.kernel.org/r/20181002143821.5112-4-msys.mizuma@gmail.com Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: return zero_resv_unavail optimizationPavel Tatashin
When checking for valid pfns in zero_resv_unavail(), it is not necessary to verify that pfns within pageblock_nr_pages ranges are valid, only the first one needs to be checked. This is because memory for pages are allocated in contiguous chunks that contain pageblock_nr_pages struct pages. Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: zero remaining unavailable struct pagesNaoya Horiguchi
Patch series "mm: Fix for movable_node boot option", v3. This patch series contains a fix for the movable_node boot option issue which was introduced by commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"). The commit breaks the option because it changed the memory gap range to reserved memblock. So, the node is marked as Normal zone even if the SRAT has Hot pluggable affinity. First and second patch fix the original issue which the commit tried to fix, then revert the commit. This patch (of 3): There is a kernel panic that is triggered when reading /proc/kpageflags on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]': BUG: unable to handle kernel paging request at fffffffffffffffe PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014 RIP: 0010:stable_page_flags+0x27/0x3c0 Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7 RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202 RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0 RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001 R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0 R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10 FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0 Call Trace: kpageflags_read+0xc7/0x120 proc_reg_read+0x3c/0x60 __vfs_read+0x36/0x170 vfs_read+0x89/0x130 ksys_pread64+0x71/0x90 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7efc42e75e23 Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24 According to kernel bisection, this problem became visible due to commit f7f99100d8d9 which changes how struct pages are initialized. Memblock layout affects the pfn ranges covered by node/zone. Consider that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the default (no memmap= given) memblock layout is like below: MEMBLOCK configuration: memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x4 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0 memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... If you give memmap=1G!4G (so it just covers memory[0x2]), the range [0x100000000-0x13fffffff] is gone: MEMBLOCK configuration: memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x3 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... This causes shrinking node 0's pfn range because it is calculated by the address range of memblock.memory. So some of struct pages in the gap range are left uninitialized. We have a function zero_resv_unavail() which does zeroing the struct pages outside memblock.memory, but currently it covers only the reserved unavailable range (i.e. memblock.memory && !memblock.reserved). This patch extends it to cover all unavailable range, which fixes the reported issue. Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Tested-by: Oscar Salvador <osalvador@suse.de> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB optionKeith Busch
Add a new option '-H' to the gup benchmark to help understand how hugetlb mapping pages compare with the default. Link: http://lkml.kernel.org/r/20181010195605.10689-6-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED optionKeith Busch
Add a new benchmark option, -S, to request MAP_SHARED. This can be used to compare with MAP_PRIVATE, or for files that require this option, like dax. Link: http://lkml.kernel.org/r/20181010195605.10689-5-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26tools/testing/selftests/vm/gup_benchmark.c: allow user specified fileKeith Busch
Allow a user to specify a file to map by adding a new option, '-f', providing a means to test various file backings. If not specified, the benchmark will use a private mapping of /dev/zero, which produces an anonymous mapping as before. [akpm@linux-foundation.org: avoid using comma operator] Link: http://lkml.kernel.org/r/20181010195605.10689-4-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usageKeith Busch
If the '-w' parameter was provided, the benchmark would exit due to a mssing 'break'. Link: http://lkml.kernel.org/r/20181010195605.10689-3-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/gup_benchmark.c: add additional pinning methodsKeith Busch
Provide new gup benchmark ioctl commands to run different user page pinning methods, get_user_pages_longterm() and get_user_pages(), in addition to the existing get_user_pages_fast(). Link: http://lkml.kernel.org/r/20181010195605.10689-2-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/gup_benchmark.c: time put_page()Keith Busch
We'd like to measure time to unpin user pages, so this adds a second benchmark timer on put_page, separate from get_page. Adding the field breaks this ioctl ABI, but should be okay since this an in-tree kernel selftest. [akpm@linux-foundation.org: add expansion to struct gup_benchmark for future use] Link: http://lkml.kernel.org/r/20181010195605.10689-1-keith.busch@intel.com Signed-off-by: Keith Busch <keith.busch@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: don't raise MEMCG_OOM event due to failed high-order allocationRoman Gushchin
It was reported that on some of our machines containers were restarted with OOM symptoms without an obvious reason. Despite there were almost no memory pressure and plenty of page cache, MEMCG_OOM event was raised occasionally, causing the container management software to think, that OOM has happened. However, no tasks have been killed. The following investigation showed that the problem is caused by a failing attempt to charge a high-order page. In such case, the OOM killer is never invoked. As shown below, it can happen under conditions, which are very far from a real OOM: e.g. there is plenty of clean page cache and no memory pressure. There is no sense in raising an OOM event in this case, as it might confuse a user and lead to wrong and excessive actions (e.g. restart the workload, as in my case). Let's look at the charging path in try_charge(). If the memory usage is about memory.max, which is absolutely natural for most memory cgroups, we try to reclaim some pages. Even if we were able to reclaim enough memory for the allocation, the following check can fail due to a race with another concurrent allocation: if (mem_cgroup_margin(mem_over_limit) >= nr_pages) goto retry; For regular pages the following condition will save us from triggering the OOM: if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER)) goto retry; But for high-order allocation this condition will intentionally fail. The reason behind is that we'll likely fall to regular pages anyway, so it's ok and even preferred to return ENOMEM. In this case the idea of raising MEMCG_OOM looks dubious. Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation order check, so that the event won't be raised for high order allocations. This change doesn't affect regular pages allocation and charging. Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlockDave Chinner
We've recently seen a workload on XFS filesystems with a repeatable deadlock between background writeback and a multi-process application doing concurrent writes and fsyncs to a small range of a file. range_cyclic writeback Process 1 Process 2 xfs_vm_writepages write_cache_pages writeback_index = 2 cycled = 0 .... find page 2 dirty lock Page 2 ->writepage page 2 writeback page 2 clean page 2 added to bio no more pages write() locks page 1 dirties page 1 locks page 2 dirties page 1 fsync() .... xfs_vm_writepages write_cache_pages start index 0 find page 1 towrite lock Page 1 ->writepage page 1 writeback page 1 clean page 1 added to bio find page 2 towrite lock Page 2 page 2 is writeback <blocks> write() locks page 1 dirties page 1 fsync() .... xfs_vm_writepages write_cache_pages start index 0 !done && !cycled sets index to 0, restarts lookup find page 1 dirty find page 1 towrite lock Page 1 page 1 is writeback <blocks> lock Page 1 <blocks> DEADLOCK because: - process 1 needs page 2 writeback to complete to make enough progress to issue IO pending for page 1 - writeback needs page 1 writeback to complete so process 2 can progress and unlock the page it is blocked on, then it can issue the IO pending for page 2 - process 2 can't make progress until process 1 issues IO for page 1 The underlying cause of the problem here is that range_cyclic writeback is processing pages in descending index order as we hold higher index pages in a structure controlled from above write_cache_pages(). The write_cache_pages() caller needs to be able to submit these pages for IO before write_cache_pages restarts writeback at mapping index 0 to avoid wcp inverting the page lock/writeback wait order. generic_writepages() is not susceptible to this bug as it has no private context held across write_cache_pages() - filesystems using this infrastructure always submit pages in ->writepage immediately and so there is no problem with range_cyclic going back to mapping index 0. However: mpage_writepages() has a private bio context, exofs_writepages() has page_collect fuse_writepages() has fuse_fill_wb_data nfs_writepages() has nfs_pageio_descriptor xfs_vm_writepages() has xfs_writepage_ctx All of these ->writepages implementations can hold pages under writeback in their private structures until write_cache_pages() returns, and hence they are all susceptible to this deadlock. Also worth noting is that ext4 has it's own bastardised version of write_cache_pages() and so it /may/ have an equivalent deadlock. I looked at the code long enough to understand that it has a similar retry loop for range_cyclic writeback reaching the end of the file and then promptly ran away before my eyes bled too much. I'll leave it for the ext4 developers to determine if their code is actually has this deadlock and how to fix it if it has. There's a few ways I can see avoid this deadlock. There's probably more, but these are the first I've though of: 1. get rid of range_cyclic altogether 2. range_cyclic always stops at EOF, and we start again from writeback index 0 on the next call into write_cache_pages() 2a. wcp also returns EAGAIN to ->writepages implementations to indicate range cyclic has hit EOF. writepages implementations can then flush the current context and call wpc again to continue. i.e. lift the retry into the ->writepages implementation 3. range_cyclic uses trylock_page() rather than lock_page(), and it skips pages it can't lock without blocking. It will already do this for pages under writeback, so this seems like a no-brainer 3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid blocking as per pages under writeback. I don't think #1 is an option - range_cyclic prevents frequently dirtied lower file offset from starving background writeback of rarely touched higher file offsets. #2 is simple, and I don't think it will have any impact on performance as going back to the start of the file implies an immediate seek. We'll have exactly the same number of seeks if we switch writeback to another inode, and then come back to this one later and restart from index 0. #2a is pretty much "status quo without the deadlock". Moving the retry loop up into the wcp caller means we can issue IO on the pending pages before calling wcp again, and so avoid locking or waiting on pages in the wrong order. I'm not convinced we need to do this given that we get the same thing from #2 on the next writeback call from the writeback infrastructure. #3 is really just a band-aid - it doesn't fix the access/wait inversion problem, just prevents it from becoming a deadlock situation. I'd prefer we fix the inversion, not sweep it under the carpet like this. #3a is really an optimisation that just so happens to include the band-aid fix of #3. So it seems that the simplest way to fix this issue is to implement solution #2 Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.de> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: move mirrored memory specific code outside of memmap_init_zonePavel Tatashin
memmap_init_zone, is getting complex, because it is called from different contexts: hotplug, and during boot, and also because it must handle some architecture quirks. One of them is mirrored memory. Move the code that decides whether to skip mirrored memory outside of memmap_init_zone, into a separate function. [pasha.tatashin@oracle.com: uninline overlap_memmap_init()] Link: http://lkml.kernel.org/r/20180726193509.3326-4-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180724235520.10200-4-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: calculate deferred pages after skipping mirrored memoryPavel Tatashin
update_defer_init() should be called only when struct page is about to be initialized. Because it counts number of initialized struct pages, but there we may skip struct pages if there is some mirrored memory. So move, update_defer_init() after checking for mirrored memory. Also, rename update_defer_init() to defer_init() and reverse the return boolean to emphasize that this is a boolean function, that tells that the reset of memmap initialization should be deferred. Make this function self-contained: do not pass number of already initialized pages in this zone by using static counters. I found this bug by reading the code. The effect is that fewer than expected struct pages are initialized early in boot, and it is possible that in some corner cases we may fail to boot when mirrored pages are used. The deferred on demand code should somewhat mitigate this. But this still brings some inconsistencies compared to when booting without mirrored pages, so it is better to fix. [pasha.tatashin@oracle.com: add comment about defer_init's lack of locking] Link: http://lkml.kernel.org/r/20180726193509.3326-3-pasha.tatashin@oracle.com [akpm@linux-foundation.org: make defer_init non-inline, __meminit] Link: http://lkml.kernel.org/r/20180724235520.10200-3-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: make memmap_init a proper functionPavel Tatashin
memmap_init is sometimes a macro sometimes a function based on __HAVE_ARCH_MEMMAP_INIT. It is only a function on ia64. Make memmap_init a weak function instead, and let ia64 redefine it. Link: http://lkml.kernel.org/r/20180724235520.10200-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/memcontrol.c: convert mem_cgroup_id::ref to refcount_t typeKirill Tkhai
This will allow to use generic refcount_t interfaces to check counters overflow instead of currently existing VM_BUG_ON(). The only difference after the patch is VM_BUG_ON() may cause BUG(), while refcount_t fires with WARN(). But this seems not to be significant here, since such the problems are usually caught by syzbot with panic-on-warn enabled. Link: http://lkml.kernel.org/r/153910718919.7006.13400779039257185427.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Andrea Parri <andrea.parri@amarulasolutions.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/page_alloc.c: initialize num_movable in move_freepages()David Rientjes
If move_freepages_block() returns 0 because !zone_spans_pfn(), *num_movable can hold the value from the stack because it does not get initialized in move_freepages(). Move the initialization to move_freepages_block() to guarantee the value actually makes sense. This currently doesn't affect its only caller where num_movable != NULL, so no bug fix, but just more robust. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1810051355490.212229@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/zsmalloc.c: fix fall-through annotationGustavo A. R. Silva
Replace "fallthru" with a proper "fall through" annotation. This fix is part of the ongoing efforts to enabling -Wimplicit-fallthrough Link: http://lkml.kernel.org/r/20181003105114.GA24423@embeddedor.com Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26userfaultfd: selftest: recycle lock threads firstPeter Xu
Now we recycle the uffd servicing threads earlier than the lock threads. It might happen that when the lock thread is still blocked at a pthread mutex lock while the servicing thread has already quitted for the cpu so the lock thread will be blocked forever and hang the test program. To fix the possible race, recycle the lock threads first. This never happens with current missing-only tests, but when I start to run the write-protection tests (the feature is not yet posted upstream) it happens every time of the run possibly because in that new test we'll need to service two page faults for each lock operation. Link: http://lkml.kernel.org/r/20180930074259.18229-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Shaohua Li <shli@fb.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26userfaultfd: selftest: generalize read and pollPeter Xu
We do very similar things in read and poll modes, but we're copying the codes around. Share the codes properly on reading the message and handling the page fault to make the code cleaner. Meanwhile this solves previous mismatch of behaviors between the two modes on that the old code: - did not check EAGAIN case in read() mode - ignored BOUNCE_VERIFY check in read() mode Link: http://lkml.kernel.org/r/20180930074259.18229-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Shaohua Li <shli@fb.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26userfaultfd: selftest: cleanup help messagesPeter Xu
Firstly, the help in the comment region is obsolete, now we support three parameters. Since at it, change it and move it into the help message of the program. Also, the help messages dumped here and there is obsolete too. Use a single usage() helper. Link: http://lkml.kernel.org/r/20180930074259.18229-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Shaohua Li <shli@fb.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/vmstat.c: assert that vmstat_text is in sync with stat_items_sizeJann Horn
Having two gigantic arrays that must manually be kept in sync, including ifdefs, isn't exactly robust. To make it easier to catch such issues in the future, add a BUILD_BUG_ON(). Link: http://lkml.kernel.org/r/20181001143138.95119-3-jannh@google.com Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm/memory.c: recheck page table entry with page table lock heldAneesh Kumar K.V
We clear the pte temporarily during read/modify/write update of the pte. If we take a page fault while the pte is cleared, the application can get SIGBUS. One such case is with remap_pfn_range without a backing vm_ops->fault callback. do_fault will return SIGBUS in that case. cpu 0 cpu1 mprotect() ptep_modify_prot_start()/pte cleared. . . page fault. . . prep_modify_prot_commit() Fix this by taking page table lock and rechecking for pte_none. [aneesh.kumar@linux.ibm.com: fix crash observed with syzkaller run] Link: http://lkml.kernel.org/r/87va6bwlfg.fsf@linux.ibm.com Link: http://lkml.kernel.org/r/20180926031858.9692-1-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ido Schimmel <idosch@idosch.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: dax: add comment for PFN_SPECIALYang Shi
The comment for PFN_SPECIAL is missed in pfn_t.h. Add comment to get consistent with other pfn flags. Link: http://lkml.kernel.org/r/1538086549-100536-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>