Age | Commit message (Collapse) | Author |
|
Introduce the basic control files to account, partition, and limit
memory using cgroups in default hierarchy mode.
This interface versioning allows us to address fundamental design
issues in the existing memory cgroup interface, further explained
below. The old interface will be maintained indefinitely, but a
clearer model and improved workload performance should encourage
existing users to switch over to the new one eventually.
The control files are thus:
- memory.current shows the current consumption of the cgroup and its
descendants, in bytes.
- memory.low configures the lower end of the cgroup's expected
memory consumption range. The kernel considers memory below that
boundary to be a reserve - the minimum that the workload needs in
order to make forward progress - and generally avoids reclaiming
it, unless there is an imminent risk of entering an OOM situation.
- memory.high configures the upper end of the cgroup's expected
memory consumption range. A cgroup whose consumption grows beyond
this threshold is forced into direct reclaim, to work off the
excess and to throttle new allocations heavily, but is generally
allowed to continue and the OOM killer is not invoked.
- memory.max configures the hard maximum amount of memory that the
cgroup is allowed to consume before the OOM killer is invoked.
- memory.events shows event counters that indicate how often the
cgroup was reclaimed while below memory.low, how often it was
forced to reclaim excess beyond memory.high, how often it hit
memory.max, and how often it entered OOM due to memory.max. This
allows users to identify configuration problems when observing a
degradation in workload performance. An overcommitted system will
have an increased rate of low boundary breaches, whereas increased
rates of high limit breaches, maximum hits, or even OOM situations
will indicate internally overcommitted cgroups.
For existing users of memory cgroups, the following deviations from
the current interface are worth pointing out and explaining:
- The original lower boundary, the soft limit, is defined as a limit
that is per default unset. As a result, the set of cgroups that
global reclaim prefers is opt-in, rather than opt-out. The costs
for optimizing these mostly negative lookups are so high that the
implementation, despite its enormous size, does not even provide
the basic desirable behavior. First off, the soft limit has no
hierarchical meaning. All configured groups are organized in a
global rbtree and treated like equal peers, regardless where they
are located in the hierarchy. This makes subtree delegation
impossible. Second, the soft limit reclaim pass is so aggressive
that it not just introduces high allocation latencies into the
system, but also impacts system performance due to overreclaim, to
the point where the feature becomes self-defeating.
The memory.low boundary on the other hand is a top-down allocated
reserve. A cgroup enjoys reclaim protection when it and all its
ancestors are below their low boundaries, which makes delegation
of subtrees possible. Secondly, new cgroups have no reserve per
default and in the common case most cgroups are eligible for the
preferred reclaim pass. This allows the new low boundary to be
efficiently implemented with just a minor addition to the generic
reclaim code, without the need for out-of-band data structures and
reclaim passes. Because the generic reclaim code considers all
cgroups except for the ones running low in the preferred first
reclaim pass, overreclaim of individual groups is eliminated as
well, resulting in much better overall workload performance.
- The original high boundary, the hard limit, is defined as a strict
limit that can not budge, even if the OOM killer has to be called.
But this generally goes against the goal of making the most out of
the available memory. The memory consumption of workloads varies
during runtime, and that requires users to overcommit. But doing
that with a strict upper limit requires either a fairly accurate
prediction of the working set size or adding slack to the limit.
Since working set size estimation is hard and error prone, and
getting it wrong results in OOM kills, most users tend to err on
the side of a looser limit and end up wasting precious resources.
The memory.high boundary on the other hand can be set much more
conservatively. When hit, it throttles allocations by forcing
them into direct reclaim to work off the excess, but it never
invokes the OOM killer. As a result, a high boundary that is
chosen too aggressively will not terminate the processes, but
instead it will lead to gradual performance degradation. The user
can monitor this and make corrections until the minimal memory
footprint that still gives acceptable performance is found.
In extreme cases, with many concurrent allocations and a complete
breakdown of reclaim progress within the group, the high boundary
can be exceeded. But even then it's mostly better to satisfy the
allocation from the slack available in other groups or the rest of
the system than killing the group. Otherwise, memory.max is there
to limit this type of spillover and ultimately contain buggy or
even malicious applications.
- The original control file names are unwieldy and inconsistent in
many different ways. For example, the upper boundary hit count is
exported in the memory.failcnt file, but an OOM event count has to
be manually counted by listening to memory.oom_control events, and
lower boundary / soft limit events have to be counted by first
setting a threshold for that value and then counting those events.
Also, usage and limit files encode their units in the filename.
That makes the filenames very long, even though this is not
information that a user needs to be reminded of every time they
type out those names.
To address these naming issues, as well as to signal clearly that
the new interface carries a new configuration model, the naming
conventions in it necessarily differ from the old interface.
- The original limit files indicate the state of an unset limit with
a very high number, and a configured limit can be unset by echoing
-1 into those files. But that very high number is implementation
and architecture dependent and not very descriptive. And while -1
can be understood as an underflow into the highest possible value,
-2 or -10M etc. do not work, so it's not inconsistent.
memory.low, memory.high, and memory.max will use the string
"infinity" to indicate and set the highest possible value.
[akpm@linux-foundation.org: use seq_puts() for basic strings]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The unified hierarchy interface for memory cgroups will no longer use "-1"
to mean maximum possible resource value. In preparation for this, make
the string an argument and let the caller supply it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Especially on 32 bit kernels memory node ranges are printed with 32 bit
wide addresses only. Use u64 types and %llx specifiers to print full
width of addresses.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Use BUILD_BUG_ON() to compile assert that memcg string tables are in sync
with corresponding enums. There aren't currently any issues with these
tables. This is just defensive.
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since commit b2052564e66d ("mm: memcontrol: continue cache reclaim from
offlined groups") pages charged to a memory cgroup are not reparented when
the cgroup is removed. Instead, they are supposed to be reclaimed in a
regular way, along with pages accounted to online memory cgroups.
However, an lruvec of an offline memory cgroup will sooner or later get so
small that it will be scanned only at low scan priorities (see
get_scan_count()). Therefore, if there are enough reclaimable pages in
big lruvecs, pages accounted to offline memory cgroups will never be
scanned at all, wasting memory.
Fix this by unconditionally forcing scanning dead lruvecs from kswapd.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Although it was not called, destroy_compound_page() did some potentially
useful checks. Let's re-introduce them in free_pages_prepare(), where
they can be actually triggered when CONFIG_DEBUG_VM=y.
compound_order() assert is already in free_pages_prepare(). We have few
checks for tail pages left.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The only caller is __free_one_page(). By the time we should have
page->flags to be cleared already:
- for 0-order pages though PCP list:
free_hot_cold_page()
free_pages_prepare()
free_pages_check()
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
<put the page to PCP list>
free_pcppages_bulk()
page = <withdraw pages from PCP list>
__free_one_page(page)
- for non-0-order pages:
__free_pages_ok()
free_pages_prepare()
free_pages_check()
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
free_one_page()
__free_one_page()
So there's no way PageCompound() will return true in __free_one_page().
Let's remove dead destroy_compound_page() and put assert for page->flags
there instead.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
next_zones_zonelist() returns a zoneref pointer, as well as a zone pointer
via extra parameter. Since the latter can be trivially obtained by
dereferencing the former, the overhead of the extra parameter is
unjustified.
This patch thus removes the zone parameter from next_zones_zonelist().
Both callers happen to be in the same header file, so it's simple to add
the zoneref dereference inline. We save some bytes of code size.
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105)
function old new delta
nr_free_zone_pages 129 115 -14
__alloc_pages_nodemask 2300 2285 -15
get_page_from_freelist 2652 2576 -76
add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10)
function old new delta
try_to_compact_pages 569 579 +10
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Expand the usage of the struct alloc_context introduced in the previous
patch also for calling try_to_compact_pages(), to reduce the number of its
parameters. Since the function is in different compilation unit, we need
to move alloc_context definition in the shared mm/internal.h header.
With this change we get simpler code and small savings of code size and stack
usage:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
function old new delta
__alloc_pages_direct_compact 283 256 -27
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
function old new delta
try_to_compact_pages 582 569 -13
Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
scripts/checkstack.pl).
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Introduce struct alloc_context to accumulate the numerous parameters
passed between the alloc_pages* family of functions and
get_page_from_freelist(). This excludes gfp_flags and alloc_info, which
mutate too much along the way, and allocation order, which is conceptually
different.
The result is shorter function signatures, as well as overal code size and
stack usage reductions.
bloat-o-meter:
add/remove: 0/0 grow/shrink: 1/2 up/down: 127/-310 (-183)
function old new delta
get_page_from_freelist 2525 2652 +127
__alloc_pages_direct_compact 329 283 -46
__alloc_pages_nodemask 2564 2300 -264
checkstack.pl:
function old new
__alloc_pages_nodemask 248 200
get_page_from_freelist 168 184
__alloc_pages_direct_compact 40 24
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The possibility of replacing the numerous parameters of alloc_pages*
functions with a single structure has been discussed when Minchan proposed
to expand the x86 kernel stack [1]. This series implements the change,
along with few more cleanups/microoptimizations.
The series is based on next-20150108 and I used gcc 4.8.3 20140627 on
openSUSE 13.2 for compiling. Config includess NUMA and COMPACTION.
The core change is the introduction of a new struct alloc_context, which looks
like this:
struct alloc_context {
struct zonelist *zonelist;
nodemask_t *nodemask;
struct zone *preferred_zone;
int classzone_idx;
int migratetype;
enum zone_type high_zoneidx;
};
All the contents is mostly constant, except that __alloc_pages_slowpath()
changes preferred_zone, classzone_idx and potentially zonelist. But
that's not a problem in case control returns to retry_cpuset: in
__alloc_pages_nodemask(), those will be reset to initial values again
(although it's a bit subtle). On the other hand, gfp_flags and alloc_info
mutate so much that it doesn't make sense to put them into alloc_context.
Still, the result is one parameter instead of up to 7. This is all in
Patch 2.
Patch 3 is a step to expand alloc_context usage out of page_alloc.c
itself. The function try_to_compact_pages() can also much benefit from
the parameter reduction, but it means the struct definition has to be
moved to a shared header.
Patch 1 should IMHO be included even if the rest is deemed not useful
enough. It improves maintainability and also has some code/stack
reduction. Patch 4 is OTOH a tiny optimization.
Overall bloat-o-meter results:
add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-460 (-460)
function old new delta
nr_free_zone_pages 129 115 -14
__alloc_pages_direct_compact 329 256 -73
get_page_from_freelist 2670 2576 -94
__alloc_pages_nodemask 2564 2285 -279
try_to_compact_pages 582 579 -3
Overall stack sizes per ./scripts/checkstack.pl:
old new delta
get_page_from_freelist: 184 184 0
__alloc_pages_nodemask 248 200 -48
__alloc_pages_direct_c 40 - -40
try_to_compact_pages 72 72 0
-88
[1] http://marc.info/?l=linux-mm&m=140142462528257&w=2
This patch (of 4):
prep_new_page() sets almost everything in the struct page of the page
being allocated, except page->pfmemalloc. This is not obvious and has at
least once led to a bug where page->pfmemalloc was forgotten to be set
correctly, see commit 8fb74b9fb2b1 ("mm: compaction: partially revert
capture of suitable high-order page").
This patch moves the pfmemalloc setting to prep_new_page(), which means it
needs to gain alloc_flags parameter. The call to prep_new_page is moved
from buffered_rmqueue() to get_page_from_freelist(), which also leads to
simpler code. An obsolete comment for buffered_rmqueue() is replaced.
In addition to better maintainability there is a small reduction of code
and stack usage for get_page_from_freelist(), which inlines the other
functions involved.
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-145 (-145)
function old new delta
get_page_from_freelist 2670 2525 -145
Stack usage is reduced from 184 to 168 bytes.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
32-bit sparc uses swap instruction to implement set_pte(). It called
using GCC inline assembler. But it misses the "memory" clobber to
indicate that pte value will be updated in memory.
As result GCC doesn't know that it cannot postpone pte pointer dereference
which occurs before set_pte() to post-set_pte() time.
It leads to real-world bugs -- [1]. In this situation we have code:
ptent = ptep_modify_prot_start(mm, addr, pte);
ptent = pte_modify(ptent, newprot);
...
ptep_modify_prot_commit(mm, addr, pte, ptent);
ptep_modify_prot_start() in sparc case is just 'pte' dereference plus
pte_clear(). pte_clear() calls broken set_pte(). GCC thinks it's valid
to dereference 'pte' again on pte_modify() and gets cleared pte.
ptep_modify_prot_commit() puts 'pteent' with pfn==0 back to page table,
which eventually leads to the crash.
[1] http://lkml.kernel.org/r/54C06B19.8060305@roeck-us.net
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If __unmap_hugepage_range() tries to unmap the address range over which
hugepage migration is on the way, we get the wrong page because pte_page()
doesn't work for migration entries. This patch simply clears the pte for
migration entries as we do for hwpoison entries.
Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is a race condition between hugepage migration and
change_protection(), where hugetlb_change_protection() doesn't care about
migration entries and wrongly overwrites them. That causes unexpected
results like kernel crash. HWPoison entries also can cause the same
problem.
This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this
function to do proper actions.
Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When running the test which causes the race as shown in the previous patch,
we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().
This race happens when pte turns into migration entry just after the first
check of is_hugetlb_entry_migration() in hugetlb_fault() passed with false.
To fix this, we need to check pte_present() again after huge_ptep_get().
This patch also reorders taking ptl and doing pte_page(), because
pte_page() should be done in ptl. Due to this reordering, we need use
trylock_page() in page != pagecache_page case to respect locking order.
Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [3.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We have a race condition between move_pages() and freeing hugepages, where
move_pages() calls follow_page(FOLL_GET) for hugepages internally and
tries to get its refcount without preventing concurrent freeing. This
race crashes the kernel, so this patch fixes it by moving FOLL_GET code
for hugepages into follow_huge_pmd() with taking the page table lock.
This patch intentionally removes page==NULL check after pte_page.
This is justified because pte_page() never returns NULL for any
architectures or configurations.
This patch changes the behavior of follow_huge_pmd() for tail pages and
then tail pages can be pinned/returned. So the caller must be changed to
properly handle the returned tail pages.
We could have a choice to add the similar locking to
follow_huge_(addr|pud) for consistency, but it's not necessary because
currently these functions don't support FOLL_GET flag, so let's leave it
for future development.
Here is the reproducer:
$ cat movepages.c
#include <stdio.h>
#include <stdlib.h>
#include <numaif.h>
#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
#define PS 0x1000
int main(int argc, char *argv[]) {
int i;
int nr_hp = strtol(argv[1], NULL, 0);
int nr_p = nr_hp * HPS / PS;
int ret;
void **addrs;
int *status;
int *nodes;
pid_t pid;
pid = strtol(argv[2], NULL, 0);
addrs = malloc(sizeof(char *) * nr_p + 1);
status = malloc(sizeof(char *) * nr_p + 1);
nodes = malloc(sizeof(char *) * nr_p + 1);
while (1) {
for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 1;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");
for (i = 0; i < nr_p; i++) {
addrs[i] = (void *)ADDR_INPUT + i * PS;
nodes[i] = 0;
status[i] = 0;
}
ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
MPOL_MF_MOVE_ALL);
if (ret == -1)
err("move_pages");
}
return 0;
}
$ cat hugepage.c
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#define ADDR_INPUT 0x700000000000UL
#define HPS 0x200000
int main(int argc, char *argv[]) {
int nr_hp = strtol(argv[1], NULL, 0);
char *p;
while (1) {
p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
if (p != (void *)ADDR_INPUT) {
perror("mmap");
break;
}
memset(p, 0, nr_hp * HPS);
munmap(p, nr_hp * HPS);
}
}
$ sysctl vm.nr_hugepages=40
$ ./hugepage 10 &
$ ./movepages 10 $(pgrep -f hugepage)
Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Migrating hugepages and hwpoisoned hugepages are considered as non-present
hugepages, and they are referenced via migration entries and hwpoison
entries in their page table slots.
This behavior causes race condition because pmd_huge() doesn't tell
non-huge pages from migrating/hwpoisoned hugepages. follow_page_mask() is
one example where the kernel would call follow_page_pte() for such
hugepage while this function is supposed to handle only normal pages.
To avoid this, this patch makes pmd_huge() return true when pmd_none() is
true *and* pmd_present() is false. We don't have to worry about mixing up
non-present pmd entry with normal pmd (pointing to leaf level pte entry)
because pmd_present() is true in normal pmd.
The same race condition could happen in (x86-specific) gup_pmd_range(),
where this patch simply adds pmd_present() check instead of pmd_huge().
This is because gup_pmd_range() is fast path. If we have non-present
hugepage in this function, we will go into gup_huge_pmd(), then return 0
at flag mask check, and finally fall back to the slow path.
Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: <stable@vger.kernel.org> [2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently we have many duplicates in definitions around
follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this
patch tries to remove the m. The basic idea is to put the default
implementation for these functions in mm/hugetlb.c as weak symbols
(regardless of CONFIG_ARCH_WANT_GENERAL_HUGETL B), and to implement
arch-specific code only when the arch needs it.
For follow_huge_addr(), only powerpc and ia64 have their own
implementation, and in all other architectures this function just returns
ERR_PTR(-EINVAL). So this patch sets returning ERR_PTR(-EINVAL) as
default.
As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to
always return 0 in your architecture (like in ia64 or sparc,) it's never
called (the callsite is optimized away) no matter how implemented it is.
So in such architectures, we don't need arch-specific implementation.
In some architecture (like mips, s390 and tile,) their current
arch-specific follow_huge_(pmd|pud)() are effectively identical with the
common code, so this patch lets these architecture use the common code.
One exception is metag, where pmd_huge() could return non-zero but it
expects follow_huge_pmd() to always return NULL. This means that we need
arch-specific implementation which returns NULL. This behavior looks
strange to me (because non-zero pmd_huge() implies that the architecture
supports PMD-based hugepage, so follow_huge_pmd() can/should return some
relevant value,) but that's beyond this cleanup patch, so let's keep it.
Justification of non-trivial changes:
- in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
is true when follow_huge_pmd() can be called (note that pmd_huge() has
the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
- in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common
code. This patch forces these archs use PMD_MASK, but it's OK because
they are identical in both archs.
In s390, both of HPAGE_SHIFT and PMD_SHIFT are 20.
In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
PMD_SHIFT is define as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
PTE_ORDER is always 0, so these are identical.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Kswapd in balance_pgdate() currently uses wake_up() on processes waiting
in throttle_direct_reclaim(), which only wakes up a single process. This
might leave processes waiting for longer than necessary, until the check
is reached in the next loop iteration. Processes might also be left
waiting if zone was fully balanced in single iteration. Note that the
comment in balance_pgdat() also says "Wake them", so waking up a single
process does not seem intentional.
Thus, replace wake_up() with wake_up_all().
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Found it when I want to jump to the definition of MIGRATE_RESERVE ctags.
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now kmemcheck_pagealloc_alloc() is only called by __alloc_pages_slowpath().
__alloc_pages_nodemask()
__alloc_pages_slowpath()
kmemcheck_pagealloc_alloc()
And the page will not be tracked by kmemcheck in the following path.
__alloc_pages_nodemask()
get_page_from_freelist()
So move kmemcheck_pagealloc_alloc() into __alloc_pages_nodemask(),
like this:
__alloc_pages_nodemask()
...
get_page_from_freelist()
if (!page)
__alloc_pages_slowpath()
kmemcheck_pagealloc_alloc()
...
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
__alloc_pages_nodemask() strips __GFP_IO when retrying the page
allocation. But it does this by altering the function-wide variable
gfp_mask. This will cause subsequent allocation attempts to inadvertently
use the modified gfp_mask.
Also, pass the correct mask (the mask we actually used) into
trace_mm_page_alloc().
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The complexity of memcg page stat synchronization is currently leaking
into the callsites, forcing them to keep track of the move_lock state and
the IRQ flags. Simplify the API by tracking it in the memcg.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The body of this function was removed by commit 0a31bc97c80c ("mm:
memcontrol: rewrite uncharge API").
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
OOM killer tries to exclude tasks which do not have mm_struct associated
because killing such a task wouldn't help much. The OOM victim gets
TIF_MEMDIE set to disable OOM killer while the current victim releases the
memory and then enables the OOM killer again by dropping the flag.
oom_kill_process is currently prone to a race condition when the OOM
victim is already exiting and TIF_MEMDIE is set after the task releases
its address space. This might theoretically lead to OOM livelock if the
OOM victim blocks on an allocation later during exiting because it
wouldn't kill any other process and the exiting one won't be able to exit.
The situation is highly unlikely because the OOM victim is expected to
release some memory which should help to sort out OOM situation.
Fix this by checking task->mm and setting TIF_MEMDIE flag under task_lock
which will serialize the OOM killer with exit_mm which sets task->mm to
NULL. Setting the flag for current is not necessary because check and set
is not racy.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
out_of_memory() doesn't trigger the OOM killer if the current task is
already exiting or it has fatal signals pending, and gives the task
access to memory reserves instead. However, doing so is wrong if
out_of_memory() is called by an allocation (e.g. from exit_task_work())
after the current task has already released its memory and cleared
TIF_MEMDIE at exit_mm(). If we again set TIF_MEMDIE to post-exit_mm()
current task, the OOM killer will be blocked by the task sitting in the
final schedule() waiting for its parent to reap it. It will trigger an
OOM livelock if its parent is unable to reap it due to doing an
allocation and waiting for the OOM killer to kill it.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can
detect zero_page in /proc/kpageflags, and then do memory analysis more
accurately.
Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add VM_BUG_ON_PAGE() for slab pages. _mapcount is an union with slab
struct in struct page, so we must avoid accessing _mapcount if this page
is a slab page. Also remove the unneeded bracket.
Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, we use lru.next/lru.prev plus cast to access or set
destructor and order of compound page.
Let's replace it with explicit fields in struct page.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull documentation updates from Jonathan Corbet:
"Highlights this time around include:
- A thrashing of SubmittingPatches to bring it out of the "send
everything to Linus" era of kernel development.
- A new document on completions from Nicholas McGuire
- Lots of typo fixes, formatting improvements, corrections, build
fixes, and more"
* tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (35 commits)
Documentation: Fix the wrong command `echo -1 > set_ftrace_pid` for cleaning the filter.
can-doc: Fixed a wrong filepath in can.txt
Documentation: Fix trivial typo in comment.
kgdb,docs: Fix typo and minor style issues
Documentation: add description for FTRACE probe status
doc: brief user documentation for completion
Documentation/misc-devices/mei: Fix indentation of embedded code.
Documentation/misc-devices/mei: Fix indentation of enumeration.
Documentation/misc-devices/mei: Fix spacing around parentheses.
Documentation/misc-devices/mei: Fix formatting of headings.
Documentation: devicetree: Fix double words in Doumentation/devicetree
Documentation: mm: Fix typo in vm.txt
lockstat: Add documentation on contention and contenting points
Documentation: fix blackfin gptimers-example build errors
Fixes column alignment in table of contents entry 1.9 in Documentation/filesystems/proc.txt
CodingStyle: enable emacs display of trailing whitespace
DocBook: Do not exceed argument list limit
gpio: board.txt: Fix the gpio name example
Documentation/SubmittingPatches: unify whitespace/tabs for the DCO
MAINTAINERS: Add the docs-next git tree to the maintainer entry
...
|
|
git://git.linaro.org/landing-teams/working/fujitsu/integration
Pull mailbox framework updates from Jassi Brar.
* 'mailbox-devel' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
mailbox: Add Altera mailbox driver
mailbox: check for bit set before polling
Mailbox: Fix return value check in pcc_init()
|
|
Using the IPCB() macro to get the IPv4 options is convenient, but
unfortunately NetLabel often needs to examine the CIPSO option outside
of the scope of the IP layer in the stack. While historically IPCB()
worked above the IP layer, due to the inclusion of the inet_skb_param
struct at the head of the {tcp,udp}_skb_cb structs, recent commit
971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
reordered the tcp_skb_cb struct and invalidated this IPCB() trick.
This patch fixes the problem by creating a new function,
cipso_v4_optptr(), which locates the CIPSO option inside the IP header
without calling IPCB(). Unfortunately, this isn't as fast as a simple
lookup so some additional tweaks were made to limit the use of this
new function.
Cc: <stable@vger.kernel.org> # 3.18
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pincontrol updates from Linus Walleij:
:This is the bulk of pin control changes for the v3.20 cycle:
Framework changes and enhancements:
- Passing -DDEBUG recursively to subdir drivers so we get debug
messages properly turned on.
- Infer map type from DT property in the groups parsing code in the
generic pinconfig code.
- Support for custom parameter passing in generic pin config. This
is used when you are using the generic pin config, but want to add
a few custom properties that no other driver will use.
New drivers:
- Driver for the Xilinx Zynq
- Driver for the AmLogic Meson SoCs
New features in drivers:
- Sleep support (suspend/resume) for the Cherryview driver
- mvebeu a38x can now mux a UART on pins MPP19 and MPP20
- Migrated the qualcomm driver to generic pin config handling of
extended config options in the core code.
- Support BUS1 and AUDIO in the Exynos pin controller.
- Add some missing functions in the sun6i driver.
- Add support for the A31S variant in the sun6i driver.
- EMEv2 support in the Renesas PFC driver.
- Add support for Qualcomm MSM8916 in the qcom driver.
Deleted features
- Drop support for the SiRF Marco that was never released to the
market.
- Drop SH7372 support as the support for this platform is removed
from the kernel"
* tag 'pinctrl-v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (40 commits)
sh-pfc: emev2 - Fix mangled author name
pinctrl: cherryview: Configure HiZ pins to be input when requested as GPIOs
pinctrl: imx25: fix numbering for pins
pinctrl: pinctrl-imx: don't use invalid value of conf_reg
pinctrl: qcom: delete pin_config_get/set pinconf operations
pinctrl: qcom: Add msm8916 pinctrl driver
DT: pinctrl: Document Qualcomm MSM8916 pinctrl binding
pinctrl: qcom: increase variable size for register offsets
pinctrl: hide PCONFDUMP in #ifdef
pinctrl: rockchip: Only mask interrupts; never disable
pinctrl: zynq: Fix usb0 pins
pinctrl: sh-pfc: sh7372: Remove DT binding documentation
pinctrl: sh-pfc: sh7372: Remove PFC support
sh-pfc: Add emev2 pinmux support
sh-pfc: add macro to define pinmux without function
pinctrl: add driver for Amlogic Meson SoCs
staging: drivers: pinctrl: Fixed checkpatch.pl warnings
pinctrl: exynos: Add AUDIO pin controller for exynos7
sh-pfc: r8a7790: add MLB+ pin group
sh-pfc: r8a7791: add MLB+ pin group
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO changes from Linus Walleij:
"This is the GPIO bulk changes for the v3.20 series:
GPIOLIB core changes:
- Create and use of_mm_gpiochip_remove() for removing memory-mapped
OF GPIO chips
- GPIO MMIO library suppports bgpio_set_multiple for switching
several lines at once, a feature merged in the last cycle.
New drivers:
- New driver for the APM X-gene standby GPIO controller
- New driver for the Fujitsu MB86S7x GPIO controller
Cleanups:
- Moved rcar driver to use gpiolib irqchip
- Moxart converted to the GPIO MMIO library
- GE driver converted to GPIO MMIO library
- Move sx150x to irqdomain
- Move max732x to irqdomain
- Move vx855 to use managed resources
- Move dwapb to use managed resources
- Clean tc3589x from platform data
- Clean stmpe driver to use device tree only probe
New subtypes:
- sx1506 support in the sx150x driver
- Quark 1000 SoC support in the SCH driver
- Support X86 in the Xilinx driver
- Support PXA1928 in the PXA driver
Extended drivers:
- max732x supports device tree probe
- sx150x supports device tree probe
Various minor cleanups and bug fixes"
* tag 'gpio-v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (61 commits)
gpio: kconfig: replace PPC_OF with PPC
gpio: pxa: add PXA1928 gpio type support
dt/bindings: gpio: add compatible string for marvell,pxa1928-gpio
gpio: pxa: remove mach IRQ includes
gpio: max732x: use an inline function for container cast
gpio: use sizeof() instead of hardcoded values
gpio: max732x: add set_multiple function
gpio: sch: Consolidate similar algorithms
gpio: tz1090-pdc: Use resource_size to fix off-by-one resource size calculation
gpio: ge: Convert to use devm_kstrdup
gpio: correctly use const char * const
gpio: sx150x: fixup OF support
gpio: mpc8xxx: Use of_mm_gpiochip_remove
gpio: Add Fujitsu MB86S7x GPIO driver
gpio: mpc8xxx: Convert to platform device interface.
gpio: zevio: Use of_mm_gpiochip_remove
gpio: gpio-mm-lantiq: Use of_mm_gpiochip_remove
gpio: gpio-mm-lantiq: Use of_property_read_u32
gpio: gpio-mm-lantiq: Do not replicate code
gpio :gpio-mm-lantiq: Use devm_kzalloc
...
|
|
Pull MMC updates from Ulf Hansson:
"MMC core:
- Support for MMC power sequences.
- SDIO function devicetree subnode parsing.
- Refactor the hardware reset routines and enable it for SD cards.
- Various code quality improvements, especially for slot-gpio.
MMC host:
- dw_mmc: Various fixes and cleanups.
- dw_mmc: Convert to mmc_send_tuning().
- moxart: Fix probe logic.
- sdhci: Various fixes and cleanups
- sdhci: Asynchronous request handling support.
- sdhci-pxav3: Various fixes and cleanups.
- sdhci-tegra: Fixes for T114, T124 and T132.
- rtsx: Various fixes and cleanups.
- rtsx: Support for SDIO.
- sdhi/tmio: Refactor and cleanup of header files.
- omap_hsmmc: Use slot-gpio and common MMC DT parser.
- Make all hosts to deal with errors from mmc_of_parse().
- sunxi: Various fixes and cleanups.
- sdhci: Support for Fujitsu SDHCI controller f_sdh30"
* tag 'mmc-v3.20-1' of git://git.linaro.org/people/ulf.hansson/mmc: (117 commits)
mmc: sdhci-s3c: solve problem with sleeping in atomic context
mmc: pwrseq: add driver for emmc hardware reset
mmc: moxart: fix probe logic
mmc: core: Invoke mmc_pwrseq_post_power_on() prior MMC_POWER_ON state
mmc: pwrseq_simple: Add optional reference clock support
mmc: pwrseq: Document optional clock for the simple power sequence
mmc: pwrseq_simple: Extend to support more pins
mmc: pwrseq: Document that simple sequence support more than one GPIO
mmc: Add hardware dependencies for sdhci-pxav3 and sdhci-pxav2
mmc: sdhci-pxav3: Modify clock settings for the SDR50 and DDR50 modes
mmc: sdhci-pxav3: Extend binding with SDIO3 conf reg for the Armada 38x
mmc: sdhci-pxav3: Fix Armada 38x controller's caps according to erratum ERR-7878951
mmc: sdhci-pxav3: Fix SDR50 and DDR50 capabilities for the Armada 38x flavor
mmc: sdhci: switch voltage before sdhci_set_ios in runtime resume
mmc: tegra: Write xfer_mode, CMD regs in together
mmc: Resolve BKOPS compatability issue
mmc: sdhci-pxav3: fix setting of pdata->clk_delay_cycles
mmc: dw_mmc: rockchip: remove incorrect __exit_p()
mmc: dw_mmc: exynos: remove incorrect __exit_p()
mmc: Fix menuconfig alignment of MMC_SDHCI_* options
...
|
|
This one was driving me mad, with several lines of warnings during the
allmodconfig build for a single bogus pointer cast. The warning was so
verbose due to the indirect macro expansion explanation, and the whole
thing was just for a debug printout.
The bogus pointer-to-integer cast was pointless anyway, so just remove
it, and use '%p' to show the pointer.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull first round of SCSI updates from James Bottomley:
"This is the usual grab bag of driver updates (hpsa, storvsc, mp2sas,
megaraid_sas, ses) plus an assortment of minor updates.
There's also an update to ufs which adds new phy drivers and finally a
new logging infrastructure for SCSI"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (114 commits)
scsi_logging: return void for dev_printk() functions
scsi: print single-character strings with seq_putc
scsi: merge consecutive seq_puts calls
scsi: replace seq_printf with seq_puts
aha152x: replace seq_printf with seq_puts
advansys: replace seq_printf with seq_puts
scsi: remove SPRINTF macro
sg: remove an unused variable
hpsa: Use local workqueues instead of system workqueues
hpsa: add in P840ar controller model name
hpsa: add in gen9 controller model names
hpsa: detect and report failures changing controller transport modes
hpsa: shorten the wait for the CISS doorbell mode change ack
hpsa: refactor duplicated scan completion code into a new routine
hpsa: move SG descriptor set-up out of hpsa_scatter_gather()
hpsa: do not use function pointers in fast path command submission
hpsa: print CDBs instead of kernel virtual addresses for uncommon errors
hpsa: do not use a void pointer for scsi_cmd field of struct CommandList
hpsa: return failed from device reset/abort handlers
hpsa: check for ctlr lockup after command allocation in main io path
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
"The first round of updates for the input subsystem.
A few new drivers (power button handler for AXP20x PMIC, tps65218
power button driver, sun4i keys driver, regulator haptic driver, NI
Ettus Research USRP E3x0 button, Alwinner A10/A20 PS/2 controller).
Updates to Synaptics and ALPS touchpad drivers (with more to come
later), brand new Focaltech PS/2 support, update to Cypress driver to
handle Gen5 (in addition to Gen3) devices, and number of other fixups
to various drivers as well as input core"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (54 commits)
Input: elan_i2c - fix wrong %p extension
Input: evdev - do not queue SYN_DROPPED if queue is empty
Input: gscps2 - fix MODULE_DEVICE_TABLE invocation
Input: synaptics - use dmax in input_mt_assign_slots
Input: pxa27x_keypad - remove unnecessary ARM includes
Input: ti_am335x_tsc - replace delta filtering with median filtering
ARM: dts: AM335x: Make charge delay a DT parameter for TSC
Input: ti_am335x_tsc - read charge delay from DT
Input: ti_am335x_tsc - remove udelay in interrupt handler
Input: ti_am335x_tsc - interchange touchscreen and ADC steps
Input: MT - add support for balanced slot assignment
Input: drv2667 - remove wrong and unneeded drv2667-haptics modalias
Input: drv260x - remove wrong and unneeded drv260x-haptics modalias
Input: cap11xx - remove wrong and unneeded cap11xx modalias
Input: sun4i-ts - add support for touchpanel controller on A31
Input: serio - add support for Alwinner A10/A20 PS/2 controller
Input: gtco - use sign_extend32() for sign extension
Input: elan_i2c - verify firmware signature applying it
Input: elantech - remove stale comment from Kconfig
Input: cyapa - off by one in cyapa_update_fw_store()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
Pull fbdev changes from Tomi Valkeinen:
- omapdss: add DRA7xxx SoC support
- fbdev: support DMT (Display Monitor Timing) calculation
* tag 'fbdev-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux: (40 commits)
omapfb: Return error code when applying overlay settings fails
OMAPDSS: DPI: DRA7xx support
OMAPDSS: HDMI: Add DRA7xx support
OMAPDSS: DISPC: program dispc polarities to control module
OMAPDSS: DISPC: Add DRA7xx support
OMAPDSS: Add Video PLLs for DRA7xx
OMAPDSS: Add functions for external control of PLL
OMAPDSS: DSS: Add DRA7xx base support
Doc/DT: Add DT binding doc for DRA7xx DSS
OMAPDSS: add define for DRA7xx HW version
OMAPDSS: encoder-tpd12s015: Fix race issue with LS_OE
OMAPDSS: OMAP5: fix digit output's allowed mgrs
OMAPDSS: constify port arrays
OMAPDSS: PLL: add dss_pll_wait_reset_done()
OMAPDSS: Add enum dss_pll_id
video: fbdev: fix sys_copyarea
video/mmpfb: allow modular build
fb: via: turn gpiolib and i2c selects into dependencies
fbdev: ssd1307fb: return proper error code if write command fails
fbdev: fix CVT vertical front and back porch values
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
"In this batch, you can find lots of cleanups through the whole
subsystem, as our good New Year's resolution. Lots of LOCs and
commits are about LINE6 driver that was promoted finally from staging
tree, and as usual, there've been widely spread ASoC changes.
Here some highlights:
ALSA core changes
- Embedding struct device into ALSA core structures
- sequencer core cleanups / fixes
- PCM msbits constraints cleanups / fixes
- New SNDRV_PCM_TRIGGER_DRAIN command
- PCM kerneldoc fixes, header cleanups
- PCM code cleanups using more standard codes
- Control notification ID fixes
Driver cleanups
- Cleanups of PCI PM callbacks
- Timer helper usages cleanups
- Simplification (e.g. argument reduction) of many driver codes
HD-audio
- Hotkey and LED support on HP laptops with Realtek codecs
- Dock station support on HP laptops
- Toshiba Satellite S50D fixup
- Enhanced wallclock timestamp handling for HD-audio
- Componentization to simplify the linkage between i915 and hd-audio
drivers for Intel HDMI/DP
USB-audio
- Akai MPC Element support
- Enhanced timestamp handling
ASoC
- Lots of refactoringin ASoC core, moving drivers to more data driven
initialization and rationalizing a lot of DAPM usage
- Much improved handling of CDCLK clocks on Samsung I2S controllers
- Lots of driver specific cleanups and feature improvements
- CODEC support for TI PCM514x and TLV320AIC3104 devices
- Board support for Tegra systems with Realtek RT5677
- New driver for Maxim max98357a
- More enhancements / fixes for Intel SST driver
Others
- Promotion of LINE6 driver from staging along with lots of rewrites
and cleanups
- DT support for old non-ASoC atmel driver
- oxygen cleanups, XIO2001 init, Studio Evolution SE6x support
- Emu8000 DRAM size detection fix on ISA(!!) AWE64 boards
- A few more ak411x fixes for ice1724 boards"
* tag 'sound-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (542 commits)
ALSA: line6: toneport: Use explicit type for firmware version
ALSA: line6: Use explicit type for serial number
ALSA: line6: Return EIO if read/write not successful
ALSA: line6: Return error if device not responding
ALSA: line6: Add delay before reading status
ASoC: Intel: Clean data after SST fw fetch
ALSA: hda - Add docking station support for another HP machine
ALSA: control: fix failure to return new numerical ID in 'replace' event data
ALSA: usb: update trigger timestamp on first non-zero URB submitted
ALSA: hda: read trigger_timestamp immediately after starting DMA
ALSA: pcm: allow for trigger_tstamp snapshot in .trigger
ALSA: pcm: don't override timestamp unconditionally
ALSA: off by one bug in snd_riptide_joystick_probe()
ASoC: rt5670: Set use_single_rw flag for regmap
ASoC: rt286: Add rt288 codec support
ASoC: max98357a: Fix build in !CONFIG_OF case
ASoC: Intel: fix platform_no_drv_owner.cocci warnings
ARM: dts: Switch Odroid X2/U2 to simple-audio-card
ARM: dts: Exynos4 and Odroid X2/U3 sound device nodes update
ALSA: control: fix failure to return numerical ID in 'add' event
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- Some documentation updates and a few new pixel formats
- Stop btcx-risc abuse by cx88 and move it to bt8xx driver
- New platform driver: am437x
- New webcam driver: toptek
- New remote controller hardware protocols added to img-ir driver
- Removal of a few very old drivers that relies on old kABIs and are
for very hard to find hardware: parallel port webcam drivers
(bw-qcam, c-cam, pms and w9966), tlg2300, Video In/Out for SGI (vino)
- Removal of the USB Telegent driver (tlg2300). The company that
developed this driver has long gone and the hardware is hard to find.
As it relies on a legacy set of kABI symbols and nobody seems to care
about it, remove it.
- several improvements at rtl2832 driver
- conversion on cx28521 and au0828 to use videobuf2 (VB2)
- several improvements, fixups and board additions
* tag 'media/v3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (321 commits)
[media] dvb_net: Convert local hex dump to print_hex_dump_debug
[media] dvb_net: Use standard debugging facilities
[media] dvb_net: Use vsprintf %pM extension to print Ethernet addresses
[media] staging: lirc_serial: adjust boolean assignments
[media] stb0899: use sign_extend32() for sign extension
[media] si2168: add support for 1.7MHz bandwidth
[media] si2168: return error if set_frontend is called with invalid parameters
[media] lirc_dev: avoid potential null-dereference
[media] mn88472: simplify bandwidth registers setting code
[media] dvb: tc90522: re-add symbol-rate report
[media] lmedm04: add read snr, signal strength and ber call backs
[media] lmedm04: Create frontend call back for read status
[media] lmedm04: create frontend callbacks for signal/snr/ber/ucblocks
[media] lmedm04: Fix usb_submit_urb BOGUS urb xfer, pipe 1 != type 3 in interrupt urb
[media] lmedm04: Increase Interupt due time to 200 msec
[media] cx88-dvb: whitespace cleanup
[media] rtl28xxu: properly initialize pdata
[media] rtl2832: declare functions as static
[media] rtl2830: declare functions as static
[media] rtl2832_sdr: add kernel-doc comments for platform_data
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi
Pull HSI fix from Sebastian Reichel:
"Fix uninitialized device pointer in nokia-modem"
* tag 'hsi-for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi:
hsi: nokia-modem: fix uninitialized device pointer
|
|
Pull power supply and reset changes from Sebastian Reichel:
"New drivers:
- charger driver for Maxim 77693
- battery gauge driver for LTC 2941/2943
- battery gauge driver for RT5033
- reset driver for R-Mobile platforms
Convert drivers to restart handler framework:
- arm-versatile
- at91
- st-poweroff
Misc:
- remove deprecated sun6i reboot driver
- use alarmtimer instead of rtc in charger-manager
- misc fixes"
* tag 'for-v3.20' of git://git.infradead.org/battery-2.6: (48 commits)
power_supply: 88pm860x: Fix leaked power supply on probe fail
power/reset: restart-poweroff: Remove arm dependencies
power/reset: st-poweroff: Fix misleading Kconfig description
power/reset: st-poweroff: Register with kernel restart handler
power/reset: Remove sun6i reboot driver
power/reset: at91: Register with kernel restart handler
power/reset: arm-versatile: Register with kernel restart handler
power: test_power: Use enum as index for array of supplies
Add devicetree binding documentation for the LTC2941/LTC2943 driver
Add LTC2941/LTC2943 Battery Gauge Driver
power/reset: brcmstb: Add support for old 65nm chips
power/reset: brcmstb: Use the DT "compatible" string to indicate bit positions
power/reset: brcmstb: Make the driver buildable on MIPS
power: charger-manager: Use alarmtimer for battery monitoring in suspend.
power/reset: at91-poweroff: Fix error handling and other compiler warnings
bq27x00_battery: Call power_supply_changed only when capacity changed
bq27x00_battery: fix register offset for bq27425
power: max14577: Remove SYSFS dependency from Kconfig
power: bq24190_charger: suppress build warning
power: reset: Add reset driver for R-Mobile platforms
...
|
|
When an application connects to the ring buffer via splice, it can only
read full pages. Splice does not work with partial pages. If there is
not enough data to fill a page, the splice command will either block
or return -EAGAIN (if set to nonblock).
Code was added where if the page is not full, to just sleep again.
The problem is, it will get woken up again on the next event. That
is, when something is written into the ring buffer, if there is a waiter
it will wake it up. The waiter would then check the buffer, see that
it still does not have enough data to fill a page and go back to sleep.
To make matters worse, when the waiter goes back to sleep, it could
cause another event, which would wake it back up again to see it
doesn't have enough data and sleep again. This produces a tremendous
overhead and fills the ring buffer with noise.
For example, recording sched_switch on an idle system for 10 seconds
produces 25,350,475 events!!!
Create another wait queue for those waiters wanting full pages.
When an event is written, it only wakes up waiters if there's a full
page of data. It does not wake up the waiter if the page is not yet
full.
After this change, recording sched_switch on an idle system for 10
seconds produces only 800 events. Getting rid of 25,349,675 useless
events (99.9969% of events!!), is something to take seriously.
Cc: stable@vger.kernel.org # 3.16+
Cc: Rabin Vincent <rabin@rab.in>
Fixes: e30f53aad220 "tracing: Do not busy wait in buffer splice"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
The firmware version is a single byte so have the variable type agree.
Since the address to this member is passed to the read function, using
an int is not even portable.
Signed-off-by: Chris Rorvick <chris@rorvick.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
The serial number (aka ESN) is a 32-bit value.
Signed-off-by: Chris Rorvick <chris@rorvick.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
Signed-off-by: Chris Rorvick <chris@rorvick.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
Put an upper bound on how long we will wait for the device to respond to
a read/write request (i.e., 100 milliseconds) and return an error if
this is reached.
Signed-off-by: Chris Rorvick <chris@rorvick.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
The device indicates the result of a read/write operation by making the
status available on a subsequent request from the driver. This is not
ready immediately, though, so the driver is currently slamming the
device with hundreds of pointless requests before getting the expected
response. Add a two millisecond delay before each attempt. This is
approximately the behavior observed with version 4.2.7.1 of the Windows
driver.
Signed-off-by: Chris Rorvick <chris@rorvick.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
The BDW audio firmware DSP manages the DMA and the DMA cannot be
stopped exactly at the end of the playback stream. This means
stale samples may be played at PCM stop unless the driver copies
silence to the subsequent periods.
Signed-off-by: Libin Yang <libin.yang@intel.com>
Reviewed-by: Liam Girdwood <liam.r.girdwood@linux.intel.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|