summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2009-09-23/proc/kcore: update stat.st_size after memory hotplugKAMEZAWA Hiroyuki
After memory hotplug (or other events in future), kcore size can be modified. To update inode->i_size, we have to know inode/dentry but we can't get it from inside /proc directly. But considerinyg memory hotplug, kcore image is updated only when it's opened. Then, updating inode->i_size at open() is enough. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23/proc/kcore: fix stat.st_sizeKAMEZAWA Hiroyuki
Presently the size of /proc/kcore which can be read by 'ls -l' is 0. But it's not the correct value. On x86-64, ls -l shows ... root root 140737486266368 2009-09-17 10:29 /proc/kcore Then, 7FFFFFFE02000. This comes from vmalloc area's size. (*) This shows "core" size, not memory size. This patch shows the size by updating "size" field in struct proc_dir_entry. Later, lookup routine will create inode and fill inode->i_size based on this value. Then, this has a problem. - Once inode is cached, inode->i_size will never be updated. Then, this patch is not memory-hotplug-aware. To update inode->i_size, we have to know dentry or inode. But there is no way to lookup them by inside kernel. Hmmm.... Next patch will try it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: more fixes for initKAMEZAWA Hiroyuki
proc_kcore_init() doesn't check NULL case. fix it and remove unnecessary comments. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: register module area in generic wayKAMEZAWA Hiroyuki
Some archs define MODULED_VADDR/MODULES_END which is not in VMALLOC area. This is handled only in x86-64. This patch make it more generic. And we can use vread/vwrite to access the area. Fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: register vmemmap rangeKAMEZAWA Hiroyuki
Benjamin Herrenschmidt <benh@kernel.crashing.org> pointed out that vmemmap range is not included in KCORE_RAM, KCORE_VMALLOC .... This adds KCORE_VMEMMAP if SPARSEMEM_VMEMMAP is used. By this, vmemmap can be readable via /proc/kcore Because it's not vmalloc area, vread/vwrite cannot be used. But the range is static against the memory layout, this patch handles vmemmap area by the same scheme with physical memory. This patch assumes SPARSEMEM_VMEMMAP range is not in VMALLOC range. It's correct now. [akpm@linux-foundation.org: fix typo] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: use registerd physmem informationKAMEZAWA Hiroyuki
For /proc/kcore, each arch registers its memory range by kclist_add(). In usual, - range of physical memory - range of vmalloc area - text, etc... are registered but "range of physical memory" has some troubles. It doesn't updated at memory hotplug and it tend to include unnecessary memory holes. Now, /proc/iomem (kernel/resource.c) includes required physical memory range information and it's properly updated at memory hotplug. Then, it's good to avoid using its own code(duplicating information) and to rebuild kclist for physical memory based on /proc/iomem. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: register text area in generic wayKAMEZAWA Hiroyuki
Some 64bit arch has special segment for mapping kernel text. It should be entried to /proc/kcore in addtion to direct-linear-map, vmalloc area. This patch unifies KCORE_TEXT entry scattered under x86 and ia64. I'm not familiar with other archs (mips has its own even after this patch) but range of [_stext ..._end) is a valid area of text and it's not in direct-map area, defining CONFIG_ARCH_PROC_KCORE_TEXT is only a necessary thing to do. Note: I left mips as it is now. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: register vmalloc area in generic wayKAMEZAWA Hiroyuki
For /proc/kcore, vmalloc areas are registered per arch. But, all of them registers same range of [VMALLOC_START...VMALLOC_END) This patch unifies them. By this. archs which have no kclist_add() hooks can see vmalloc area correctly. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: add kclist typesKAMEZAWA Hiroyuki
Presently, kclist_add() only eats start address and size as its arguments. Considering to make kclist dynamically reconfigulable, it's necessary to know which kclists are for System RAM and which are not. This patch add kclist types as KCORE_RAM KCORE_VMALLOC KCORE_TEXT KCORE_OTHER This "type" is used in a patch following this for detecting KCORE_RAM. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: use usual list for kclistKAMEZAWA Hiroyuki
This patchset is for /proc/kcore. With this, - many per-arch hooks are removed. - /proc/kcore will know really valid physical memory area. - /proc/kcore will be aware of memory hotplug. - /proc/kcore will be architecture independent i.e. if an arch supports CONFIG_MMU, it can use /proc/kcore. (if the arch uses usual memory layout.) This patch: /proc/kcore uses its own list handling codes. It's better to use generic list codes. No changes in logic. just clean up. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23procfs: provide stack information for threadsStefani Seibold
A patch to give a better overview of the userland application stack usage, especially for embedded linux. Currently you are only able to dump the main process/thread stack usage which is showed in /proc/pid/status by the "VmStk" Value. But you get no information about the consumed stack memory of the the threads. There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks the vm mapping where the thread stack pointer reside with "[thread stack xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value information, because libpthread doesn't set the start of the stack to the top of the mapped area, depending of the pthread usage. A sample output of /proc/<pid>/task/<tid>/maps looks like: 08048000-08049000 r-xp 00000000 03:00 8312 /opt/z 08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z 0804a000-0806b000 rw-p 00000000 00:00 0 [heap] a7d12000-a7d13000 ---p 00000000 00:00 0 a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4] a7f13000-a7f14000 ---p 00000000 00:00 0 a7f14000-a7f36000 rw-p 00000000 00:00 0 a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6 a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6 a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6 a806c000-a806f000 rw-p 00000000 00:00 0 a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 a8085000-a8088000 rw-p 00000000 00:00 0 a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack] ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status which will you give the current stack usage in kb. A sample output of /proc/self/status looks like: Name: cat State: R (running) Tgid: 507 Pid: 507 . . . CapBnd: fffffffffffffeff voluntary_ctxt_switches: 0 nonvoluntary_ctxt_switches: 0 Stack usage: 12 kB I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base address of the associated thread stack and not the one of the main process. This makes more sense. [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()] Signed-off-by: Stefani Seibold <stefani@seibold.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23fs/proc/base.c: fix proc_fault_inject_write() input sanity checkVincent Li
Remove obfuscated zero-length input check and return -EINVAL instead of -EIO error to make the error message clear to user. Add whitespace stripping. No functionality changes. The old code: echo 1 > /proc/pid/make-it-fail (ok) echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Input/output error) The new code: echo 1 > /proc/pid/make-it-fail (ok) echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Invalid argument) This patch is conservative in changes to not breaking existing scripts/applications. Signed-off-by: Vincent Li <macli@brc.ubc.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23fs/proc/task_mmu.c v1: fix clear_refs_write() input sanity checkVincent Li
Andrew Morton pointed out similar string hacking and obfuscated check for zero-length input at the end of the function, David Rientjes suggested to use strict_strtol to replace simple_strtol, this patch cover above suggestions, add removing of leading and trailing whitespace from user input. It does not change function behavious. Signed-off-by: Vincent Li <macli@brc.ubc.ca> Acked-by: David Rientjes <rientjes@google.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Amerigo Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23kcore: fix /proc/kcore's stat.st_sizeAmerigo Wang
In 9063c61fd5cbd ("x86, 64-bit: Clean up user address masking") Linus fixed the wrong size of /proc/kcore problem. But its size still looks insane, since it never equals the size of physical memory. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Tao Ma <tao.ma@oracle.com> Cc: <mtk.manpages@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23proc_flush_task: flush /proc/tid/task/pid when a sub-thread exitsOleg Nesterov
The exiting sub-thread flushes /proc/pid only, but this doesn't buy too much: ps and friends mostly use /proc/tid/task/pid. Remove "if (thread_group_leader())" checks from proc_flush_task() path, this means we always remove /proc/tid/task/pid dentry on exit, and this actually matches the comment above proc_flush_task(). The test-case: static void* tfunc(void *arg) { char name[256]; sprintf(name, "/proc/%d/task/%ld/status", getpid(), gettid()); close(open(name, O_RDONLY)); return NULL; } int main(void) { pthread_t t; for (;;) { if (!pthread_create(&t, NULL, &tfunc, NULL)) pthread_join(t, NULL); } } slabtop shows that pid/proc_inode_cache/etc grow quickly and "indefinitely" until the task is killed or shrink_slab() is called, not good. And the main thread needs a lot of time to exit. The same can happen if something like "ps -efL" runs continuously, while some application spawns short-living threads. Reported-by: "James M. Leddy" <jleddy@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dominic Duval <dduval@redhat.com> Cc: Frank Hirtz <fhirtz@redhat.com> Cc: "Fuller, Johnray" <Johnray.Fuller@gs.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Paul Batkowski <pbatkowski@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23proc: fix reported unit for RLIMIT_CPUKees Cook
/proc/$pid/limits should show RLIMIT_CPU as seconds, which is the unit used in kernel/posix-cpu-timers.c: unsigned long psecs = cputime_to_secs(ptime); ... if (psecs >= sig->rlim[RLIMIT_CPU].rlim_max) { ... __group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk); Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23getrusage: fill ru_maxrss valueJiri Pirko
Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater mark. This struct is filled as a parameter to getrusage syscall. ->ru_maxrss value is set to KBs which is the way it is done in BSD systems. /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs which seems to be incorrect behavior. Maintainer of this util was notified by me with the patch which corrects it and cc'ed. To make this happen we extend struct signal_struct by two fields. The first one is ->maxrss which we use to store rss hiwater of the task. The second one is ->cmaxrss which we use to store highest rss hiwater of all task childs. These values are used in k_getrusage() to actually fill ->ru_maxrss. k_getrusage() uses current rss hiwater value directly if mm struct exists. Note: exec() clear mm->hiwater_rss, but doesn't clear sig->maxrss. it is intetionally behavior. *BSD getrusage have exec() inheriting. test programs ======================================================== getrusage.c =========== #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <signal.h> #include <sys/mman.h> #include "common.h" #define err(str) perror(str), exit(1) int main(int argc, char** argv) { int status; printf("allocate 100MB\n"); consume(100); printf("testcase1: fork inherit? \n"); printf(" expect: initial.self ~= child.self\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { show_rusage("fork child"); _exit(0); } printf("\n"); printf("testcase2: fork inherit? (cont.) \n"); printf(" expect: initial.children ~= 100MB, but child.children = 0\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { show_rusage("child"); _exit(0); } printf("\n"); printf("testcase3: fork + malloc \n"); printf(" expect: child.self ~= initial.self + 50MB\n"); show_rusage("initial"); if (__fork()) { wait(&status); } else { printf("allocate +50MB\n"); consume(50); show_rusage("fork child"); _exit(0); } printf("\n"); printf("testcase4: grandchild maxrss\n"); printf(" expect: post_wait.children ~= 300MB\n"); show_rusage("initial"); if (__fork()) { wait(&status); show_rusage("post_wait"); } else { system("./child -n 0 -g 300"); _exit(0); } printf("\n"); printf("testcase5: zombie\n"); printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n"); printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n"); show_rusage("initial"); if (__fork()) { sleep(1); /* children become zombie */ show_rusage("pre_wait"); wait(&status); show_rusage("post_wait"); } else { system("./child -n 400"); _exit(0); } printf("\n"); printf("testcase6: SIG_IGN\n"); printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n"); show_rusage("initial"); signal(SIGCHLD, SIG_IGN); if (__fork()) { sleep(1); /* children become zombie */ show_rusage("after_zombie"); } else { system("./child -n 500"); _exit(0); } printf("\n"); signal(SIGCHLD, SIG_DFL); printf("testcase7: exec (without fork) \n"); printf(" expect: initial ~= exec \n"); show_rusage("initial"); execl("./child", "child", "-v", NULL); return 0; } child.c ======= #include <sys/types.h> #include <unistd.h> #include <sys/types.h> #include <sys/wait.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include "common.h" int main(int argc, char** argv) { int status; int c; long consume_size = 0; long grandchild_consume_size = 0; int show = 0; while ((c = getopt(argc, argv, "n:g:v")) != -1) { switch (c) { case 'n': consume_size = atol(optarg); break; case 'v': show = 1; break; case 'g': grandchild_consume_size = atol(optarg); break; default: break; } } if (show) show_rusage("exec"); if (consume_size) { printf("child alloc %ldMB\n", consume_size); consume(consume_size); } if (grandchild_consume_size) { if (fork()) { wait(&status); } else { printf("grandchild alloc %ldMB\n", grandchild_consume_size); consume(grandchild_consume_size); exit(0); } } return 0; } common.c ======== #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/types.h> #include <sys/time.h> #include <sys/resource.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <signal.h> #include <sys/mman.h> #include "common.h" #define err(str) perror(str), exit(1) void show_rusage(char *prefix) { int err, err2; struct rusage rusage_self; struct rusage rusage_children; printf("%s: ", prefix); err = getrusage(RUSAGE_SELF, &rusage_self); if (!err) printf("self %ld ", rusage_self.ru_maxrss); err2 = getrusage(RUSAGE_CHILDREN, &rusage_children); if (!err2) printf("children %ld ", rusage_children.ru_maxrss); printf("\n"); } /* Some buggy OS need this worthless CPU waste. */ void make_pagefault(void) { void *addr; int size = getpagesize(); int i; for (i=0; i<1000; i++) { addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); if (addr == MAP_FAILED) err("make_pagefault"); memset(addr, 0, size); munmap(addr, size); } } void consume(int mega) { size_t sz = mega * 1024 * 1024; void *ptr; ptr = malloc(sz); memset(ptr, 0, sz); make_pagefault(); } pid_t __fork(void) { pid_t pid; pid = fork(); make_pagefault(); return pid; } common.h ======== void show_rusage(char *prefix); void make_pagefault(void); void consume(int mega); pid_t __fork(void); FreeBSD result (expected result) ======================================================== allocate 100MB testcase1: fork inherit? expect: initial.self ~= child.self initial: self 103492 children 0 fork child: self 103540 children 0 testcase2: fork inherit? (cont.) expect: initial.children ~= 100MB, but child.children = 0 initial: self 103540 children 103540 child: self 103564 children 0 testcase3: fork + malloc expect: child.self ~= initial.self + 50MB initial: self 103564 children 103564 allocate +50MB fork child: self 154860 children 0 testcase4: grandchild maxrss expect: post_wait.children ~= 300MB initial: self 103564 children 154860 grandchild alloc 300MB post_wait: self 103564 children 308720 testcase5: zombie expect: pre_wait ~= initial, IOW the zombie process is not accounted. post_wait ~= 400MB, IOW wait() collect child's max_rss. initial: self 103564 children 308720 child alloc 400MB pre_wait: self 103564 children 308720 post_wait: self 103564 children 411312 testcase6: SIG_IGN expect: initial ~= after_zombie (child's 500MB alloc should be ignored). initial: self 103564 children 411312 child alloc 500MB after_zombie: self 103624 children 411312 testcase7: exec (without fork) expect: initial ~= exec initial: self 103624 children 411312 exec: self 103624 children 411312 Linux result (actual test result) ======================================================== allocate 100MB testcase1: fork inherit? expect: initial.self ~= child.self initial: self 102848 children 0 fork child: self 102572 children 0 testcase2: fork inherit? (cont.) expect: initial.children ~= 100MB, but child.children = 0 initial: self 102876 children 102644 child: self 102572 children 0 testcase3: fork + malloc expect: child.self ~= initial.self + 50MB initial: self 102876 children 102644 allocate +50MB fork child: self 153804 children 0 testcase4: grandchild maxrss expect: post_wait.children ~= 300MB initial: self 102876 children 153864 grandchild alloc 300MB post_wait: self 102876 children 307536 testcase5: zombie expect: pre_wait ~= initial, IOW the zombie process is not accounted. post_wait ~= 400MB, IOW wait() collect child's max_rss. initial: self 102876 children 307536 child alloc 400MB pre_wait: self 102876 children 307536 post_wait: self 102876 children 410076 testcase6: SIG_IGN expect: initial ~= after_zombie (child's 500MB alloc should be ignored). initial: self 102876 children 410076 child alloc 500MB after_zombie: self 102880 children 410076 testcase7: exec (without fork) expect: initial ~= exec initial: self 102880 children 410076 exec: self 102880 children 410076 Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23fix compat_sys_utimensat()Suzuki Poulose
Compat utimensat() returns EINVAL when the tv_nsec is one of UTIME_OMIT or UTIME_NOW and the tv_sec is set to non-zero. As per man pages, the tv_sec field should be ignored. sys_utimensat() works fine in this case. Test case: #define _GNU_SOURCE #define _ATFILE_SOURCE #include <stdio.h> #include <fcntl.h> #include <unistd.h> #include <sys/stat.h> #include <stdlib.h> main(int argc, char *argv[]) { struct timespec ts[2]; struct timespec *tsp; if (argc < 2) { fprintf(stderr, "Usage : %s filename\n", argv[0]); exit (-1); } ts[0].tv_nsec = ts[1].tv_nsec = UTIME_NOW; ts[0].tv_sec = ts[1].tv_sec = 1; tsp = ts; if (utimensat(AT_FDCWD, argv[1],tsp,0) == -1) perror("utimensat"); else fprintf(stdout, "utimensat success\n"); return 0; } mjs22lp5:~ # cc -m64 utimensat-test.c -o utimensat_test64 mjs22lp5:~ # cc -m32 utimensat-test.c -o utimensat_test32 mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test utimensat: Invalid argument mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test utimensat success mjs22lp5:~ # uname -r 2.6.31-rc8 With the patch : mjs22lp5:~ # ./utimensat_test64 /tmp/utimensat_test utimensat success mjs22lp5:~ # ./utimensat_test32 /tmp/utimensat_test utimensat success mjs22lp5:~ # uname -r 2.6.31-rc8utimensat Signed-off-by: Suzuki K P <suzuki@in.ibm.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23qnx4: remove write supportChristoph Hellwig
qnx4 wrte support has never been fully implement, is broken since the dawn of time and hasn't been actively developed since before git history started. Instead of letting it further bitrot and complicate API transition (like the new truncate code) remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Anders Larsen <al@alarsen.net> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23ntfs: remove ntfs_file_writeChristoph Hellwig
do_sync_write() does the right thing for turning the aio_writev method into a normal non-vectored synchronous write, no need to duplicate it in ntfs. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23anonfd: split interface into file creation and installDavide Libenzi
Split the anonfd interface into a bare file pointer creation one, and a file pointer creation plus install one. There are cases, like the usage of eventfds inside other kernel interfaces, where the file pointer created by anonfd needs to be used inside the initialization of other structures. As it is right now, as soon as anon_inode_getfd() returns, the kenrle can race with userspace closing the newly installed file descriptor. This patch, while keeping the old anon_inode_getfd(), introduces a new anon_inode_getfile() (whose services are reused in anon_inode_getfd()) that allows to split the file creation phase and the fd install one. Once all the kernel structures are initialized, the code can call the proper fd_install(). Gregory manifested the need for something like this inside KVM. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: James Morris <jmorris@namei.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23aio.c: move EXPORT* macros to line after functionH Hartley Sweeten
As mentioned in Documentation/CodingStyle, move EXPORT* macro's to the line immediately after the closing function brace line. Also, move the __initcall() similarly. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23fs/buffer.c: clean up EXPORT* macrosH Hartley Sweeten
According to Documentation/CodingStyle the EXPORT* macro should follow immediately after the closing function brace line. Also, mark_buffer_async_write_endio() and do_thaw_all() are not used elsewhere so they should be marked as static. In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to that file. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23fs: turn iprune_mutex into rwsemNick Piggin
We have had a report of bad memory allocation latency during DVD-RAM (UDF) writing. This is causing the user's desktop session to become unusable. Jan tracked the cause of this down to UDF inode reclaim blocking: gnome-screens D ffff810006d1d598 0 20686 1 ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800 ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580 ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0 Call Trace: [<ffffffff804477f3>] io_schedule+0x63/0xa5 [<ffffffff802c2587>] sync_buffer+0x3b/0x3f [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79 [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77 [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21 [<ffffffff802c442a>] __bread+0x70/0x86 [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394 [<ffffffff802bd205>] write_inode_now+0x7d/0xc4 [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53 [<ffffffff802b39ae>] clear_inode+0xc2/0x11b [<ffffffff802b3ab1>] dispose_list+0x5b/0x102 [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392 [<ffffffff802951fa>] alloc_page_vma+0x176/0x189 [<ffffffff802822d8>] __do_fault+0x10c/0x417 [<ffffffff80284232>] handle_mm_fault+0x466/0x940 [<ffffffff8044b922>] do_page_fault+0x676/0xabf This blocks with iprune_mutex held, which then blocks other reclaimers: X D ffff81009d47c400 0 17285 14831 ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288 ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400 ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740 Call Trace: [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9 [<ffffffff80447e1a>] mutex_lock+0x1e/0x22 [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213 [<ffffffff8027ede3>] shrink_slab+0xe3/0x158 [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232 [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392 [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6 [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73 [<ffffffff802ad949>] do_select+0x308/0x506 [<ffffffff802adced>] core_sys_select+0x1a6/0x254 [<ffffffff802ae0b7>] sys_select+0xb5/0x157 Now I think the main problem is having the filesystem block (and do IO) in inode reclaim. The problem is that this doesn't get accounted well and penalizes a random allocator with a big latency spike caused by work generated from elsewhere. I think the best idea would be to avoid this. By design if possible, or by deferring the hard work to an asynchronous context. If the latter, then the fs would probably want to throttle creation of new work with queue size of the deferred work, but let's not get into those details. Anyway, the other obvious thing we looked at is the iprune_mutex which is causing the cascading blocking. We could turn this into an rwsem to improve concurrency. It is unreasonable to totally ban all potentially slow or blocking operations in inode reclaim, so I think this is a cheap way to get a small improvement. This doesn't solve the whole problem of course. The process doing inode reclaim will still take the latency hit, and concurrent processes may end up contending on filesystem locks. So fs developers should keep these problems in mind. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jan Kara <jack@ucw.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23seq_file: constify seq_operationsJames Morris
Make all seq_operations structs const, to help mitigate against revectoring user-triggerable function pointers. This is derived from the grsecurity patch, although generated from scratch because it's simpler than extracting the changes from there. Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23Move magic numbers into magic.hNick Black
Move various magic-number definitions into magic.h. Signed-off-by: Nick Black <dank@qemfd.net> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23poll/select: avoid arithmetic overflow in __estimate_accuracy()Guillaume Knispel
__estimate_accuracy() was prone to integer overflow, for example if *tv == {2147, 483648000} on a 32 bit computer (or even for delays as small as {429, 500000000} if the task is niced). Because the result was already forced between 0 and 100ms, the effect of the overflow was not too problematic, but the use of the hrtimer range feature was not optimal in overflow cases. This patch ensures that there can not be an integer overflow in this function. Signed-off-by: Guillaume Knispel <gknispel@proformatique.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23smbfs: read buffer overflowRoel Kluin
This function uses signed integers for the unix_date and local variables - if a negative number is supplied and the leap-year condition is not met, month will be 0, leading to a read of day_n[-1] Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23eCryptfs: Prevent lower dentry from going negative during unlinkTyler Hicks
When calling vfs_unlink() on the lower dentry, d_delete() turns the dentry into a negative dentry when the d_count is 1. This eventually caused a NULL pointer deref when a read() or write() was done and the negative dentry's d_inode was dereferenced in ecryptfs_read_update_atime() or ecryptfs_getxattr(). Placing mutt's tmpdir in an eCryptfs mount is what initially triggered the oops and I was able to reproduce it with the following sequence: open("/tmp/upper/foo", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW, 0600) = 3 link("/tmp/upper/foo", "/tmp/upper/bar") = 0 unlink("/tmp/upper/foo") = 0 open("/tmp/upper/bar", O_RDWR|O_CREAT|O_NOFOLLOW, 0600) = 4 unlink("/tmp/upper/bar") = 0 write(4, "eCryptfs test\n"..., 14 <unfinished ...> +++ killed by SIGKILL +++ https://bugs.launchpad.net/ecryptfs/+bug/387073 Reported-by: Loïc Minier <loic.minier@canonical.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: stable <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-09-23eCryptfs: Propagate vfs_read and vfs_write return codesTyler Hicks
Errors returned from vfs_read() and vfs_write() calls to the lower filesystem were being masked as -EINVAL. This caused some confusion to users who saw EINVAL instead of ENOSPC when the disk was full, for instance. Also, the actual bytes read or written were not accessible by callers to ecryptfs_read_lower() and ecryptfs_write_lower(), which may be useful in some cases. This patch updates the error handling logic where those functions are called in order to accept positive return codes indicating success. Cc: Eric Sandeen <esandeen@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-09-23eCryptfs: Validate global auth tok keysTyler Hicks
When searching through the global authentication tokens for a given key signature, verify that a matching key has not been revoked and has not expired. This allows the `keyctl revoke` command to be properly used on keys in use by eCryptfs. Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: stable <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-09-23eCryptfs: Filename encryption only supports password auth tokensTyler Hicks
Returns -ENOTSUPP when attempting to use filename encryption with something other than a password authentication token, such as a private token from openssl. Using filename encryption with a userspace eCryptfs key module is a future goal. Until then, this patch handles the situation a little better than simply using a BUG_ON(). Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: stable <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-09-23eCryptfs: Check for O_RDONLY lower inodes when opening lower filesTyler Hicks
If the lower inode is read-only, don't attempt to open the lower file read/write and don't hand off the open request to the privileged eCryptfs kthread for opening it read/write. Instead, only try an unprivileged, read-only open of the file and give up if that fails. This patch fixes an oops when eCryptfs is mounted on top of a read-only mount. Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Eric Sandeen <esandeen@redhat.com> Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: stable <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-09-23eCryptfs: Handle unrecognized tag 3 cipher codesTyler Hicks
Returns an error when an unrecognized cipher code is present in a tag 3 packet or an ecryptfs_crypt_stat cannot be initialized. Also sets an crypt_stat->tfm error pointer to NULL to ensure that it will not be incorrectly freed in ecryptfs_destroy_crypt_stat(). Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: ecryptfs-devel@lists.launchpad.net Cc: stable <stable@kernel.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-09-23ecryptfs: improved dependency checking and reportingDave Hansen
So, I compiled a 2.6.31-rc5 kernel with ecryptfs and loaded its module. When it came time to mount my filesystem, I got this in dmesg, and it refused to mount: [93577.776637] Unable to allocate crypto cipher with name [aes]; rc = [-2] [93577.783280] Error attempting to initialize key TFM cipher with name = [aes]; rc = [-2] [93577.791183] Error attempting to initialize cipher with name = [aes] and key size = [32]; rc = [-2] [93577.800113] Error parsing options; rc = [-22] I figured from the error message that I'd either forgotten to load "aes" or that my key size was bogus. Neither one of those was the case. In fact, I was missing the CRYPTO_ECB config option and the 'ecb' module. Unfortunately, there's no trace of 'ecb' in that error message. I've done two things to fix this. First, I've modified ecryptfs's Kconfig entry to select CRYPTO_ECB and CRYPTO_CBC. I also took CRYPTO out of the dependencies since the 'select' will take care of it for us. I've also modified the error messages to print a string that should contain both 'ecb' and 'aes' in my error case. That will give any future users a chance of finding the right modules and Kconfig options. I also wonder if we should: select CRYPTO_AES if !EMBEDDED since I think most ecryptfs users are using AES like me. Cc: ecryptfs-devel@lists.launchpad.net Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Dustin Kirkland <kirkland@canonical.com> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> [tyhicks@linux.vnet.ibm.com: Removed extra newline, 80-char violation] Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-09-23eCryptfs: Fix lockdep-reported AB-BA mutex issueRoland Dreier
Lockdep reports the following valid-looking possible AB-BA deadlock with global_auth_tok_list_mutex and keysig_list_mutex: ecryptfs_new_file_context() -> ecryptfs_copy_mount_wide_sigs_to_inode_sigs() -> mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex); -> ecryptfs_add_keysig() -> mutex_lock(&crypt_stat->keysig_list_mutex); vs ecryptfs_generate_key_packet_set() -> mutex_lock(&crypt_stat->keysig_list_mutex); -> ecryptfs_find_global_auth_tok_for_sig() -> mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex); ie the two mutexes are taken in opposite orders in the two different code paths. I'm not sure if this is a real bug where two threads could actually hit the two paths in parallel and deadlock, but it at least makes lockdep impossible to use with ecryptfs since this report triggers every time and disables future lockdep reporting. Since ecryptfs_add_keysig() is called only from the single callsite in ecryptfs_copy_mount_wide_sigs_to_inode_sigs(), the simplest fix seems to be to move the lock of keysig_list_mutex back up outside of the where global_auth_tok_list_mutex is taken. This patch does that, and fixes the lockdep report on my system (and ecryptfs still works OK). The full output of lockdep fixed by this patch is: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.31-2-generic #14~rbd2 ------------------------------------------------------- gdm/2640 is trying to acquire lock: (&mount_crypt_stat->global_auth_tok_list_mutex){+.+.+.}, at: [<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 but task is already holding lock: (&crypt_stat->keysig_list_mutex){+.+.+.}, at: [<ffffffff81217728>] ecryptfs_generate_key_packet_set+0x58/0x2b0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&crypt_stat->keysig_list_mutex){+.+.+.}: [<ffffffff8108c897>] check_prev_add+0x2a7/0x370 [<ffffffff8108cfc1>] validate_chain+0x661/0x750 [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430 [<ffffffff8108d585>] lock_acquire+0xa5/0x150 [<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0 [<ffffffff81552b56>] mutex_lock_nested+0x46/0x60 [<ffffffff8121526a>] ecryptfs_add_keysig+0x5a/0xb0 [<ffffffff81213299>] ecryptfs_copy_mount_wide_sigs_to_inode_sigs+0x59/0xb0 [<ffffffff81214b06>] ecryptfs_new_file_context+0xa6/0x1a0 [<ffffffff8120e42a>] ecryptfs_initialize_file+0x4a/0x140 [<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60 [<ffffffff8113a7d4>] vfs_create+0xb4/0xe0 [<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110 [<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0 [<ffffffff8112d8d9>] do_sys_open+0x69/0x140 [<ffffffff8112d9f0>] sys_open+0x20/0x30 [<ffffffff81013132>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff -> #0 (&mount_crypt_stat->global_auth_tok_list_mutex){+.+.+.}: [<ffffffff8108c675>] check_prev_add+0x85/0x370 [<ffffffff8108cfc1>] validate_chain+0x661/0x750 [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430 [<ffffffff8108d585>] lock_acquire+0xa5/0x150 [<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0 [<ffffffff81552b56>] mutex_lock_nested+0x46/0x60 [<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 [<ffffffff812177d5>] ecryptfs_generate_key_packet_set+0x105/0x2b0 [<ffffffff81212f49>] ecryptfs_write_headers_virt+0xc9/0x120 [<ffffffff8121306d>] ecryptfs_write_metadata+0xcd/0x200 [<ffffffff8120e44b>] ecryptfs_initialize_file+0x6b/0x140 [<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60 [<ffffffff8113a7d4>] vfs_create+0xb4/0xe0 [<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110 [<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0 [<ffffffff8112d8d9>] do_sys_open+0x69/0x140 [<ffffffff8112d9f0>] sys_open+0x20/0x30 [<ffffffff81013132>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff other info that might help us debug this: 2 locks held by gdm/2640: #0: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8113cb8b>] do_filp_open+0x3cb/0xae0 #1: (&crypt_stat->keysig_list_mutex){+.+.+.}, at: [<ffffffff81217728>] ecryptfs_generate_key_packet_set+0x58/0x2b0 stack backtrace: Pid: 2640, comm: gdm Tainted: G C 2.6.31-2-generic #14~rbd2 Call Trace: [<ffffffff8108b988>] print_circular_bug_tail+0xa8/0xf0 [<ffffffff8108c675>] check_prev_add+0x85/0x370 [<ffffffff81094912>] ? __module_text_address+0x12/0x60 [<ffffffff8108cfc1>] validate_chain+0x661/0x750 [<ffffffff81017275>] ? print_context_stack+0x85/0x140 [<ffffffff81089c68>] ? find_usage_backwards+0x38/0x160 [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430 [<ffffffff8108d585>] lock_acquire+0xa5/0x150 [<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 [<ffffffff8108b0b0>] ? check_usage_backwards+0x0/0xb0 [<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0 [<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 [<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 [<ffffffff8108c02c>] ? mark_held_locks+0x6c/0xa0 [<ffffffff81125b0d>] ? kmem_cache_alloc+0xfd/0x1a0 [<ffffffff8108c34d>] ? trace_hardirqs_on_caller+0x14d/0x190 [<ffffffff81552b56>] mutex_lock_nested+0x46/0x60 [<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90 [<ffffffff812177d5>] ecryptfs_generate_key_packet_set+0x105/0x2b0 [<ffffffff81212f49>] ecryptfs_write_headers_virt+0xc9/0x120 [<ffffffff8121306d>] ecryptfs_write_metadata+0xcd/0x200 [<ffffffff81210240>] ? ecryptfs_init_persistent_file+0x60/0xe0 [<ffffffff8120e44b>] ecryptfs_initialize_file+0x6b/0x140 [<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60 [<ffffffff8113a7d4>] vfs_create+0xb4/0xe0 [<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110 [<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0 [<ffffffff8129a93e>] ? _raw_spin_unlock+0x5e/0xb0 [<ffffffff8155410b>] ? _spin_unlock+0x2b/0x40 [<ffffffff81139e9b>] ? getname+0x3b/0x240 [<ffffffff81148a5a>] ? alloc_fd+0xfa/0x140 [<ffffffff8112d8d9>] do_sys_open+0x69/0x140 [<ffffffff81553b8f>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8112d9f0>] sys_open+0x20/0x30 [<ffffffff81013132>] system_call_fastpath+0x16/0x1b Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-09-23ecryptfs: Remove unneeded locking that triggers lockdep false positivesRoland Dreier
In ecryptfs_destroy_inode(), inode_info->lower_file_mutex is locked, and just after the mutex is unlocked, the code does: kmem_cache_free(ecryptfs_inode_info_cache, inode_info); This means that if another context could possibly try to take the same mutex as ecryptfs_destroy_inode(), then it could end up getting the mutex just before the data structure containing the mutex is freed. So any such use would be an obvious use-after-free bug (catchable with slab poisoning or mutex debugging), and therefore the locking in ecryptfs_destroy_inode() is not needed and can be dropped. Similarly, in ecryptfs_destroy_crypt_stat(), crypt_stat->keysig_list_mutex is locked, and then the mutex is unlocked just before the code does: memset(crypt_stat, 0, sizeof(struct ecryptfs_crypt_stat)); Therefore taking this mutex is similarly not necessary. Removing this locking fixes false-positive lockdep reports such as the following (and they are false-positives for exactly the same reason that the locking is not needed): ================================= [ INFO: inconsistent lock state ] 2.6.31-2-generic #14~rbd3 --------------------------------- inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. kswapd0/323 [HC0[0]:SC0[0]:HE1:SE1] takes: (&inode_info->lower_file_mutex){+.+.?.}, at: [<ffffffff81210d34>] ecryptfs_destroy_inode+0x34/0x100 {RECLAIM_FS-ON-W} state was registered at: [<ffffffff8108c02c>] mark_held_locks+0x6c/0xa0 [<ffffffff8108c10f>] lockdep_trace_alloc+0xaf/0xe0 [<ffffffff81125a51>] kmem_cache_alloc+0x41/0x1a0 [<ffffffff8113117a>] get_empty_filp+0x7a/0x1a0 [<ffffffff8112dd46>] dentry_open+0x36/0xc0 [<ffffffff8121a36c>] ecryptfs_privileged_open+0x5c/0x2e0 [<ffffffff81210283>] ecryptfs_init_persistent_file+0xa3/0xe0 [<ffffffff8120e838>] ecryptfs_lookup_and_interpose_lower+0x278/0x380 [<ffffffff8120f97a>] ecryptfs_lookup+0x12a/0x250 [<ffffffff8113930a>] real_lookup+0xea/0x160 [<ffffffff8113afc8>] do_lookup+0xb8/0xf0 [<ffffffff8113b518>] __link_path_walk+0x518/0x870 [<ffffffff8113bd9c>] path_walk+0x5c/0xc0 [<ffffffff8113be5b>] do_path_lookup+0x5b/0xa0 [<ffffffff8113bfe7>] user_path_at+0x57/0xa0 [<ffffffff811340dc>] vfs_fstatat+0x3c/0x80 [<ffffffff8113424b>] vfs_stat+0x1b/0x20 [<ffffffff81134274>] sys_newstat+0x24/0x50 [<ffffffff81013132>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff irq event stamp: 7811 hardirqs last enabled at (7811): [<ffffffff810c037f>] call_rcu+0x5f/0x90 hardirqs last disabled at (7810): [<ffffffff810c0353>] call_rcu+0x33/0x90 softirqs last enabled at (3764): [<ffffffff810631da>] __do_softirq+0x14a/0x220 softirqs last disabled at (3751): [<ffffffff8101440c>] call_softirq+0x1c/0x30 other info that might help us debug this: 2 locks held by kswapd0/323: #0: (shrinker_rwsem){++++..}, at: [<ffffffff810f67ed>] shrink_slab+0x3d/0x190 #1: (&type->s_umount_key#35){.+.+..}, at: [<ffffffff811429a1>] prune_dcache+0xd1/0x1b0 stack backtrace: Pid: 323, comm: kswapd0 Tainted: G C 2.6.31-2-generic #14~rbd3 Call Trace: [<ffffffff8108ad6c>] print_usage_bug+0x18c/0x1a0 [<ffffffff8108aff0>] ? check_usage_forwards+0x0/0xc0 [<ffffffff8108bac2>] mark_lock_irq+0xf2/0x280 [<ffffffff8108bd87>] mark_lock+0x137/0x1d0 [<ffffffff81164710>] ? fsnotify_clear_marks_by_inode+0x30/0xf0 [<ffffffff8108bee6>] mark_irqflags+0xc6/0x1a0 [<ffffffff8108d337>] __lock_acquire+0x287/0x430 [<ffffffff8108d585>] lock_acquire+0xa5/0x150 [<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100 [<ffffffff8108d2e7>] ? __lock_acquire+0x237/0x430 [<ffffffff815526ad>] __mutex_lock_common+0x4d/0x3d0 [<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100 [<ffffffff81164710>] ? fsnotify_clear_marks_by_inode+0x30/0xf0 [<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100 [<ffffffff8129a91e>] ? _raw_spin_unlock+0x5e/0xb0 [<ffffffff81552b36>] mutex_lock_nested+0x46/0x60 [<ffffffff81210d34>] ecryptfs_destroy_inode+0x34/0x100 [<ffffffff81145d27>] destroy_inode+0x87/0xd0 [<ffffffff81146b4c>] generic_delete_inode+0x12c/0x1a0 [<ffffffff81145832>] iput+0x62/0x70 [<ffffffff811423c8>] dentry_iput+0x98/0x110 [<ffffffff81142550>] d_kill+0x50/0x80 [<ffffffff81142623>] prune_one_dentry+0xa3/0xc0 [<ffffffff811428b1>] __shrink_dcache_sb+0x271/0x290 [<ffffffff811429d9>] prune_dcache+0x109/0x1b0 [<ffffffff81142abf>] shrink_dcache_memory+0x3f/0x50 [<ffffffff810f68dd>] shrink_slab+0x12d/0x190 [<ffffffff810f9377>] balance_pgdat+0x4d7/0x640 [<ffffffff8104c4c0>] ? finish_task_switch+0x40/0x150 [<ffffffff810f63c0>] ? isolate_pages_global+0x0/0x60 [<ffffffff810f95f7>] kswapd+0x117/0x170 [<ffffffff810777a0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff810f94e0>] ? kswapd+0x0/0x170 [<ffffffff810773be>] kthread+0x9e/0xb0 [<ffffffff8101430a>] child_rip+0xa/0x20 [<ffffffff81013c90>] ? restore_args+0x0/0x30 [<ffffffff81077320>] ? kthread+0x0/0xb0 [<ffffffff81014300>] ? child_rip+0x0/0x20 Signed-off-by: Roland Dreier <roland@digitalvampire.org> Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-09-23ocfs2: Use buffer IO if we are appending a file.Tao Ma
In ocfs2_file_aio_write, we will prevent direct io if we find that we are appending(changing i_size) and call generic_file_aio_write_nolock. But actually O_DIRECT flag is there and this function will call generic_file_direct_write eventually which will update i_size and leave di->i_size alone. The bug is http://oss.oracle.com/bugzilla/show_bug.cgi?id=1173. So this patch let ocfs2_direct_IO returns 0 directly if we are appending so that buffered write will be called and di->i_size get updated successfully. And this is also what we want in ocfs2_file_aio_write. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-23ocfs2: add spinlock protection when dealing with lockres->purge.Wengang Wang
when we check/modify lockres->purge, we should with the protection of lockres->spinlock. in dlm_purge_lockres(), the checking/modifying is not with the protectin. this patch fixes it. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-23dlmglue.c: add missed mlog linesColy Li
This patch adds the missed mlog_exit() and mlog_exit_void() lines when routines return. Signed-off-by: Coly Li <coly.li@suse.de> Acked-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-23ocfs2: __ocfs2_abort() should not enable panic for local mountsSunil Mushran
In a clustered setup, we have to panic the box on journal abort. This is because we don't have the facility to go hard readonly. With hard ro, another node would detect node failure and initiate recovery. Having said that, we shouldn't force panic if the volume is mounted locally. This patch defers the handling to the mount option, errors. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-22ocfs2: Add ioctl for reflink.Tao Ma
The ioctl will take 3 parameters: old_path, new_path and preserve and call vfs_reflink. It is useful when we backport reflink features to old kernels. Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22ocfs2: Enable refcount tree support.Tao Ma
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22ocfs2: Implement ocfs2_reflink.Tao Ma
Implement ocfs2_reflink. Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22ocfs2: Add preserve to reflink.Tao Ma
reflink has 2 options for the destination file: 1. snapshot: reflink will attempt to preserve ownership, permissions, and all other security state in order to create a full snapshot. 2. new file: it will acquire the data extent sharing but will see the file's security state and attributes initialized as a new file. So add the option to ocfs2. Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22ocfs2: Create reflinked file in orphan dir.Tao Ma
reflink is a very complicated process, so it can't be integrated into one transaction. So if the system panic in the operation, we may leave a unfinished inode in the destication directory. So we will try to create an inode in orphan_dir first, reflink it to the src file and then move it to the destication file in the end. In that way we won't be afraid of any corruption during the reflink. This patch adds 2 functions for orphan_dir operation: 1. Create a new inode in orphand dir. 2. Move an inode to a target dir. Note: fsck.ocfs2 should work for us to remove the unfinished file in the orphan_dir. Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22ocfs2: Use proper parameter for some inode operation.Tao Ma
In order to make the original function more suitable for reflink, we modify the following inode operations. Both are tiny. 1. ocfs2_mknod_locked only use dentry for mlog, so move it to the caller so that reflink can use it without dentry. 2. ocfs2_prepare_orphan_dir only want inode to get its ip_blkno. So use ip_blkno instead. Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22ocfs2: Make transaction extend more efficient.Tao Ma
In ocfs2_extend_rotate_transaction, op_credits is the orignal credits in the handle and we only want to extend the credits for the rotation, but the old solution always double it. It is harmless for some minor operations, but for actions like reflink we may rotate tree many times and cause the credits increase dramatically. So this patch try to only increase the desired credits. Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22ocfs2: Don't merge in 1st refcount ops of reflink.Tao Ma
Actually the whole reflink will touch refcount tree 2 times: 1. It will add the clusters in the extent record to the tree if it isn't refcounted before. 2. It will add 1 refcount to these clusters when it add these extent records to the tree. So actually we shouldn't do merge in the 1st operation since the 2nd one will soon be called and we may have to split it again. Do a merge first and split soon is a waste of time. So we only merge in the 2nd round. This is done by adding a new internal __ocfs2_increase_refcount and call it with "not-merge" for 1st refcount operation in reflink. This also has a side-effect that we don't need to worry too much about the metadata allocation in the 2nd round since it will only merge and no split will happen for those records. Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22ocfs2: Modify removing xattr process for refcount.Tao Ma
The old xattr value remove is quite simple, it just erase the tree and free the clusters. But as we have added refcount support, The process is a little complicated. We have to lock the refcount tree at the beginning, what's more, we may split the refcount tree in some cases, so meta/credits are needed. Signed-off-by: Tao Ma <tao.ma@oracle.com>