diff options
author | neal <neal> | 2003-09-23 18:20:58 +0000 |
---|---|---|
committer | neal <neal> | 2003-09-23 18:20:58 +0000 |
commit | 61537f4ef4a693f9c2121ce962f3dda54f9c77fd (patch) | |
tree | 1a98e8646c71a0f55f18f52f1816bd51bc6211a3 /doc | |
parent | d1fac21443fd260eb7bb5a05ee8144fdf1d80bba (diff) |
Add the rational behind the VMM.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/hurd-on-l4.tex | 13 | ||||
-rw-r--r-- | doc/vmm.tex | 325 |
2 files changed, 316 insertions, 22 deletions
diff --git a/doc/hurd-on-l4.tex b/doc/hurd-on-l4.tex index 2dda4b6..1a58b16 100644 --- a/doc/hurd-on-l4.tex +++ b/doc/hurd-on-l4.tex @@ -4,14 +4,25 @@ \newenvironment{comment}{\footnotesize \begin{quote}}{\end{quote}} +\newenvironment{code} + {\begin{quote}} + {\end{quote}} +\newcommand{\keyword}[1]{\texttt{#1}} +\newcommand{\function}[1]{\texttt{#1}} + + \title{Porting the GNU Hurd to the L4 Microkernel} \author{Marcus Brinkmann} \date{August 2003} \begin{document} +\frontmatter \maketitle \tableofcontents + +\mainmatter + \setlength{\parindent}{0pt} \setlength{\parskip}{1ex plus 0.5ex minus 0.2ex} @@ -24,4 +35,6 @@ \include{debugging} \include{device-drivers} +% \backmatter + \end{document} diff --git a/doc/vmm.tex b/doc/vmm.tex index a41c31e..63b5879 100644 --- a/doc/vmm.tex +++ b/doc/vmm.tex @@ -1,26 +1,307 @@ \chapter{Virtual Memory Management} -Traditionally, monolithical kernels, but even kernels like Mach, -provide a virtual memory management system in the kernel. All paging -decisions are made by the kernel itself. This requires good -heuristics. Smart paging decisions are often not possible because the -kernel lacks the information about how the data is used. - -In the Hurd, paging will be done locally in each task. A physical -memory server provides a number of guaranteed physical pages to tasks. -It will also provide a number of excess pages (over-commit). The task -might have to return any number of excess pages on short notice. If -the task does not comply, all mappings are revoked (essentially -killing the task). - -A problem arises when data has to be exchanged between a client and a -server, and the server wants to have control over the content of the -pages (for example, pass it on to other servers, like device drivers). -The client can not map the pages directly into the servers address -space, as it is not trusted. 
Container objects created in the -physical memory server and mapped into the client and/or the servers -address space will provide the necessary security features to allow -this. This can be used for DMA and zero-copying in the data exchange -between device drivers and (untrusted) user tasks. +\begin{quote} +\emph{The mind and memory are more sharply exercised in comprehending +another man's things than our own.} +\begin{flushright} +\emph{Timber} or \emph{Discoveries} by Ben Jonson +\end{flushright} +\end{quote} + +\section{Introduction} + +The goal of an operating system is simply, perhaps reductively, +stated: manage the available resources. In other words, it is the +operating system's job to dictate the policy for obtaining resources +and to provide mechanisms to use them. Most resources which the +operating system manages are scarce resources, for instance the CPUs, +the memory and the various peripherals including graphics cards and +hard drives. Any given process, therefore, needs to compete with the +other processes in the system for some subset of the available +resources at any given time. As can be imagined, the policy to access +and the mechanisms to use these resources determine many important +characteristics of the system. + +A simple single-user system may use a trivial first-come, first-served +policy for allocating resources, a device abstraction layer and no +protection domains. Although this design may be very light-weight and +the thin access layer conducive to high speed, this design will only +work on a system where all programs can be trusted: a single malicious +or buggy program can potentially keep all others from making progress +simply by refusing to yield the CPU or by allocating and not releasing +resources in a timely fashion. + +The Hurd, like Unix, aims to provide strong protection domains thereby +preventing processes from accidentally or maliciously harming the rest +of the system. Unix has shown that this can be done efficiently.
But +more than Unix, the Hurd aims to identify pieces of the system +which Unix placed in the kernel but which need not be there, as they +could be done in user space and provide additional user flexibility. +Through our experience and analysis, we are convinced that one such area is +much of the virtual memory system: tasks often allocate as much +memory as they can without regard---not least because Unix provides +them with no mechanism to show such regard---for the rest of the system. +It is not a purely cooperative model which we wish to embrace, but one +which holds the users of a resource responsible for it: a task that is +asked to release some of its memory either complies or violates the +social contract and faces exile. Not only +will this empower users but it will force them to make smarter +decisions. + +\subsection{Learning from Unix} + +Unix was designed as a multiuser timesharing system with protection +domains, thereby permitting process separation, i.e. allowing different +users to concurrently run processes in the system and gain access to +resources in a controlled fashion such that any one process cannot +hurt or excessively starve any other. Unix achieved this through a +monolithic kernel design wherein both policy and mechanism are +provided by the kernel. Due to the limited hardware available at the +time and the state of Multics\footnote{Multics was seen as a system +which would never be realized due to its overly ambitious feature set.}, +Unix imposed a strong policy on how resources could be used: a program +could access files; however, lower-level mechanisms such as the file +system, the virtual file system, network protocol stacks and device +drivers all existed in the kernel proper. This approach made sense +for the extremely limited hardware that Unix was targeted for in the +1970s.
As hardware performance increased, however, a separation +between mechanism and policy never took place, and today Unix-like +operating systems are in a very similar state to those available two +decades ago; certainly, the implementations have been vastly improved +and tuned; however, the fundamental design remains the same. + +One of the most important of the policy/mechanism couplings in the +kernel is the virtual memory subsystem: every component in the system +needs memory for a variety of reasons and with different priorities. +The system must attempt to meet a given set of allocation criteria. However, +as the kernel does not and cannot know how a task will use its +memory, and can only guess based on page fault statistics, it is bound to +make sub-ideal eviction decisions. It is in part through years of +fine tuning that Unix is able to perform as well as it does for the +general applications which fit its assumed statistical model. + +\subsection{Learning from Mach} + +The faults of Unix became clear through the use of Mach. The +designers of Mach observed that there was too much mechanism in the +kernel and attempted to export the file systems, network stack and +much of the system API into user space servers. They left a very +powerful VMM in the kernel along with the device drivers and a novel IPC +system. Our experience shows that the VMM, although very flexible, is +unable to make smart paging decisions: because the Unix kernel was tied into so +many subsystems, it had fair knowledge of how a lot of the memory in +the system was being used. It could therefore make good guesses about +what memory could be evicted and would not be needed in the near future. +Mach, however, did not have this advantage and relied strictly on page +fault statistics and access pattern detection for its page eviction +policy.
+ +Based on this observation, it is imperative that the page eviction +scheme have good knowledge about how pages are being used, as it only +takes a few bad decisions to destroy performance. Thus, a new +design can either return to the monolithic design and add +even more knowledge to the kernel to increase performance, or the page +eviction scheme can be removed from the kernel completely and placed in +user space, making all tasks self-paged. + +\subsection{Following the Hurd Philosophy} + +As the Hurd aims, like Unix, to be a multiuser system for mutually +untrusted users, security is an absolute necessity. But it is not the +object of the system to limit users excessively: as long as operations +can be done securely, they should be permitted. It is based on this +philosophy that we have adopted a self-paging design for the new Hurd +VMM: who knows better how a task will use its memory than the task +itself? This is clear from the problems that have been encountered +with LRU, the basic page eviction algorithm, by database developers, +language designers implementing garbage collectors and soft-realtime +application developers such as multimedia developers: they all wrestle +with the underlying operating system's page eviction scheme. By +putting the responsibility to page on tasks, we think that tasks will +be forced to make smart decisions, as bad ones can only hurt themselves. + +\section{Memory Allocation} + +If memory were infinite and the only problem were preventing one +program from accessing the memory of another, memory allocation would be +trivial. This is not, however, the case: memory is decidedly finite and +a well-designed system will exploit it all. As memory is a system +resource, a system-wide memory allocation policy must be established +which maximizes memory usage according to a given set of criteria. + +In a typical Unix-like VMM, allocating memory (e.g.
using +\function{sbrk} or \function{mmap}) does not allocate physical memory +but \keyword{virtual memory}. In order to increase the amount of +memory available to users, the kernel uses a \keyword{backing store}, +typically a hard disk, to temporarily free physical memory, thereby +allowing other processes to make progress. The sum of these two is +referred to as virtual memory. The use of backing store ensures data +integrity when physical memory must be freed and application +transparency is required. A variety of criteria are used to determine +which frames are \keyword{paged out}; however, most often some form of +a priority-based least recently used (LRU) algorithm is applied. Upon +\keyword{memory pressure}, the system steals pages from low-priority +processes which have not been used recently or drains pages from an +internal cache. + +This design has a major problem: the kernel has to evict the pages but +only the applications know which pages they really need in the near +term. The kernel could ask the applications for this data; however, +it is unable to trust the applications as they could, for instance, +not respond, and the kernel would have to forcefully evict pages +anyway. As such, the kernel relies on page fault statistics to make +projections about how the memory will be used, thus the LRU eviction +scheme. An additional result of this scheme is that as applications +never know if mapped memory is in core, they are unable to make +guarantees about deadlines. + +These problems are grounded in the way the Unix VMM allocates memory: +it does not allocate physical memory but virtual memory. This is +illustrated by the following scenario: when a process starts and begins +to use memory, the allocator will happily give it all of the memory in the +system as long as no other process wants it. What happens, however, +when a second memory-hungry process starts is that the kernel has no +way to take back memory it allocated to the first process.
At this +point, it has two options: it can either return failure to the second +process or it can steal memory from the first process and send it to +backing store. + +One way to solve these problems is to have the VMM allocate physical +memory and make applications completely self-paged. Thus, the burden +of paging lies with the applications themselves. When applications request +memory, they no longer request virtual memory but physical memory. +Once an application has exhausted its available frames, it is its +responsibility to multiplex the available frames. Thus, virtual +memory management is done in the application itself. It is important to note +that a standard manager or managers should be supplied by the +operating system. This is important for implementing something like a +POSIX personality. This should not, however, be hard-coded: certain +applications may greatly benefit from being able to control their own +eviction schemes. At its most basic level, hints could be provided to +the manager by introducing extensions to basic function calls. For +instance, \function{malloc} could take an extra parameter indicating +the class of data being allocated. These classes would provide hints +about the expected usage pattern and lifetime of the data. + +\subsection{Bootstrap} + +When the Hurd starts up, all physical memory is eventually transferred +to the physical memory server by the root server. At this point, the +physical memory server will control all of the physical pages in the +system. + +\subsection{Allocation Policy} + +The physical memory server maintains a concept of \keyword{guaranteed +pages} and \keyword{extra pages}. The former are pages that a given +task is guaranteed to be able to map in a very short amount of time. Given this +predicate, the total number of guaranteed pages can never exceed the +total number of frames in the system. Extra pages are pages which are +given to clients who have reached their guaranteed page allocation +limit.
The physical memory server may request that a client +relinquish a number of extant extra pages at any time. The client +must return the pages to the physical memory server (i.e. free them) in a +short amount of time. Should a task fail to do this, it risks having +all of its memory dropped (i.e. not swapped out or saved in any way) +and reclaimed by the physical memory server. + +Readers familiar with VMS will see a striking similarity between these +two systems. This is not without reason. Yet differences remain: +VMS does not have extra pages, and the number of pages is fixed at task +creation time. VMS then maintains a dirty list of pages, thereby +having a very fast backing store and essentially allowing tasks to +have more than their quota of memory if there is no memory pressure. +One reason that this is not copied in this design is that, unlike in VMS, +the file systems and device drivers are in user space. Thus, the +caching that was being done by VMS can not be done intelligently by +the physical memory server. + +The number of guaranteed pages that a given task has access to is not +determined by the physical memory server but by the \keyword{memory +policy server}. This division allows the physical memory server to +concern itself only with the mechanisms and means that it must know +essentially nothing about how the underlying operating system +functions. (The implication is that although tailored for Hurd-specific +needs, the physical memory server is completely separate from +the Hurd and can be used by other operating systems running on the +microkernel.) Thus, it is the memory policy server's responsibility +to determine who gets how much memory. This may be determined as a +function of the user or by looking in a file on disk for, e.g., quotas. As +can be seen, this type of data acquisition could add significant +complexity to the physical memory server and require blocking states +(e.g.
waiting for a read operation during file I/O) and could create +circular dependencies. + +The physical memory server and the memory policy server will share a +buffer of tuples indexed by task ID containing the number of +allocated pages, the number of guaranteed pages, and a boolean +indicating whether or not the task is eligible for extra pages. +The guaranteed page field and the extra page predicate may only be +written to by the memory policy server. The number of allocated pages +may only be written to by the physical memory server. This scheme +means that no locking is required. (On some architectures where a +read of a given field cannot be performed in a single operation, the +read may have to be done twice.) + +Until the memory policy server makes initial contact with the +physical memory server, memory will be allocated on a first-come, +first-served basis. The memory policy server shall use the following remote +procedure call to contact the physical memory server: + +\begin{code} +error\_t physical\_memory\_server\_introduce (void) +\end{code} + +\noindent +This function will succeed the first time it is called. It will fail +all subsequent times. The physical memory server will record the +sender of this RPC as the memory policy server and begin allocating +memory according to the previously described protocol. + +The shared policy buffer may be obtained from the physical memory +server by the memory policy server by calling: + +\begin{code} +error\_t physical\_memory\_server\_get\_policy\_buffer (out l4\_map\_t buffer) +\end{code} + +\noindent +The returned buffer is mapped with read and write access into the +memory policy server's address space. It may need to be resized. If +this is the case, the physical memory server shall unmap the buffer +from the memory policy server's address space, copying the buffer +internally as required. The memory policy server will fault on the +memory region on its next access and may then repeat the call.
This +call will succeed when the sender is the memory policy server; it will +fail otherwise. + +\subsection{Allocation Mechanisms} + +Applications are able allocate memory by Memory allocation will be + + +% Traditionally, monolithical kernels, but even kernels like Mach, +% provide a virtual memory management system in the kernel. All paging +% decisions are made by the kernel itself. This requires good +% heuristics. Smart paging decisions are often not possible because the +% kernel lacks the information about how the data is used. +% +% In the Hurd, paging will be done locally in each task. A physical +% memory server provides a number of guaranteed physical pages to tasks. +% It will also provide a number of excess pages (over-commit). The task +% might have to return any number of excess pages on short notice. If +% the task does not comply, all mappings are revoked (essentially +% killing the task). +% +% A problem arises when data has to be exchanged between a client and a +% server, and the server wants to have control over the content of the +% pages (for example, pass it on to other servers, like device drivers). +% The client can not map the pages directly into the servers address +% space, as it is not trusted. Container objects created in the +% physical memory server and mapped into the client and/or the servers +% address space will provide the necessary security features to allow +% this. This can be used for DMA and zero-copying in the data exchange +% between device drivers and (untrusted) user tasks. +% +%
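The single-writer scheme for the shared policy buffer described in the patch can be sketched as follows (a hypothetical C rendering: the struct layout, field names and function names are invented for illustration and are not the actual Hurd interface):

```c
#include <stdint.h>

/* Hypothetical layout of one entry in the shared policy buffer,
   indexed by task ID.  Each field has exactly one writer. */
struct policy_entry {
    /* Written only by the physical memory server. */
    uint32_t allocated;   /* pages currently allocated to the task */
    /* Written only by the memory policy server. */
    uint32_t guaranteed;  /* guaranteed page quota */
    uint32_t extra_ok;    /* non-zero: task is eligible for extra pages */
};

/* Because each field has a single writer, no lock is needed.  On an
   architecture where a field cannot be read in a single operation,
   read it twice until two consecutive reads agree. */
static uint32_t read_stable(volatile const uint32_t *field)
{
    uint32_t a, b;
    do {
        a = *field;
        b = *field;
    } while (a != b);
    return a;
}

/* Physical memory server side: may this task receive an extra page?
   Extra pages go only to tasks that have reached their guaranteed
   allocation and that the policy server has marked eligible. */
static int may_grant_extra(volatile const struct policy_entry *e)
{
    return read_stable(&e->extra_ok)
        && read_stable(&e->allocated) >= read_stable(&e->guaranteed);
}
```

The single-writer-per-field rule is what makes the lock-free claim in the text work: a reader can at worst observe a momentarily stale value, never a half-written one (given the double read), and neither server ever races another writer on the same field.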