\chapter{The POSIX personality}

The Hurd offers a POSIX API to the user by default.  This is
implemented in the GNU C library which uses the services provided by
the Hurd servers.  Several system servers support the C library.


\section{Authentication}
\label{auth}

Capabilities are a good way to give access to protected objects and
services.  They are flexible, lightweight and generic.  However, Unix
traditionally uses access control lists (ACLs) to restrict access to
objects like files.  Any task running with a certain user ID can
access all files that are readable by the user with that user ID.
Although all objects are implemented as capabilities in the Hurd, the
Hurd also supports the use of user IDs for access control.

The system authentication server \texttt{auth} implements the Unix
authentication scheme using capabilities.  It provides auth
capabilities, which are associated with a list of effective and
available user and group IDs.  The holder of such a capability can use
it to authenticate itself to other servers, using the protocol below.

Of course, these other servers must use (and trust) the same
\texttt{auth} server as the user.  Otherwise, the authentication will
fail.  Once a capability is authenticated in the server, the server
will know the user IDs of the client, and can use them to validate
further operations.

The \texttt{auth} server provides two types of capabilities:

\paragraph{Auth capabilities}
An auth capability is associated with four vectors of IDs: The
effective user and group IDs, which should be used by other servers to
authenticate operations that require certain user or group IDs, and
the available user and group IDs.  Available IDs should not be used
for authentication purposes, but can be turned into effective IDs by
the holder of an auth capability at any time.

New auth capabilities can be created from existing auth capabilities,
but only if the requested IDs are a subset of the union of the
(effective and available) IDs in the provided auth capabilities.  If
an auth capability has an effective or available user ID 0, then
arbitrary new auth objects can be created from it.
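
The creation rule can be illustrated with a minimal sketch in C.  It
is not the \texttt{auth} server's implementation: the structure
\verb/struct auth_ids/ and the function \verb/auth_may_create/ are
hypothetical names chosen for this example, and only user IDs are
modelled (group IDs would be checked in the same way).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory view of the user ID vectors of one auth
   capability.  Group IDs are omitted to keep the sketch short.  */
struct auth_ids {
    const unsigned *euids; size_t n_euids;   /* effective user IDs */
    const unsigned *auids; size_t n_auids;   /* available user IDs */
};

static bool vec_contains(const unsigned *v, size_t n, unsigned id)
{
    for (size_t i = 0; i < n; i++)
        if (v[i] == id)
            return true;
    return false;
}

/* True if ID appears among the effective or available IDs of A.  */
static bool holds_uid(const struct auth_ids *a, unsigned id)
{
    return vec_contains(a->euids, a->n_euids, id)
        || vec_contains(a->auids, a->n_auids, id);
}

/* Decide whether a new auth capability carrying the N_REQ requested
   user IDs may be created from the N_SRC provided capabilities.  */
bool auth_may_create(const struct auth_ids *src, size_t n_src,
                     const unsigned *req, size_t n_req)
{
    assert(src != NULL || n_src == 0);

    /* User ID 0 in any provided capability permits arbitrary IDs.  */
    for (size_t i = 0; i < n_src; i++)
        if (holds_uid(&src[i], 0))
            return true;

    /* Otherwise every requested ID must occur in the union.  */
    for (size_t r = 0; r < n_req; r++) {
        bool found = false;
        for (size_t i = 0; i < n_src && !found; i++)
            found = holds_uid(&src[i], req[r]);
        if (!found)
            return false;
    }
    return true;
}
```

Note that an available ID counts towards the union just like an
effective ID, which is what allows the holder to turn available IDs
into effective ones.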

\paragraph{Passport capabilities}
A passport capability can be created from an auth capability and is
only valid for the task that created it.  It can be provided to a
server in an authentication process (see below).  For the client, the
passport capability does not directly implement any useful operation.
For the server, it can be used to verify the identity of a user and
read out the effective user and group IDs.

The auth server should always create new passport objects for
different tasks, even if the underlying auth object is the same, so
that a task having the passport capability can not spy on other tasks
unless they were given the passport capability by that task.

\subsection{Authenticating a client to a server}

A client can authenticate itself to a server with the following
protocol:

\paragraph{Preconditions}
The client $C$ has an auth capability implemented by the \texttt{auth}
server $A$.  It also has a capability implemented by the server $S$.
It wants to reauthenticate this capability with the auth capability,
so the server associates the new user and group IDs with it.

The server also has an auth capability implemented by its trusted
\texttt{auth} server.  For the reauthentication to succeed, the
\texttt{auth} server of the client and the server must be identical.
If this is the case, the participating tasks hold task info caps for
all other participating tasks (because of the capabilities they hold).

\begin{enumerate}
\item The client $C$ requests a passport capability for itself by
  invoking its auth capability at $A$.

  \begin{comment}
    Normally, the client will request the passport capability only
    once and store it together with the auth capability.
  \end{comment}
  
\item The \texttt{auth} server receives the request and creates a new
  passport capability for this auth capability and this client.  The
  passport capability is returned to the client.
  
\item The client receives the reply from the \texttt{auth} server.
  
  It then sends the reauthentication request to the server $S$, which
  is invoked on the capability the client wants to reauthenticate.  It
  provides the passport capability as an argument.
  
\item The server $S$ accepts the passport capability only if it can
  verify that the capability is really implemented by the
  \texttt{auth} server it trusts.  If the client does not provide a
  passport capability implemented by the trusted \texttt{auth} server,
  the authentication process is aborted with an error.
  
  Now the server can send a request to the \texttt{auth} server to
  validate the passport capability.  The RPC is invoked on the
  passport capability.
  
\item The \texttt{auth} server receives the validation request on the
  passport capability and returns the task ID of the client $C$ that
  this passport belongs to, and the effective user and group IDs for
  the auth cap to which this passport cap belongs.

  \begin{comment}
    The Hurd on Mach returned the available IDs as well.  This feature
    is not used anywhere in the Hurd, and as the available IDs should
    not be used for authentication anyway, this does not seem to be
    useful.  If it is needed, it can be added in an extended version
    of the validation RPC.
  \end{comment}
  
\item The server receives the task ID and the effective user and group
  IDs.  The server now verifies that the task ID is the same as the
  task ID of the sender of the reauthentication request.  Only then
  was the reauthentication request made by the owner of the auth cap.
  It can then return a new capability authenticated with the new user
  and group IDs.

  \begin{comment}
    The verification of the client's task ID is necessary.  As the
    passport cap can be copied to other tasks, it can not serve as a
    proof of identity alone.  It is of course absolutely crucial that the
    server holds the task info cap for the client task $C$ for the
    whole time of the protocol.  But the same is actually true for any
    RPC, as the server needs to be sure that the reply message is sent
    to the sender thread (and not any imposter).
  \end{comment}
  
\item The client receives the reply with the new, reauthenticated
  capability.  Usually this capability is associated in the server
  with the same abstract object, but different user credentials.

  \begin{comment}
    Of course a new capability must be created.  Otherwise, all other
    users holding the same capability would be affected as well.
  \end{comment}

  The client can now deallocate the passport cap.

  \begin{comment}
    As said before, normally the passport cap is cached by the client
    for other reauthentications.
  \end{comment}
\end{enumerate}

\paragraph{Result}
The client $C$ has a new capability that is authenticated with the new
effective user and group IDs.  The server has obtained the effective
user and group IDs from the \texttt{auth} server it trusts.
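
The checks performed by the server $S$ in the protocol above can be
sketched as follows.  This is a simplified model under stated
assumptions, not real Hurd code: capabilities and task IDs are plain
integers, the validation RPC is replaced by the local stand-in
\verb/auth_validate/, and all type and function names are invented for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum reauth_result { REAUTH_OK, REAUTH_UNTRUSTED_AUTH, REAUTH_WRONG_TASK };

/* Hypothetical model of a passport capability as seen by S.  */
struct passport {
    int      auth_server_id;  /* which auth server implements the cap */
    uint32_t task_id;         /* task the passport was created for    */
    uint32_t euid, egid;      /* one effective ID each, for brevity   */
};

/* Reply of the validation RPC.  */
struct validation {
    uint32_t task_id;
    uint32_t euid, egid;
};

/* Stand-in for the validation RPC invoked on the passport cap.  */
static struct validation auth_validate(const struct passport *p)
{
    struct validation v = { p->task_id, p->euid, p->egid };
    return v;
}

/* Server side of the reauthentication request sent by SENDER_TASK.  */
enum reauth_result server_reauthenticate(int trusted_auth_id,
                                         uint32_t sender_task,
                                         const struct passport *p,
                                         struct validation *ids_out)
{
    assert(p != NULL && ids_out != NULL);

    /* Accept only passports implemented by the trusted auth server.  */
    if (p->auth_server_id != trusted_auth_id)
        return REAUTH_UNTRUSTED_AUTH;

    struct validation v = auth_validate(p);

    /* The passport may have been copied, so the task ID it names must
       match the sender of the reauthentication request.  */
    if (v.task_id != sender_task)
        return REAUTH_WRONG_TASK;

    *ids_out = v;   /* IDs to associate with the new capability */
    return REAUTH_OK;
}
```

The order of the checks mirrors the protocol: the trust check happens
before any RPC is sent to the \texttt{auth} server, and the task ID
comparison happens before any new capability is handed out.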

\begin{comment}
  The Hurd on Mach uses a different protocol, which is more complex
  and is vulnerable to DoS attacks.  The above protocol can not
  readily be used on Mach, because the sender task of a message can
  not be easily identified.
\end{comment}


\section{Process Management}
\label{proc}

The \texttt{proc} server implements Unix process semantics in the Hurd
system.  It will also assign a PID to each task that was created with
the \texttt{task} server, so that the owner of these tasks, and the
system administrator, can at least send the \verb/SIGKILL/ signal to
them.

The \texttt{proc} server uses the task manager capability from the
\texttt{task} server to get hold of the information about all tasks
and the task control caps.

\begin{comment}
  The \texttt{proc} server might also be the natural place to
  implement a first policy server for the \texttt{task} server.
\end{comment}


\subsection{Signals}
\label{signals}

Each process can register the thread ID of a signal thread with the
\texttt{proc} server.  The \texttt{proc} server will give the signal
thread ID to any other task which asks for it.

\begin{comment}
  The thread ID can be guessed, so there is no point in protecting it.
\end{comment}

The signal thread ID can then be used by a task to contact the task to
which it wants to send a signal.  The task must bootstrap its
connection with the intended receiver of the signal, according to the
protocol described in section \ref{ipcbootstrap} on page
\pageref{ipcbootstrap}.  As a result, it will receive the signal
capability of the receiving task.

The sender of a signal must then provide some capability that proves
that the sender is allowed to send the signal when a signal is posted
to the signal capability.  For example, the owner of the task control
cap is usually allowed to send any signal to it.  Other capabilities
might only give permission to send some types of signals.
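
One way a receiver might evaluate such a proof is sketched below.  The
representation is purely an assumption made for illustration: the
proof is reduced to a flag for the task control cap plus a bit mask of
permitted signals, and \verb/struct signal_proof/, \verb/sig_bit/ and
\verb/signal_permitted/ are hypothetical names.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of the capability presented by the sender.  */
struct signal_proof {
    bool     is_task_control_cap;  /* owner: any signal is allowed   */
    uint64_t allowed;              /* otherwise: mask of signal bits */
};

/* Bit representing signal SIGNO (valid for 1 <= SIGNO <= 64).  */
uint64_t sig_bit(int signo)
{
    return UINT64_C(1) << (signo - 1);
}

/* Receiver-side policy check for one posted signal.  */
bool signal_permitted(const struct signal_proof *proof, int signo)
{
    assert(proof != NULL);
    if (signo < 1 || signo > 64)
        return false;
    if (proof->is_task_control_cap)
        return true;
    return (proof->allowed & sig_bit(signo)) != 0;
}
```

Because the check runs in the receiving task, a different policy can
be substituted without involving any system server.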

\begin{comment}
  The receiver of the signal decides itself which signals to accept
  from which other tasks.  The default implementation in the C library
  provides POSIX semantics, plus some extensions.
\end{comment}

Signal handling is thus completely implemented locally in each task.
The \texttt{proc} server only serves as a name-server for the thread
IDs of the signal threads.
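
The name-server role can be pictured as a simple table from PIDs to
signal thread IDs.  The following sketch assumes a fixed-size table
and invented names (\verb/struct sigthread_table/,
\verb/proc_register_sigthread/, \verb/proc_lookup_sigthread/); it says
nothing about how the \texttt{proc} server actually stores this data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PROCS 64   /* arbitrary limit for this sketch */

struct sigthread_entry { int used; int pid; uint64_t thread_id; };
struct sigthread_table { struct sigthread_entry e[MAX_PROCS]; };

void sigthread_table_init(struct sigthread_table *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_PROCS; i++)
        t->e[i].used = 0;
}

/* Register (or replace) the signal thread of PID.  Returns 0 on
   success and -1 if the table is full.  */
int proc_register_sigthread(struct sigthread_table *t, int pid,
                            uint64_t thread_id)
{
    struct sigthread_entry *free_slot = NULL;
    for (size_t i = 0; i < MAX_PROCS; i++) {
        if (t->e[i].used && t->e[i].pid == pid) {
            t->e[i].thread_id = thread_id;
            return 0;
        }
        if (!t->e[i].used && free_slot == NULL)
            free_slot = &t->e[i];
    }
    if (free_slot == NULL)
        return -1;
    free_slot->used = 1;
    free_slot->pid = pid;
    free_slot->thread_id = thread_id;
    return 0;
}

/* Look up the signal thread of PID.  Any task may ask; the ID is not
   a secret.  Returns 0 if found, -1 otherwise.  */
int proc_lookup_sigthread(const struct sigthread_table *t, int pid,
                          uint64_t *thread_id_out)
{
    for (size_t i = 0; i < MAX_PROCS; i++)
        if (t->e[i].used && t->e[i].pid == pid) {
            *thread_id_out = t->e[i].thread_id;
            return 0;
        }
    return -1;
}
```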

\begin{comment}
  The \texttt{proc} server can not hold the signal capability itself,
  as it used to do in the implementation on Mach, as it does not trust
  the tasks implementing the capability.  But this is not a problem,
  as the sender and receiver of a signal can negotiate and bootstrap
  the connection without any further support by the \texttt{proc}
  server.
  
  Also, the \texttt{proc} server can not even hold task info caps to
  support the sender of a signal in bootstrapping the connection.
  This means that there is a race between looking up the signal thread
  ID from the PID in the \texttt{proc} server and acquiring a task
  info cap for the task ID of the signal receiver in the sender.
  However, in Unix, there is always a race when sending a signal using
  \verb/kill/.  The task server helps the users a bit here by not
  reusing task IDs as long as possible.
\end{comment}

Some signals are not implemented by sending a message to the task.
\verb/SIGKILL/, for example, destroys the task without contacting it
at all.  This feature is implemented in the \texttt{proc} server.

The signal capability is also used for other things, like the message
interface (which allows you to manipulate the environment variables
and \texttt{auth} capability of a running task, etc.).


\subsection{The \texttt{fork()} function}

To be written.


\subsection{The \texttt{exec} functions}
\label{exec}

The \texttt{exec} operation will be done locally in a task.
Traditionally, \texttt{exec} overlays the same task with a new
process image, because creating a new task and transferring the
associated state is expensive.  In L4, only the threads and virtual
memory mappings are actually kernel state associated with a task, and
exactly those have to be destroyed by \texttt{exec} anyway.  There
is a lot of Hurd specific state associated with a task (capabilities,
for example), but it is difficult to preserve that.  There are
security concerns, because POSIX programs do not know about Hurd
features like capabilities, so inheriting all capabilities across
\texttt{exec} unconditionally seems dangerous.

\begin{comment}
  One could think that if a program is not Hurd-aware, then it will
  not make any use of capabilities except through the normal POSIX
  API, and thus there are no capabilities except those that the GNU C
  library uses itself, which \texttt{exec} can take care of.
  However, this is only true if code that is not Hurd-aware is never
  mixed with Hurd specific code, even libraries (unless the library
  intimately cooperates with the GNU C library).  This would be a high
  barrier to enable Hurd features in otherwise portable programs and
  libraries.
  
  It is better to make all POSIX functions safe by default and allow
  for extensions to let the user specify which capabilities besides
  those used for file descriptors etc to be inherited by the new
  executable.
  
  For \verb/posix_spawn()/, this is straightforward.  For
  \texttt{exec}, it is not.  Either specific capabilities could be
  marked as ``do not close on \texttt{exec}'', or variants of the
  \texttt{exec} function could be provided which take further
  arguments.
\end{comment}

There are also implementation obstacles hindering the reuse of the
existing task.  Only local threads can manipulate the virtual memory
mappings, and there is a lot of local state that has to be kept
somewhere between the time the old program becomes defunct and the new
binary image is installed and used (not to speak of the actual program
snippet that runs during the transition).

So the decision was made to always create a new task with
\texttt{exec}, and copy the desired state from the current task to the
new task.  This is a clean solution, because a new task will always
start out without any capabilities in servers, etc, and thus there is
no need for the old task to try to destroy all unneeded capabilities
and other local state before \texttt{exec}.  Also, in case the
\texttt{exec} fails, the old program can continue to run, even if the
\texttt{exec} fails at a very late point (there is no ``point of no
return'' until the new task is actually up and running).

For suid and sgid applications, the actual \texttt{exec} has to be
done by the filesystem.  However, the filesystem can not be bothered
to also transfer all the user state into the new task.  It can not
even do that, because it can not accept capabilities implemented by
untrusted servers from the user.  Also, the filesystem does not want
to rely on the new task to be cooperative, because it does not
necessarily trust the code if it is owned by an untrusted user.

\begin{enumerate}
\item The user creates a new task and a container with a single
  physical page, and makes the \texttt{exec} call to the file
  capability, providing the task control capability.  Before that, it
  creates a task info capability from it for its own use.
\item The filesystem checks permission and then revokes all other
  users on the task control capability.  This will revoke the user's
  access to the task, and will fail if the user did not provide a
  pristine task object.  (It is assumed that the filesystem should not
  create the task itself, so that the user can not use suid/sgid
  applications to escape from their quota restriction.)
\item Then it revokes access to the provided physical page and writes
  trusted startup code to it.
\item The filesystem will also prepare all capability transactions and
  write the required information (together with other useful
  information) in a stack on the physical page.
\item Then it creates a thread in the task, and starts it.  At
  pagefault, it will provide the physical page.
\item The startup code on the physical page completes the capability
  transfer.  It will also install a small pager that can install file
  mappings for this binary image.  Then it jumps to the entry point.
\item The filesystem in the meanwhile has done all it can do to help
  the task startup.  It will provide the content of the binary or
  script via paging or file reads, but that happens asynchronously,
  and as for any other task.  So the filesystem returns to the client.
\item The client can then send its untrusted information to the new
  task.  The new task got the client's thread ID from the filesystem
  (possibly provided by the client), and thus knows to which thread it
  should listen.  The new task will not trust this information
  ultimately (i.e., the new task will use the authentication, root
  directory and other capabilities it got from the filesystem), but it
  will accept all capabilities and make proper use of them.
\item Then the new task will send a message to proc to take over the
  old PID and other process state.  How this can be done best is still
  to be determined (likely the old task will provide a process control
  capability to the new task).  At that moment, the old task is
  destroyed by the \texttt{proc} server.
\end{enumerate}

This is a coarse and incomplete description, but it shows the general
idea.  The details will depend a lot on the actual implementation.


\subsubsection{The startup information}

The following information is passed to the new task by the parent (the
filesystem in the suid case).  Every item is a machine word.

\begin{enumerate}
\item \texttt{magic}
  
  The first four bytes are \texttt{E}, \texttt{X}, \texttt{E},
  \texttt{C}.

\item \texttt{program header location}
\item \texttt{program header size}
  
  The location and size of the program header.  The meaning of this
  field depends on the binary format.

\item \texttt{feature flags}
  
  This bit-field indicates which of the following information is
  present.  If the information is not present, the corresponding
  machine words are undefined.  This provides simple version control.

  \begin{comment}
    They could also be undefined.
  \end{comment}

\item \texttt{wortel thread ID}
\item \texttt{wortel control cap ID}
  
  The thread ID of the \texttt{wortel} rootserver, and the local ID of
  the \texttt{wortel} control cap.  The \texttt{wortel} control cap
  allows the user to make privileged system calls.  This field is only
  present if the user has this capability.  Usually, this is only the
  case for some initial servers at bootstrap.

\item \texttt{physmem thread ID}
\item \texttt{physmem control cap ID}
  
  The thread ID of the physical memory server, and the local ID of the
  \texttt{physmem} control cap.  This cap can be used to manage the
  physical memory of this task.

\item \texttt{physmem startup page container cap ID}
  
  The container cap ID for the startup code, containing this
  information, the initial pager, and other startup code.  This
  container is mapped into the address space of the task outside of
  the actual program, and can be unmapped by the program after it has
  used this information and installed its own pager, by destroying
  this container, to reclaim the virtual address space and physical
  memory it occupies.

\item (More to come.)
\end{enumerate}
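
As an illustration, the items listed so far could be laid out as a C
structure of machine words.  The field names, the flag bit assignments
and the helper functions below are assumptions made for this sketch;
the text above only fixes the order of the items and the four magic
bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t word_t;   /* one machine word */

/* Hypothetical bit assignments for the feature flags.  */
#define STARTUP_HAS_WORTEL     ((word_t) 1 << 0)
#define STARTUP_HAS_PHYSMEM    ((word_t) 1 << 1)
#define STARTUP_HAS_CONTAINER  ((word_t) 1 << 2)

struct startup_info {
    word_t magic;                    /* first four bytes: E X E C    */
    word_t phdr_location;            /* program header location      */
    word_t phdr_size;                /* program header size          */
    word_t feature_flags;            /* which fields below are valid */
    word_t wortel_thread_id;
    word_t wortel_control_cap_id;
    word_t physmem_thread_id;
    word_t physmem_control_cap_id;
    word_t startup_container_cap_id;
};

/* Store the magic bytes at the start of the first word.  */
void startup_set_magic(struct startup_info *info)
{
    assert(info != NULL);
    info->magic = 0;
    memcpy(&info->magic, "EXEC", 4);
}

/* Check the first four bytes, independent of byte order.  */
bool startup_magic_ok(const struct startup_info *info)
{
    assert(info != NULL);
    return memcmp(&info->magic, "EXEC", 4) == 0;
}

/* A field may only be read if its feature flag is set.  */
bool startup_has(const struct startup_info *info, word_t flag)
{
    assert(info != NULL);
    return (info->feature_flags & flag) != 0;
}
```

A new task would check the magic bytes first and then consult
\verb/startup_has/ before reading any of the optional words, since
words without a set flag are undefined.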


\section{Unix Domain Sockets}
\label{unixdomainsockets}

In the Hurd on Mach, there was a global \texttt{pflocal} server that
provided Unix domain sockets and pipes to all users.  This will not
work very well in the Hurd on L4, because for descriptor passing
(read: capability passing), the Unix domain socket server needs to
accept capabilities in transit.  User capabilities are often
implemented by untrusted servers, though, and thus a global
\texttt{pflocal} server running as root can not accept them.

However, Unix domain sockets and pipes can not be implemented locally
in the task.  An external task is needed to hold buffered data and
capabilities in transit.  In theory, a new task could be used for
every pipe or Unix domain socketpair.  However, in practice, one
server for each user would suffice and perform better.

This works, because access to Unix Domain Sockets is controlled via
the filesystem, and access to pipes is controlled via file
descriptors, usually by inheritance.  For example, if a fifo is
installed as a passive translator in the filesystem, the first user
accessing it will create a pipe in his pflocal server.  From then on,
an active translator must be installed in the node that redirects any
other users to the right pflocal server implementing this fifo.  This
is asymmetrical in that the first user to access a fifo will implement
it, and thus pay the costs for it.  But it does not seem to cause any
particular problems in implementing the POSIX semantics.

The GNU C library can contact \verb|~/servers/socket/pflocal| to
implement \texttt{socketpair()}, or start a \texttt{pflocal} server
for this task's exclusive use if that node does not exist.

All of these are optimizations: It should work to have one
\texttt{pflocal} process for each socketpair.  However, performance
should be better with a shared \texttt{pflocal} server, one per user.


\section{Pipes}

Pipes are implemented using \texttt{socketpair()}, that is, as an
unnamed pair of Unix Domain Sockets.  The \texttt{pflocal} server will
support this by implementing pipe semantics on the socketpair if
requested.
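
From the point of view of a POSIX program, the idea can be sketched
with the standard \texttt{socketpair()} and \texttt{shutdown()} calls.
This is only an approximation at the API level: it shows how a
unidirectional channel can be obtained from a socket pair, and it is
an assumption of this example (not a statement about \texttt{pflocal})
that shutting down one direction on each end is how pipe semantics
would be requested.  The function name \verb/pipe_via_socketpair/ is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a pipe-like channel: fds[0] is the read end and fds[1] the
   write end.  Returns 0 on success and -1 on error.  */
int pipe_via_socketpair(int fds[2])
{
    assert(fds != NULL);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    /* Make the channel one-way: the read end may not write and the
       write end may not read.  */
    if (shutdown(fds[0], SHUT_WR) < 0 || shutdown(fds[1], SHUT_RD) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return 0;
}
```

Data written to \verb/fds[1]/ can then be read from \verb/fds[0]/ with
the usual \texttt{read()} and \texttt{write()} calls, and closing the
write end yields end-of-file on the read end.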

\begin{comment}
  It was considered to use shared memory for the pipe implementation.
  But we are not aware of a lock-free protocol using shared memory
  with multiple readers and multiple writers.  It might be possible,
  but it is not obvious if that would be faster: Pipes are normally
  used with \texttt{read()} and \texttt{write()}, so the data has to
  be copied from and to the supplied buffer.  This can be done
  efficiently in L4 even across address spaces using string items.  In
  the implementation using sockets, the \texttt{pflocal} server
  handles concurrent read and write accesses with mutual exclusion.
\end{comment}


\section{Filesystems}

\subsection{Directory lookup across filesystems}
\label{xfslookup}

The Hurd has the ability to let users mount filesystems and other
servers providing a filesystem-like interface.  Such filesystem
servers are called translators.  In the Hurd on GNU Mach, the parent
filesystem would automatically start up such translators from passive
translator settings in the inode.  It would then block until the child
filesystem sends a message to its bootstrap port (provided by the
parent fs) with its root directory port.  This root directory port can
then be given to any client looking up the translated node.

There are several things wrong with this scheme, which become
apparent in the Hurd on L4.  The parent filesystem must be careful to
not block on creating the child filesystem task.  It must also be
careful to not block on receiving any acknowledgement or startup
message from it.  Furthermore, it can not accept the root directory
capability from the child filesystem and forward it to clients, as
they are potentially not trusted.

The latter problem can be solved in the following way: The filesystem
knows about the server thread in the child filesystem.  It also
implements an authentication capability that represents the ability to
access the child filesystem.  This capability is also given to the
child filesystem at startup (or when it attaches itself to the parent
filesystem).  On client dir\_lookup, the parent filesystem can return
the server\_thread and the authentication capability to the client.
The client can use that to initiate a connection with the child
filesystem (by first building up a connection, then sending the
authentication capability from the parent filesystem, and receiving a
root directory capability in exchange).
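
The client side of this exchange can be modelled in a few lines of C.
The sketch makes strong simplifying assumptions: capabilities and
thread IDs are plain integers, the connection setup is reduced to a
thread ID comparison, and every name (\verb/struct lookup_reply/,
\verb/child_handshake/, \verb/resolve_lookup/) is invented for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum lookup_kind { LOOKUP_NODE, LOOKUP_REDIRECT };

/* Hypothetical reply of the parent filesystem to a directory lookup.  */
struct lookup_reply {
    enum lookup_kind kind;
    uint32_t node_cap;        /* valid for LOOKUP_NODE     */
    uint32_t server_thread;   /* valid for LOOKUP_REDIRECT */
    uint32_t auth_cap;        /* valid for LOOKUP_REDIRECT */
};

/* Hypothetical state of the child filesystem.  */
struct child_fs {
    uint32_t thread_id;       /* thread serving the handshake      */
    uint32_t auth_cap;        /* cap received from the parent      */
    uint32_t root_dir_cap;    /* handed out to authorized clients  */
};

/* Child side: exchange the parent's auth cap for the root directory.  */
int child_handshake(const struct child_fs *child, uint32_t auth_cap,
                    uint32_t *root_out)
{
    assert(child != NULL && root_out != NULL);
    if (auth_cap != child->auth_cap)
        return -1;
    *root_out = child->root_dir_cap;
    return 0;
}

/* Client side: use the node directly, or follow the redirection.  */
int resolve_lookup(const struct lookup_reply *reply,
                   const struct child_fs *child, uint32_t *cap_out)
{
    assert(reply != NULL && cap_out != NULL);
    if (reply->kind == LOOKUP_NODE) {
        *cap_out = reply->node_cap;
        return 0;
    }
    /* Connect only to the thread named by the parent filesystem.  */
    if (child == NULL || child->thread_id != reply->server_thread)
        return -1;
    return child_handshake(child, reply->auth_cap, cap_out);
}
```

The parent filesystem never touches the child's root directory
capability in this model; it only names the server thread and issues
the capability used for authentication.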

\begin{comment}
  There is a race here.  If the child filesystem dies and the parent
  filesystem processes the task death notification and releases the
  task info cap for the child before the user acquires its own task
  info cap for the child, then an imposter might be able to pretend to
  be the child filesystem for the client.
  
  This race can only be avoided by a more complex protocol:
  
  Variant 1: The user has to acquire the task info cap for the child
  fs, and then it has to perform the lookup again.  If then the thread
  ID is for the task it got the task ID for in advance, it can go on.
  If not, it has to retry.  This is not so good because a directory
  lookup is usually an expensive operation.  However, it has the
  advantage of only slowing down the rare case.
  
  Variant 2: The client creates an empty reference container in the
  task server, which can then be used by the server to fill in a
  reference to the child's task ID.  However, the client has to create
  and destroy such a container for every filesystem where it expects
  it could be redirected to another (that means: for all filesystems
  for which it does not use \verb/O_NOTRANS/).  This is quite an
  overhead to the common case.

\begin{verbatim}
<marcus> I have another idea
<marcus> the client does not give a container
<marcus> server sees child fs, no container -> returns O_NOTRANS node
<marcus> then client sees error, uses O_NOTRANS node, "" and container
<marcus> problem solved
<marcus> this seems to be the optimum
<neal> hmm.
<neal> So lazily supply a container.
<marcus> yeah
<neal> Hoping you won't need one.
<marcus> and the server helps you by doing as much as it can usefully
<neal> And that is the normal case.
<neal> Yeah, that seems reasonable.
<marcus> the trick is that the server won't fail completely
<marcus> it will give you at least the underlying node
\end{verbatim}
\end{comment}

The actual creation of the child filesystem can be performed much like
a suid \texttt{exec}, just without any client to follow up with
further capabilities and startup info.  The only problem that remains
is how the parent filesystem can know which thread in the child
filesystem implements the initial handshake protocol for the clients
to use.  The only safe way here seems to be that the parent filesystem
requires the child to use the main thread for that, or that the parent
filesystem creates a second thread in the child at startup (passing
its thread ID in the startup data), requiring that this second thread
is used.  In either case the parent filesystem will know the thread ID
in advance because it created the thread in the first place.  This
looks a bit ugly, and violates good taste, so we might try to look for
alternative solutions.


\subsection{Reparenting}
\label{reparenting}

The Hurd on Mach contains a curious RPC, \verb/file_reparent/, which
allows you to create a new capability for the same node, with the
difference that the new node will have a supplied capability as its
parent node.  A directory lookup of \texttt{..} on this new capability
would return the provided parent capability.

This function is used by the \texttt{chroot()} function, which sets
the parent node to the null capability to prevent escape from a
\texttt{chroot()} environment.  It is also used by the
\texttt{firmlink} translator, which is a cross between a symbolic link
and a hard link: It works like a hard link, but can be used across
filesystems.

A firmlink is a dangerous thing.  Because the filesystem will give no
indication if the parent node it returns is provided by itself or some
other, possibly untrusted filesystem, the user might follow the parent
node to untrusted filesystems without being aware of it.

In the Hurd port to L4, the filesystem can not accept untrusted parent
capabilities on behalf of the user anymore.  The \texttt{chroot()}
function is not difficult to implement anyway, as no real capability
is required.  The server can just be instructed to create a node with
no parent node, and it can do that without problems.  Nevertheless, we
also want a secure version of the \texttt{firmlink} translator.  This
is possible if the same strategy is used as in cross filesystem
lookups.  The client registers a server thread as the handler for the
parent node, and the filesystem returns a capability that can be used
for authentication purposes.  Now, the client still needs to connect
this to the new parent node.  Normally, the filesystem providing the
new parent node will also not trust the other filesystem, and thus can
not accept the capability that should be used for authentication
purposes.  So instead of creating a direct link from the one filesystem
to the other, the \texttt{firmlink} translator must act as a middle man, and
redirect all accesses to the parent node first to itself, and then to
the filesystem providing the parent node.  For this, it must request a
capability from that filesystem that can be used for authentication
purposes when bootstrapping a connection, and that allows such a
bootstrapping client to access the parent node directly.

This also fixes the security issues, because now any move away from
the filesystem providing the reparented node will explicitly go first
to the \texttt{firmlink} translator, and then to the filesystem
providing the parent node.  The user can thus make an informed
decision if it trusts the \texttt{firmlink} translator and the
filesystem providing the parent node.

\begin{comment}
  This is a good example where the redesign of the IPC system forces
  us to fix a security issue and provides a deeper insight into the
  trust issues and how to solve them.
\end{comment}