\chapter{The POSIX personality} The Hurd offers a POSIX API to the user by default. This is implemented in the GNU C library which uses the services provided by the Hurd servers. Several system servers support the C library. \section{Authentication} \label{auth} Capabilities are a good way to give access to protected objects and services. They are flexible, lightweight and generic. However, Unix traditionally uses access control lists (ACL) to restrict access to objects like files. Any task running with a certain user ID can access all files that are readable for the user with that user ID. Although all objects are implemented as capabilities in the Hurd, the Hurd also supports the use of user IDs for access control. The system authentication server \texttt{auth} implements the Unix authentication scheme using capabilities. It provides auth capabilities, which are associated with a list of effective and available user and group IDs. The holder of such a capability can use it to authenticate itself to other servers, using the protocol below. Of course, these other servers must use (and trust) the same \texttt{auth} server as the user. Otherwise, the authentication will fail. Once a capability is authenticated in the server, the server will know the user IDs of the client, and can use them to validate further operations. The \texttt{auth} server provides two types of capabilities: \paragraph{Auth capabilities} An auth capability is associated with four vectors of IDs: The effective user and group IDs, which should be used by other servers to authenticate operations that require certain user or group IDs, and the available user and group IDs. Available IDs should not be used for authentication purposes, but can be turned into effective IDs by the holder of an auth capability at any time. 
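The four ID vectors can be pictured with a small C sketch. It is only an illustration under stated assumptions: the names \verb/id_vec/, \verb/auth_cap/, \verb/auth_has_effective_uid/ and \verb/auth_make_uid_effective/ are hypothetical and not part of any Hurd interface, the fixed bound \verb/MAX_IDS/ is arbitrary, and the sketch assumes that turning an available ID into an effective one moves the ID rather than copying it.

\begin{verbatim}
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_IDS 16              /* arbitrary bound chosen for this sketch */

/* Hypothetical ID vector; not a Hurd data structure.  */
struct id_vec {
  size_t n;
  unsigned ids[MAX_IDS];
};

/* Hypothetical picture of the state behind one auth capability.  */
struct auth_cap {
  struct id_vec euids, egids;   /* effective: used by servers for checks */
  struct id_vec auids, agids;   /* available: never used for checks */
};

static bool id_vec_contains(const struct id_vec *v, unsigned id)
{
  for (size_t i = 0; i < v->n; i++)
    if (v->ids[i] == id)
      return true;
  return false;
}

/* What a server may ask: is UID among the effective user IDs?  */
static bool auth_has_effective_uid(const struct auth_cap *a, unsigned uid)
{
  assert(a != NULL);
  return id_vec_contains(&a->euids, uid);
}

/* Holder-side operation: turn an available user ID into an effective one.
   Assumption of this sketch: the ID is moved, not copied.  Returns false
   if UID is not available or the effective vector is full.  */
static bool auth_make_uid_effective(struct auth_cap *a, unsigned uid)
{
  assert(a != NULL);
  if (a->euids.n == MAX_IDS)
    return false;
  for (size_t i = 0; i < a->auids.n; i++)
    if (a->auids.ids[i] == uid)
      {
        a->auids.ids[i] = a->auids.ids[--a->auids.n];  /* drop from available */
        a->euids.ids[a->euids.n++] = uid;              /* add to effective */
        return true;
      }
  return false;
}
\end{verbatim}

The point of the split is visible in \verb/auth_has_effective_uid/: a check only ever consults the effective vectors, so an available ID grants nothing until its holder explicitly promotes it.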
New auth capabilities can be created from existing auth capabilities, but only if the requested IDs are a subset of the union of the (effective and available) IDs in the provided auth capabilities. If an auth capability has an effective or available user ID 0, then arbitrary new auth objects can be created from it.

\paragraph{Passport capabilities} A passport capability can be created from an auth capability and is only valid for the task that created it. It can be provided to a server in an authentication process (see below). For the client, the passport capability does not directly implement any useful operation. For the server, it can be used to verify the identity of a user and read out the effective user and group IDs. The \texttt{auth} server should always create new passport objects for different tasks, even if the underlying auth object is the same, so that a task having the passport capability can not spy on other tasks unless they were given the passport capability by that task.

\subsection{Authenticating a client to a server}
A client can authenticate itself to a server with the following protocol:

\paragraph{Preconditions} The client $C$ has an auth capability implemented by the \texttt{auth} server $A$. It also has a capability implemented by the server $S$. It wants to reauthenticate this capability with the auth capability, so the server associates the new user and group IDs with it. The server also has an auth capability implemented by its trusted \texttt{auth} server. For the reauthentication to succeed, the \texttt{auth} server of the client and the server must be identical. If this is the case, the participating tasks hold task info caps for all other participating tasks (because of the capabilities they hold).
\begin{enumerate}
\item The client $C$ requests the passport capability for itself from the auth capability implemented by $A$.
\begin{comment}
Normally, the client will request the passport capability only once and store it together with the auth capability.
\end{comment}
\item The \texttt{auth} server receives the request and creates a new passport capability for this auth capability and this client. The passport capability is returned to the client.
\item The client receives the reply from the \texttt{auth} server. It then sends the reauthentication request to the server $S$, which is invoked on the capability the client wants to reauthenticate. It provides the passport capability as an argument.
\item The server $S$ can accept the passport capability if it verifies that it is really implemented by the \texttt{auth} server it trusts. If the client does not provide a passport capability implemented by the trusted \texttt{auth} server, the authentication process is aborted with an error. Now the server can send a request to the \texttt{auth} server to validate the passport capability. The RPC is invoked on the passport capability.
\item The \texttt{auth} server receives the validation request on the passport capability and returns the task ID of the client $C$ that this passport belongs to, and the effective user and group IDs for the auth cap to which this passport cap belongs.
\begin{comment}
The Hurd on Mach returned the available IDs as well. This feature is not used anywhere in the Hurd, and as the available IDs should not be used for authentication anyway, this does not seem to be useful. If it is needed, it can be added in an extended version of the validation RPC.
\end{comment}
\item The server receives the task ID and the effective user and group IDs. The server now verifies that the task ID is the same as the task ID of the sender of the reauthentication request. Only if they match was the reauthentication request made by the owner of the auth cap. The server can then return a new capability authenticated with the new user and group IDs.
\begin{comment}
The verification of the client's task ID is necessary.
As the passport cap is copied to other tasks, it can not serve as a proof of identity alone. It is of course absolutely crucial that the server holds the task info cap for the client task $C$ for the whole time of the protocol. But the same is actually true for any RPC, as the server needs to be sure that the reply message is sent to the sender thread (and not any imposter). \end{comment} \item The client receives the reply with the new, reauthenticated capability. Usually this capability is associated in the server with the same abstract object, but different user credentials. \begin{comment} Of course a new capability must be created. Otherwise, all other users holding the same capability would be affected as well. \end{comment} The client can now deallocate the passport cap. \begin{comment} As said before, normally the passport cap is cached by the client for other reauthentications. \end{comment} \end{enumerate} \paragraph{Result} The client $C$ has a new capability that is authenticated with the new effective user and group IDs. The server has obtained the effective user and group IDs from the \texttt{auth} server it trusts. \begin{comment} The Hurd on Mach uses a different protocol, which is more complex and is vulnerable to DoS attacks. The above protocol can not readily be used on Mach, because the sender task of a message can not be easily identified. \end{comment} \section{Process Management} \label{proc} The \texttt{proc} server implements Unix process semantics in the Hurd system. It will also assign a PID to each task that was created with the \texttt{task} server, so that the owner of these tasks, and the system administrator, can at least send the \verb/SIGKILL/ signal to them. The \texttt{proc} server uses the task manager capability from the \texttt{task} server to get hold of the information about all tasks and the task control caps. 
\begin{comment} The \texttt{proc} server might also be the natural place to implement a first policy server for the \texttt{task} server. \end{comment} \subsection{Signals} \label{signals} Each process can register the thread ID of a signal thread with the \texttt{proc} server. The proc server will give the signal thread ID to any other task which asks for it. \begin{comment} The thread ID can be guessed, so there is no point in protecting it. \end{comment} The signal thread ID can then be used by a task to contact the task to which it wants to send a signal. The task must bootstrap its connection with the intended receiver of the signal, according to the protocol described in section \ref{ipcbootstrap} on page \pageref{ipcbootstrap}. As a result, it will receive the signal capability of the receiving task. The sender of a signal must then provide some capability that proves that the sender is allowed to send the signal when a signal is posted to the signal capability. For example, the owner of the task control cap is usually allowed to send any signal to it. Other capabilities might only give permission to send some types of signals. \begin{comment} The receiver of the signal decides itself which signals to accept from which other tasks. The default implementation in the C library provides POSIX semantics, plus some extensions. \end{comment} Signal handling is thus completely implemented locally in each task. The \texttt{proc} server only serves as a name-server for the thread IDs of the signal threads. \begin{comment} The \texttt{proc} server can not hold the signal capability itself, as it used to do in the implementation on Mach, as it does not trust the tasks implementing the capability. But this is not a problem, as the sender and receiver of a signal can negotiate and bootstrap the connection without any further support by the \texttt{proc} server. 
Also, the \texttt{proc} server can not even hold task info caps to support the sender of a signal in bootstrapping the connection. This means that there is a race between looking up the signal thread ID from the PID in the \texttt{proc} server and acquiring a task info cap for the task ID of the signal receiver in the sender. However, in Unix, there is always a race when sending a signal using \verb/kill/. The \texttt{task} server helps the users a bit here by not reusing task IDs for as long as possible.
\end{comment}
Some signals are not implemented by sending a message to the task. \verb/SIGKILL/, for example, destroys the task without contacting it at all. This feature is implemented in the \texttt{proc} server. The signal capability is also used for other things, like the message interface (which allows you to manipulate the environment variables and \texttt{auth} capability of a running task, etc).

\subsection{The \texttt{fork()} function}
To be written.

\subsection{The \texttt{exec} functions} \label{exec}
The \texttt{exec} operation will be done locally in a task. Traditionally, \texttt{exec} overlays the same task with a new process image, because creating a new task and transferring the associated state is expensive. In L4, only the threads and virtual memory mappings are actually kernel state associated with a task, and exactly those have to be destroyed by \texttt{exec} anyway. There is a lot of Hurd-specific state associated with a task (capabilities, for example), but it is difficult to preserve that. There are security concerns, because POSIX programs do not know about Hurd features like capabilities, so inheriting all capabilities across \texttt{exec} unconditionally seems dangerous.
\begin{comment}
One could think that if a program is not Hurd-aware, then it will not make any use of capabilities except through the normal POSIX API, and thus there are no capabilities except those that the GNU C library uses itself, which \texttt{exec} can take care of.
However, this is only true if code that is not Hurd-aware is never mixed with Hurd-specific code, even libraries (unless the library intimately cooperates with the GNU C library). This would be a high barrier to enabling Hurd features in otherwise portable programs and libraries. It is better to make all POSIX functions safe by default and allow for extensions to let the user specify which capabilities, besides those used for file descriptors etc., are to be inherited by the new executable. For \verb/posix_spawn()/, this is straightforward. For \texttt{exec}, it is not. Either specific capabilities could be marked as ``do not close on \texttt{exec}'', or variants of the \texttt{exec} function could be provided which take further arguments.
\end{comment}
There are also implementation obstacles hindering the reuse of the existing task. Only local threads can manipulate the virtual memory mappings, and there is a lot of local state that has to be kept somewhere between the time the old program becomes defunct and the new binary image is installed and used (not to speak of the actual program snippet that runs during the transition). So the decision was made to always create a new task with \texttt{exec}, and copy the desired state from the current task to the new task. This is a clean solution, because a new task will always start out without any capabilities in servers, etc, and thus there is no need for the old task to try to destroy all unneeded capabilities and other local state before \texttt{exec}. Also, in case the \texttt{exec} fails, the old program can continue to run, even if the \texttt{exec} fails at a very late point (there is no ``point of no return'' until the new task is actually up and running). For suid and sgid applications, the actual \texttt{exec} has to be done by the filesystem. However, the filesystem can not be bothered to also transfer all the user state into the new task.
It can not even do that, because it can not accept capabilities implemented by untrusted servers from the user. Also, the filesystem does not want to rely on the new task to be cooperative, because it does not necessarily trust the code if it is owned by an untrusted user. The suid \texttt{exec} therefore proceeds roughly as follows:
\begin{enumerate}
\item The user creates a new task and a container with a single physical page, and makes the \texttt{exec} call to the file capability, providing the task control capability. Before that, it creates a task info capability from it for its own use.
\item The filesystem checks permission and then revokes all other users on the task control capability. This will revoke the user's access to the task, and will fail if the user did not provide a pristine task object. (It is assumed that the filesystem should not create the task itself, so that the user can not use suid/sgid applications to escape from their quota restrictions.)
\item Then it revokes access to the provided physical page and writes trusted startup code to it.
\item The filesystem will also prepare all capability transactions and write the required information (together with other useful information) in a stack on the physical page.
\item Then it creates a thread in the task, and starts it. At page fault, it will provide the physical page.
\item The startup code on the physical page completes the capability transfer. It will also install a small pager that can install file mappings for this binary image. Then it jumps to the entry point.
\item The filesystem in the meantime has done all it can do to help the task startup. It will provide the content of the binary or script via paging or file reads, but that happens asynchronously, and as for any other task. So the filesystem returns to the client.
\item The client can then send its untrusted information to the new task. The new task got the client's thread ID from the filesystem (possibly provided by the client), and thus knows to which thread it should listen.
The new task will not trust this information ultimately (i.e., the new task will use the authentication, root directory and other capabilities it got from the filesystem), but it will accept all capabilities and make proper use of them.
\item Then the new task will send a message to \texttt{proc} to take over the old PID and other process state. How this can be done best is still to be determined (likely the old task will provide a process control capability to the new task). At that moment, the old task is destroyed by the \texttt{proc} server.
\end{enumerate}
This is a coarse and incomplete description, but it shows the general idea. The details will depend a lot on the actual implementation.

\subsubsection{The startup information}
The following information is passed to the new task by the parent (the filesystem in the suid case). Every item is a machine word.
\begin{enumerate}
\item \texttt{magic} The first four bytes are \texttt{E}, \texttt{X}, \texttt{E}, \texttt{C}.
\item \texttt{program header location}
\item \texttt{program header size} The location and size of the program header. The meaning of these fields depends on the binary format.
\item \texttt{feature flags} This bit-field indicates which of the following items are present. If the information is not present, the corresponding machine words are undefined. This provides simple version control.
\begin{comment}
They could also be undefined.
\end{comment}
\item \texttt{wortel thread ID}
\item \texttt{wortel control cap ID} The thread ID of the \texttt{wortel} rootserver, and the local ID of the \texttt{wortel} control cap. The \texttt{wortel} control cap allows the user to make privileged system calls. This field is only present if the user has this capability. Usually, this is only the case for some initial servers at bootstrap.
\item \texttt{physmem thread ID}
\item \texttt{physmem control cap ID} The thread ID of the physical memory server, and the local ID of the \texttt{physmem} control cap.
This cap can be used to manage the physical memory of this task.
\item \texttt{physmem startup page container cap ID} The ID of the container cap for the startup page, which contains this information, the initial pager, and other startup code. This container is mapped into the address space of the task outside of the actual program. Once the program has used this information and installed its own pager, it can destroy this container to unmap it and reclaim the virtual address space and physical memory it occupies.
\item (More to come.)
\end{enumerate}

\section{Unix Domain Sockets} \label{unixdomainsockets}
In the Hurd on Mach, there was a global \texttt{pflocal} server that provided Unix domain sockets and pipes to all users. This will not work very well in the Hurd on L4, because for descriptor passing, read: capability passing, the Unix domain socket server needs to accept capabilities in transit. User capabilities are often implemented by untrusted servers, though, and thus a global \texttt{pflocal} server running as root can not accept them. However, Unix domain sockets and pipes can not be implemented locally in the task. An external task is needed to hold buffered data and capabilities in transit. In theory, a new task could be used for every pipe or Unix domain socketpair. However, in practice, one server for each user would suffice and perform better. This works because access to Unix domain sockets is controlled via the filesystem, and access to pipes is controlled via file descriptors, usually by inheritance. For example, if a fifo is installed as a passive translator in the filesystem, the first user accessing it will create a pipe in that user's \texttt{pflocal} server. From then on, an active translator must be installed in the node that redirects any other users to the right \texttt{pflocal} server implementing this fifo. This is asymmetrical in that the first user to access a fifo will implement it, and thus pay the costs for it.
But it does not seem to cause any particular problems in implementing the POSIX semantics. The GNU C library can contact \verb|~/servers/socket/pflocal| to implement \texttt{socketpair()}, or start a \texttt{pflocal} server for this task's exclusive use if that node does not exist. All of these are optimizations: it should work to have one \texttt{pflocal} process for each socketpair. However, performance should be better with a shared \texttt{pflocal} server, one per user.

\section{Pipes}
Pipes are implemented using \texttt{socketpair()}, that is, as an unnamed pair of Unix domain sockets. The \texttt{pflocal} server will support this by implementing pipe semantics on the socketpair if requested.
\begin{comment}
Using shared memory for the pipe implementation was considered. But we are not aware of a lock-free protocol using shared memory with multiple readers and multiple writers. It might be possible, but it is not obvious if that would be faster: pipes are normally used with \texttt{read()} and \texttt{write()}, so the data has to be copied from and to the supplied buffer. This can be done efficiently in L4 even across address spaces using string items. In the implementation using sockets, the \texttt{pflocal} server handles concurrent read and write accesses with mutual exclusion.
\end{comment}

\section{Filesystems}

\subsection{Directory lookup across filesystems} \label{xfslookup}
The Hurd has the ability to let users mount filesystems and other servers providing a filesystem-like interface. Such filesystem servers are called translators. In the Hurd on GNU Mach, the parent filesystem would automatically start up such translators from passive translator settings in the inode. It would then block until the child filesystem sends a message to its bootstrap port (provided by the parent fs) with its root directory port. This root directory port can then be given to any client looking up the translated node. There are several things wrong with this scheme, which become apparent in the Hurd on L4.
The parent filesystem must be careful to not block on creating the child filesystem task. It must also be careful to not block on receiving any acknowledgement or startup message from it. Furthermore, it can not accept the root directory capability from the child filesystem and forward it to clients, as they are potentially not trusted. The latter problem can be solved the following way: The filesystem knows about the server thread in the child filesystem. It also implements an authentication capability that represents the ability to access the child filesystem. This capability is also given to the child filesystem at startup (or when it attaches itself to the parent filesystem). On client dir\_lookup, the parent filesystem can return the server\_thread and the authentication capability to the client. The client can use that to initiate a connection with the child filesystem (by first building up a connection, then sending the authentication capability from the parent filesystem, and receiving a root directory capability in exchange). \begin{comment} There is a race here. If the child filesystem dies and the parent filesystem processes the task death notification and releases the task info cap for the child before the user acquires its own task info cap for the child, then an imposter might be able to pretend to be the child filesystem for the client. This race can only be avoided by a more complex protocol: Variant 1: The user has to acquire the task info cap for the child fs, and then it has to perform the lookup again. If then the thread ID is for the task it got the task ID for in advance, it can go on. If not, it has to retry. This is not so good because a directory lookup is usually an expensive operation. However, it has the advantage of only slowing down the rare case. Variant 2: The client creates an empty reference container in the task server, which can then be used by the server to fill in a reference to the child's task ID. 
However, the client has to create and destroy such a container for every filesystem where it expects it could be redirected to another (that means: for all filesystems for which it does not use \verb/O_NOTRANS/). This is quite an overhead for the common case.
\begin{verbatim}
I have another idea
the client does not give a container
server sees child fs, no container -> returns O_NOTRANS node
then client sees error, uses O_NOTRANS node, "" and container
problem solved
this seems to be the optimum
hmm.
So lazily supply a container.
yeah
Hoping you won't need one.
and the server helps you by doing as much as it can usefully
And that is the normal case.
Yeah, that seems reasonable.
the trick is that the server won't fail completely
it will give you at least the underlying node
\end{verbatim}
\end{comment}
The actual creation of the child filesystem can be performed much like a suid \texttt{exec}, just without any client to follow up with further capabilities and startup info. The only problem that remains is how the parent filesystem can know which thread in the child filesystem implements the initial handshake protocol for the clients to use. The only safe way here seems to be that the parent filesystem requires the child to use the main thread for that, or that the parent filesystem creates a second thread in the child at startup (passing its thread ID in the startup data), requiring that this second thread is used. In either case the parent filesystem will know the thread ID in advance because it created the thread in the first place. This looks a bit ugly, and violates good taste, so we might try to look for alternative solutions.

\subsection{Reparenting} \label{reparenting}
The Hurd on Mach contains a curious RPC, \verb/file_reparent/, which allows you to create a new capability for the same node, with the difference that the new node will have a supplied capability as its parent node.
A directory lookup of \texttt{..} on this new capability would return the provided parent capability. This RPC is used by the \texttt{chroot()} function, which sets the parent node to the null capability to prevent escape from a \texttt{chroot()} environment. It is also used by the \texttt{firmlink} translator, which is a cross between a symbolic and a hard link: it works like a hard link, but can be used across filesystems. A firmlink is a dangerous thing. Because the filesystem will give no indication whether the parent node it returns is provided by itself or some other, possibly untrusted filesystem, the user might follow the parent node to untrusted filesystems without being aware of it. In the Hurd port to L4, the filesystem can not accept untrusted parent capabilities on behalf of the user anymore. The \texttt{chroot()} function is not difficult to implement anyway, as no real capability is required. The server can just be instructed to create a node with no parent node, and it can do that without problems. Nevertheless, we also want a secure version of the \texttt{firmlink} translator. This is possible if the same strategy is used as in cross-filesystem lookups. The client registers a server thread as the handler for the parent node, and the filesystem returns a capability that can be used for authentication purposes. Now, the client still needs to connect this to the new parent node. Normally, the filesystem providing the new parent node will also not trust the other filesystem, and thus can not accept the capability that should be used for authentication purposes. So instead of creating a direct link from the one filesystem to the other, the \texttt{firmlink} translator must act as a middle man, and redirect all accesses to the parent node first to itself, and then to the filesystem providing the parent node.
For this, it must request a capability from that filesystem that can be used for authentication purposes when bootstrapping a connection and that allows such a bootstrapping client to access the parent node directly. This also fixes the security issues, because now any move away from the filesystem providing the reparented node will explicitly go first to the \texttt{firmlink} translator, and then to the filesystem providing the parent node. The user can thus make an informed decision whether to trust the \texttt{firmlink} translator and the filesystem providing the parent node.
\begin{comment}
This is a good example where the redesign of the IPC system forces us to fix a security issue and provides a deeper insight into the trust issues and how to solve them.
\end{comment}
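
The effect of reparenting on a lookup of \texttt{..} can be illustrated with a small C model. This is a sketch under stated assumptions, not Hurd code: \verb/node_reparent/, \verb/node_dotdot/ and the \verb/provider/ labels are hypothetical names, a plain pointer stands in for a parent capability, and the sketch assumes that a node without a parent resolves \texttt{..} to itself.

\begin{verbatim}
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for who implements a node.  */
enum provider { FS_CHILD, FIRMLINK, FS_PARENT };

/* Hypothetical model of a node handle with an explicit parent link.  */
struct node {
  enum provider provider;
  const struct node *parent;    /* NULL models the null capability */
};

/* New handle for the same node, but with PARENT as its parent.  */
static struct node node_reparent(const struct node *n,
                                 const struct node *parent)
{
  assert(n != NULL);
  struct node handle = *n;
  handle.parent = parent;
  return handle;
}

/* Resolve "..".  Assumption of this sketch: a handle without a parent
   resolves to itself, so the lookup can not leave the subtree.  */
static const struct node *node_dotdot(const struct node *n)
{
  assert(n != NULL);
  return n->parent != NULL ? n->parent : n;
}
\end{verbatim}

In this model a \texttt{chroot()} root is simply \verb/node_reparent(&n, NULL)/. A firmlinked node is reparented to a node provided by the \texttt{firmlink} translator, whose own parent is the target directory, so a client walking upwards sees the \texttt{firmlink} provider before it reaches the other filesystem.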