Hi Sits,
How does this compare to existing (coarse grained) Linux capabilities?
Linux capabilities are not capabilities in the classical sense of unforgeable tokens of authority. They are just permissions. Linux capabilities allow a non-root process to do some things as if it were root. Capsicum implements a traditional capability model, where there is no ambient authority and a sandboxed process cannot do anything unless it holds the relevant capability.
How does this compare to SELinux?
They address different goals and do so in different ways. The goal of SELinux (and of the FreeBSD MAC framework's type enforcement mode) is to let a system administrator restrict a particular program or user (or program-user pair) to some subset of system resources. It is poorly suited to application compartmentalisation: you can't fork() a process and give the children different permissions, and it's difficult to update the permissions on the fly. (Apple does do this with their port of the FreeBSD MAC framework to implement sandboxing on OS X and iOS, which lets powerboxes grant permission to specific files or directories.)
Capsicum is intended for application sandboxing. It is assumed that the user trusts the binary to be non-malicious, but the program author does not trust their own code to be free of exploitable bugs. A sandboxed process should call cap_enter() early on, after which it can no longer open anything by name in a global namespace such as the filesystem, so it can only use the file descriptors it already holds or ones derived from them. In some simple cases, that's enough. For example, a sandboxed version of man opens the file containing the mdoc sources and then calls cap_enter(). Even if it's run as root, and is reading a malicious source file that exploits the troff parser and allows arbitrary code execution, it can't do anything other than write text to the terminal. More complex applications will keep open a UNIX domain socket to a (more) privileged process that can pass in new file descriptors as they're needed, after doing some application-specific policy checking.
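To make the last point concrete, here is a minimal sketch of passing a descriptor over a UNIX domain socket with SCM_RIGHTS, which is the standard mechanism a more privileged helper would use to hand a file to a sandboxed process. The names send_fd(), recv_fd() and fd_passing_demo() are made up for illustration, and the policy check the helper would do before sending is left out. Nothing here is Capsicum-specific, so it should build on any POSIX system.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer, aligned for one descriptor. */
union fd_ctl {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Privileged side: send one open descriptor. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    assert(fd >= 0);
    char byte = 0;                      /* must carry at least one data byte */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Sandboxed side: receive one descriptor. Returns the new fd or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return fd;
}

/* Single-process round trip: pass the write end of a pipe across a
 * socketpair, write through the received copy, read it back. */
int fd_passing_demo(void)
{
    int sv[2], p[2];
    char buf[2] = { 0, 0 };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (pipe(p) != 0)
        return -1;
    if (send_fd(sv[0], p[1]) != 0)
        return -1;
    int got = recv_fd(sv[1]);
    if (got < 0)
        return -1;
    if (write(got, "ok", 2) != 2 || read(p[0], buf, 2) != 2)
        return -1;

    close(got); close(p[0]); close(p[1]); close(sv[0]); close(sv[1]);
    return (buf[0] == 'o' && buf[1] == 'k') ? 0 : -1;
}
```

In a real design the two ends of the socket would be in different processes, with the receiving one already in capability mode; the demo keeps both in one process only so that it is easy to run.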
Does this complement things like Linux's seccomp?
Seccomp is far more restrictive than Capsicum and prevents even harmless system calls (e.g. getpid() and gettimeofday(); not, I believe, on platforms where these go through the VDSO, but others of equal utility and harmlessness are blocked). Capsicum allows a whitelisted set of calls. Additionally, Capsicum adds finer-grained permissions on file handles (for example, a descriptor can permit read but not mmap, or append but not seek) and a set of at-suffixed system calls that work on directory descriptors, such as openat(), which lets a sandboxed process open files in a directory for which it holds the relevant capability. This means that, for example, you can give a sandboxed browser tab process a directory descriptor to a cache directory and it can write cache data there without needing any interposition from a more privileged process. It talks directly to the kernel.
Seccomp-bpf extends seccomp by allowing system calls to be blocked or allowed based on the execution of a BPF filter. This is more expressive than Capsicum (Google's first port of Capsicum implemented it in terms of seccomp-bpf, although it was slow and not complete), but it doesn't allow the policies to easily associate permissions data with file descriptors and requires you to implement a complex BPF policy.
What's the overhead compared to the above?
There is basically no overhead for Capsicum. It's one extra bitmask check on each system call that interacts with a file descriptor, which is in the noise for most workloads[1].
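To give a feel for why that check is cheap, here is a toy model of it. This is not the kernel's code: the struct, the RIGHT_* bits and rights_permit() are all invented for illustration, and the real rights are the CAP_* constants with a richer encoding. The point is only that the test is a mask and a compare on data already at hand when the descriptor is looked up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical right bits, for illustration only (not the real CAP_* values). */
enum {
    RIGHT_READ  = 1u << 0,
    RIGHT_WRITE = 1u << 1,
    RIGHT_SEEK  = 1u << 2,
    RIGHT_MMAP  = 1u << 3
};

/* Toy descriptor-table entry: the rights mask sits next to the object. */
struct fd_entry {
    int      object_id;
    uint64_t rights;
};

/* A call is allowed only if every right it requires is present. */
bool rights_permit(const struct fd_entry *e, uint64_t required)
{
    assert(e != 0);
    return (e->rights & required) == required;
}
```

With an entry holding RIGHT_READ | RIGHT_WRITE, a plain read passes, while anything that also needs RIGHT_MMAP is refused.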
In terms of programmer overhead, there's a summary table for the number of lines of code changed to implement sandboxing with various mechanisms in Chrome in the original Capsicum paper. Here's the short version:
- Windows ACLs: 22,350
- Chroot: 605
- OS X Seatbelt (built atop the FreeBSD MAC framework, which was also written by Robert): 560
- SELinux: 200
- Linux seccomp: 11,301
- Capsicum: 100
This doesn't tell the whole story, because the SELinux policy in Chrome periodically got out of sync with the code, and they only got errors when it was too restrictive, not when it was too permissive, so a few releases shipped without enough sandboxing. The seccomp version could isolate renderer processes from each other, but the SELinux one couldn't. The seccomp implementation did not adequately restrict filesystem access, and neither seccomp nor SELinux adequately restricted IPC access. You can see Robert's thesis for more detailed explanation of the limitations of all of these.
Will FreeBSD ship a policy for a ssh/sshd?
I think we will in 10.1, but not in 10.0. The privsep code in OpenSSH is quite special. Another colleague of Robert's and mine has been developing a tool to aid reasoning about application compartmentalisation (SOAAP) and has been applying it to Chromium and OpenSSH: Chromium is easier to understand. We are shipping a number of other utilities that use Capsicum in 10, and more will follow in 10.1.
[1] The initial implementation made capabilities special file descriptors that referred to other file descriptors, so the numbers in the original paper are slightly worse, but still very small.