I'm not going to write an entire paper here on Slashdot.
You already kind of did lol. This is good stuff though. I have some follow-up questions if you don't mind:
1) How are you aware of (and able to control) lower-level things like the page size, or which functions go into which groups of pages?
In a general, hand-wavy fashion, things like page size are an attribute of the compilation environment, and do not vary.
In practice, there are some older MIPS systems and the original NeXTStep which would "gang" 4K pages into 8K pages, and of course there's the Intel variety of superpages, depending on operating mode and the contents of CR4: with the PSE bit set you get 4M pages, or 2M pages if the PAE bit is also set. There are also some other architectures that allow even weirder variants.
As a general rule, most of these things other than the default are accessed via two interfaces: Either a section attribute in an executable section descriptor -- meaning it's handled by the kernel -- or via a special user mode interface for allocating large pages. The user mode interface may or may not be hidden in the malloc internals, in order to prevent direct use by a program. In addition, there are potentially device specific controls (in UNIX systems, these would be ioctl's) to map large pages into a user space process; as an example, the frame buffer memory in a Wayland or X Server, and so on.
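To make the "special user mode interface" case a little more concrete, here's a minimal sketch of what it looks like on a Linux kernel with hugetlbfs configured -- the MAP_HUGETLB flag to mmap(2). The flag name and the 2M size are the Linux spelling of the idea; other systems expose it differently (or only through their malloc internals, as above).

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2 * 1024 * 1024;   /* one 2M large page, if that's the configured size */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)"); /* no huge pages reserved, or kernel lacks support */
            return 1;
        }
        /* ... use p like any other memory; only the TLB footprint changes ... */
        munmap(p, len);
        return 0;
    }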
Practically speaking, one of the most useful things you can do with large pages on a Linux, BSD, or UNIX system running on Intel hardware is to put the kernel itself into large pages; location won't matter, absent a kernel code injection exploit. It's useful because Intel processors maintain separate TLBs (Translation Look-aside Buffers) for large and small pages, which means that user space processes and kernel interrupts, traps, or traps from user space to kernel space (e.g. system calls) won't be ejecting each other's pages from the look-aside. Depending on how frequently you end up running in the kernel vs. user space, this can result in up to roughly a 36% performance improvement.
One of the problems with this is a known bug in Intel processors where INVLPG won't invalidate the page mappings in both TLBs, so there was an early bug that tended to hit Linux systems -- but not FreeBSD systems -- where the INVLPG instruction kicked a page out of one TLB but not the other, if it was mapped in both. This was mostly an issue when you tried to convert from running in real mode to running on the PMMU, and then from there, from 4K pages to large pages. The workaround is to INVLPG twice, or to reload CR3, which flushes the TLBs wholesale (making it the "big hammer").
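For what it's worth, both workarounds are a couple of lines of ring 0 code. This is just a hedged sketch in GCC-style inline assembly; the function names are mine, not from any particular kernel:

    /* issue INVLPG twice so the entry is dropped from both the 4K-page and
       large-page TLBs, per the erratum workaround described above */
    static inline void invlpg_twice(void *va)
    {
        __asm__ __volatile__("invlpg (%0)" : : "r"(va) : "memory");
        __asm__ __volatile__("invlpg (%0)" : : "r"(va) : "memory");
    }

    /* the "big hammer": rewriting CR3 flushes the (non-global) TLB entries wholesale */
    static inline void flush_tlbs(void)
    {
        unsigned long cr3;
        __asm__ __volatile__("mov %%cr3, %0" : "=r"(cr3));
        __asm__ __volatile__("mov %0, %%cr3" : : "r"(cr3) : "memory");
    }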
Anyway, that was a digression. In the scenario I discussed using statistical protection, you'd use the compiler and linker to make sections per function or function group; the linker would put linkage vtables in each of these groups when creating the executable; and the exec function in the kernel would interpret these as allocation units, put each section into as few contiguous pages as possible, and then randomly locate them some place in the process address space -- which on an Intel/PPC architecture means a 64 bit virtual space, out of a 52 or 53 bit physical addressable space.
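The per-function/per-group sections part is the only piece you can already do with stock tools; the kernel policy on top of it is the hypothetical part. A sketch using the GCC/clang section attribute (the group names here are made up for illustration):

    #define FUNC_GROUP(name) __attribute__((section(".text." name)))

    FUNC_GROUP("crypto") int hash_block(const void *p, unsigned len)
    {
        (void)p; (void)len;
        return 0;               /* ... real hashing elided ... */
    }

    FUNC_GROUP("parser") int parse_header(const void *p, unsigned len)
    {
        (void)p; (void)len;
        return 0;               /* ... real parsing elided ... */
    }

    /* Or, without annotating the source at all, compile with -ffunction-sections
       and let a linker script decide how the resulting .text.* sections get
       grouped and aligned. */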
When the loader resolved the linkages for shared libraries through the fault/fixup mechanism, it'd do it by library:section, rather than by library alone, using the per-section vtables.
2) Why is it called "container-in-a-mailbox?"
Historically speaking, there are several ways to pass things around between components. One of these is via register reference to the address of the thing. Another is via stack reference to the address of the thing. Another is via descriptor (in VMS, this is the function descriptor; in Mach, this is a Mach Message that is defined in an IDL compiled with a tool called "mig"; in ORBs like omniORB or CORBA, or even COM/DCOM in Windows, it's via an IDL stub that's either compiled via a compiler like mig, or is generated automatically during the link stage, etc.). And then there is message passing, as is done on the data bus of things like the Northern Telecom DV1, DV2, DV3, etc., which are used in the implementation of phone switches.
When you pass messages, it's generally impractical to pass the entire message by value; and while the message passing could be a partial address space mapping hand-off, as in Mach, another method is to use what's called a "mailbox".
In its simplest implementation, a mailbox is a file in which one program copies a large chunk of data, and then informs some other program that there's data waiting in the mailbox.
In one of the early implementations of a TPS (transaction processing system) implemented by AT&T Bell Labs (named "Tuxedo", if you care), mailboxes were placed in System V shared memory segments, and the notification between programs occurred via System V message queues. In something like sendmail or qmail, they're implemented as separate message and index files in what's called a "mail queue directory".
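A toy sketch of that Tuxedo-style arrangement, for the curious -- the mailbox lives in a System V shared memory segment, and the "you have mail" poke travels over a System V message queue. The keys, sizes, and the slot field are arbitrary illustration values, not anything from Tuxedo itself:

    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/msg.h>

    struct note { long mtype; int slot; };          /* tiny notification record */

    int post_to_mailbox(key_t key, const char *data, size_t len)
    {
        int shmid = shmget(key, 64 * 1024, IPC_CREAT | 0600);
        int msqid = msgget(key, IPC_CREAT | 0600);
        if (shmid < 0 || msqid < 0)
            return -1;

        char *box = shmat(shmid, NULL, 0);          /* map the mailbox */
        if (box == (void *)-1)
            return -1;
        memcpy(box, data, len);                     /* copy the bulk data once */
        shmdt(box);

        struct note n = { 1, 0 };
        return msgsnd(msqid, &n, sizeof(n) - sizeof(long), 0);  /* poke the consumer */
    }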
In practice, if you are using something like statistical memory protection on a system with a very large address space, and a large amount of physical RAM (or less, if you are willing to page, but you will almost certainly page, due to front-to-back linear processing of mailbox contents), you could just place the messages in anonymous memory mappings, and then "forget" them once their address has been handed off to the next component interface.
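A sketch of that "anonymous mapping per message" variant -- allocate a mapping big enough for the message, fill it, hand only the address and length to the next component, and never touch it again from this side. Here deliver_to_next_stage() is a stand-in for whatever the next component's entry point actually is:

    #include <string.h>
    #include <sys/mman.h>

    extern void deliver_to_next_stage(void *msg, size_t len);   /* hypothetical */

    int hand_off_message(const void *data, size_t len)
    {
        void *box = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (box == MAP_FAILED)
            return -1;
        memcpy(box, data, len);
        deliver_to_next_stage(box, len);    /* after this we "forget" box; the
                                               consumer munmap()s it when done */
        return 0;
    }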
So on most modern systems, mailboxes are ideal.
The reason it's container-in-mailbox is that there is a separate logical phase for container receipt vs. validation vs. use of container contents. So you have to put the container in the mailbox so the validation phase can safely operate on it without trusting that the component handing it the data has not been compromised (and likewise for the handoff between validation and utilization).
It's a security domain separation, rather than a protection domain separation or an address space domain separation (although address space, in the case of statistical protections -- or hypothetical future hardware or use of ring 1/2, if you can live with a granularity of "3" -- generally amounts to one of those as an implementation detail).
3) You wrote, "Most modern (predominantly research) security architectures" -- who is doing this research, and where can I find it?
Wow. Pretty much everyone in OS software who cares?
IBM and Microsoft are players; OpenBSD is, for some types of things; Apple is. Linux people (though I think it was a DARPA project run by IBM?) were the first to implement ASLR; I think Apple was the first to ASLR absolutely everything, and to do page level executable signature verification in the paging path? Though I think they mostly did it for DRM reasons, rather than to be helpful to users. I think compiler stack probes came from the LLVM folks?
The hardware guys have pretty much been warts on the tail of progress; they're not very fast to implement anything for which there isn't already a proven market, because of development costs. TrustZone on ARM is the one thing I can think of that went big-time, and that was mostly to allow application software and baseband software to run on the same processor in a cell phone, so that the SDRs (Software Defined Radios) could get certification by agencies like the FCC in the U.S. and whatever passes for the FCC in various other countries.
As part of this, you define an interface contract: you are permitted to call down to the interfaces below yourself, and you are permitted to call across, within the same layer, to auxiliary functions, but under no circumstances are you permitted to call upward.
That would ruin (or improve) a lot of modern OO techniques.
Well, yes and no.
From a security perspective, if you were using an object interface vtable on a linkage for a C++ object, it would certainly prevent you from doing things like hiding the function pointers from other code that's allowed to call into the object. So the mechanics of OO language design are inherently inimical to security, from that standpoint. On the other hand, you could be handing off the call as a descriptor with an object to other code that knows how to do the dispatch, and performs information hiding to keep you from adding 4 or 8 or whatever to the known address of the function to get to the next function, which might be a friend function or a private function located sequentially in the object.
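In C terms, the descriptor version looks roughly like this -- the caller sees only an opaque handle plus a method index, and the table of function pointers lives behind the dispatcher, so you can't just add 4 or 8 to a known function address to reach its neighbors. The names are illustrative, not from any real framework:

    #include <stddef.h>

    typedef struct object object;       /* opaque to callers */

    enum method { M_OPEN, M_READ, M_CLOSE, M_NMETHODS };

    /* the only symbol callers ever link against */
    int dispatch(object *obj, enum method m, void *arg);

    /* implementation side -- in real use this lives in its own translation
       unit (and, with statistical protection, its own randomly placed pages),
       so callers never see the layout */
    struct object {
        int (*methods[M_NMETHODS])(object *, void *);   /* hidden "vtable" */
        /* ... private state ... */
    };

    int dispatch(object *obj, enum method m, void *arg)
    {
        if (obj == NULL || (unsigned)m >= M_NMETHODS || obj->methods[m] == NULL)
            return -1;                  /* refuse out-of-contract calls */
        return obj->methods[m](obj, arg);
    }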
Dispatching that way wouldn't entirely preclude layering violations, but it would certainly make them more difficult. That would improve security, but whether it improved the techniques? It depends on whether your techniques were already predicated on interface violations, whether you were accessing data members directly rather than through an accessor/mutator function, whether or not you were using a global static object instance with a reference or entry counter for critical sectioning, and so on.
Sidebar: as a general rule, critical sections should almost never be used, except when dealing directly with hardware interrupts, and not even then, if your hardware is correctly designed to block/queue interrupts until a previous interrupt is explicitly acknowledged. Protect access to data objects, not to the code that accesses or modifies data objects. Code should be intrinsically reentrant. There are some really cheap mechanisms you can use here to avoid pipeline stalls, like an atomic increment when adding a reference, and only taking a lock when decrementing for the 1->0 check, and so on. You need a data pipeline barrier in that case, but you don't incur both a data and a code pipeline stall, etc. (If you want to see an example of how this is done, look at the kern_cred.c and kern_credential.c code in Mac OS X -- it's on opensource.apple.com.)
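A minimal sketch of that cheap reference-count pattern, using C11 atomics -- the hold is a lock-free atomic increment, and only the thread that takes the count to zero pays for the lock and the teardown. Same shape as the credential code referenced above, but this is a toy version, not that code:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdlib.h>

    struct obj {
        atomic_int      refcount;
        pthread_mutex_t teardown_lock;
        /* ... payload ... */
    };

    void obj_hold(struct obj *o)
    {
        /* hot path: no lock, just an atomic add -- a data barrier,
           not a code-serializing critical section */
        atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
    }

    void obj_release(struct obj *o)
    {
        /* drop our reference; fetch_sub returns the old value, so "== 1"
           means this thread just did the 1->0 transition */
        if (atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_acq_rel) == 1) {
            pthread_mutex_lock(&o->teardown_lock);
            /* ... unhook from any global table, free payload ... */
            pthread_mutex_unlock(&o->teardown_lock);
            pthread_mutex_destroy(&o->teardown_lock);
            free(o);
        }
    }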
In general, I think anything that results in code being reentrant just makes sense, particularly OO code, where the object itself carries the state and there may be multiple threads operating on separate objects through the same code simultaneously.
Other examples are turnstiles in Solaris, and prohibiting lock recursion as a "kernel panic/segfault offense" to prevent layer revisiting from even being legal at all, etc.
The reason I like DJB's work is because he seems to carefully think about what problems may arise every time he writes a line of code. He may not always succeed, but if you don't have that way of thinking, you will automatically fail at "identifying architectural layers for your libraries in order to abstract complexity of each layer from the layer below it," and will have bugs no matter what rules you follow.
The problem I really have with his work is that it's largely academically oriented, rather than practical. It's like Peter Druschel's work at Rice University on LRP (Lazy Receiver Processing), which is quite brilliant, but nearly impossible to reduce to practice as published.
At a previous employer, we actually reduced LRP (without the "rescon" additions, which are patented and IMO not useful) to practice. Getting this to actually work usefully as a solution to receiver livelock involved going well beyond the work he and his students had done, and required things like modifying the firmware in Tigon II and Tigon III cards to not interrupt until their last interrupt was acknowledged; and for incoming connections, you had to modify the way the routing and socket handling in the accept(2) system call operated, or you'd get a packet into the kernel when there wasn't actually an mbuf hanging there ready to receive the incoming connect request, and so on.
It was a lot of "resolving implementation details which are inconvenient for me is left as an exercise for the student".
DJB's work has a lot of that flavor to it.
In particular, if you look at DJBDNS, it has no support for secondaries, it has no support for interior vs. exterior DNS resolution (I wrote the RFC draft for that in the IETF DNS Working Group, which is a mailing list named "namedroppers"), and it has no support for zone transfers.
These were all considered "insecure", and he expected that all DNS servers would be authoritative primaries, and that "zone transfers" would use an out of band communication mechanism (I believe at the time, he was suggesting "rsync" on the zone data files for this?). This was my first experience with his philosophy of partitioning function by program, and functional decomposition as a solution to complexity.
I didn't see this change with qmail -- although he admittedly did cover a larger proportion of the problem space, he still failed to map all of it -- and the places where he "compromised his principles by doing so" demonstrated later weaknesses in the philosophy, like the exploit we've been using as an example in this discussion.
I really don't like "proof is left to the student" type stuff, any more than I liked it when I first saw in Feynman's lectures that he was in fact using Clifford Algebras to do quantum physics, and didn't bother to share this fact with the rest of us, or when I found out that Newton had invented calculus, and was able to pop Sir Edmund Halley's bubble on his big announcement by answering his rhetorical question "and do you know what shape the orbit describes?" with "it's an ellipse, of course".
Anything where parts of the problem space supposedly being mapped by someone's solution aren't reduced to practice tends to be very annoying. But perhaps that's just me... 8^)