The Haskell code is called a "specfication", but if it is Haskell code, surely it should count as a _program_ already?
Usually formal specifications are denoted using some sort of semantics. Large step semantics defines how create a derivation tree. Basically, how is a large program broken up into steps and completed one step at a time. Small step semantics define how an instruction takes one step in a program. There are other various forms of semantics, but each one defines mathematical constructs like stores (mapping from program variables to values), and using these constructs, you can prove different properties about that specification.
.
How can you prove that that program is bug-free?
"bug-free" usually means there is nothing left up to the implementation of the formal specification. For example, in C/C++ dereferencing of null pointers left up to the implementation. This means that one implementation can seg fault because you tried to access invalid memory, and another could return a random value, while another could always return 1. All three of these fully implement the specification because the specification does not define what to do when dereferencing null pointers.
Using a strongly typed language such as Haskell, Ocaml, or most any other functional language helps to prevent cases where the specification doesn't explicitly define an action. For example what is True - 3. In a strongly typed language, this would never compile, so the specification doesn't have to define this case, however in C++ where bools are actually ints this can happen.
.
How about conceptual bugs?
If a conceptual bug is one where the specification is incorrect, then this proof technique will not catch those "bugs".
.
Was the toolchain verified?
Usually the first compiler for a language is proven to be correct, and then using this fact, all subsequent compilers can be compiled using the first compiler. For example (i'm going off of memory here) the first ocaml compiler was written in c/c++ and proven to be correct. Once it was proven to be correct, when further optimizations/modifications were needed, a new compiler was written in ocaml, then compiled using the first compiler. Since it is usually easier to prove the correctness of a functional program that is strongly typed, proving the next generation compiler is significantly easier than the first generation.
.
How about the hardware on which it runs?
I'm not sure about this. As i'm not familiar with the kernel, it may be that it doesn't interface with any of the hardware itself and interacts w/ the hardware through modules/drivers. If this is so, then the modules/drivers would need to be proven correct as well.
.
What overhead does this approach have? Are the benefits worth it?
The benefits are very large for certain situations. For example, when the nuclear reactor starts to overheat, you want to be able to say with 100% confidence there will not be a meltdown and it will automatically shut itself off. You would want to say (with 100% confidence) that while flying on autopilot, the aircraft will not suddenly decide to go into a nose dive, etc.
.
I'm not saying this is all bullshit, but it looks like me that they are pointing to one program, calling it a "specification", and then demonstrating machine translation / verification to a similar program. I'm not sure if I buy that methodology.
What i assume that they did was first create a formal specification(denotational semantics, etc) for the kernel, implemented it in Haskell then implemented it in C. They proved that the C implementation is a 1-1 mapping of the haskell program. I wasn't able to determine, but i assume they had already proven that the Haskell program is a correct implementation of the formal specification.
.
My apologies for the formatting.