This release removes almost all the remaining SMP contention from both critical paths and most common paths. The work continued past the release in the master branch (some additional changes were too complex to test in time for the release). For all intents and purposes the master branch no longer has any SMP contention for anything other than modifying filesystem operations (such as concurrent writes to the same file). And even those sorts of operations are mostly contention-free due to the buffer cache and namecache layers.
Generally speaking, what this means is that for smaller 8-core systems what contention there was mostly disappeared one or two releases ago, but larger (e.g. 48-core) systems still had significant contention when many cores were hammering on shared resources. This release, and the work now in the master branch, basically removes the remaining contention on the larger multi-core systems, greatly improving their scaling and efficiency.
A full bulk build on our 48-core Opteron box took several days a year ago. Today it takes only 14.5 hours to build the almost 21000 packages in the FreeBSD ports tree. These weren't minor improvements.
Where it matters most is with heavily shared resources: for example, a bulk build on a large multi-core system, which is constantly fork/exec'ing dozens of instances of the same programs concurrently (/bin/sh, make, cc1[plus], and so on, a common scenario for any bulk-building system), or heavily shared cached filesystem data (a very common scenario for web servers). Under these conditions there can be hundreds of thousands of path lookups per second and over a million VM faults per second. Even a single exclusive lock in these paths can destroy performance on systems with more than 8 cores. Both the simpler case, where a program such as /bin/sh or cc1 is concurrently fork/exec'd thousands to tens of thousands of times per second, and the more complex case, where sub-shells are used for parallelism (fork without exec)... these cases no longer contend at all.
Other paths also had to be cleaned up. Process forking requires significant pid-handling work to allocate PIDs concurrently, and execs pretty much require that locks be fine-grained all the way down to the page level (and then shared at the page level) to handle the concurrent VM faults. The process table, buffer cache, and several other major subsystems were rewritten to convert global tables into effectively per-cpu tables. One lock would get 'fixed' and reveal three others that still needed work. Eventually everything was fixed.
Similarly, network paths have been optimized to the point where a server configuration can process hundreds of thousands of TCP connections per second and we can get full utilization of 10GigE NICs.
And filesystem paths have been optimized greatly as well, though we'll have to wait for HAMMER2 to be finished before modifying filesystem calls reap the real rewards from that work.
There are still a few network paths, primarily related to packet filtering (PF), that are serialized and need to be rewritten, but those and the next-gen filesystem are the only big-ticket items left in the entire system insofar as SMP goes.
Well, the last problem, at least until we tackle the next big issue. There is still cache-coherency bus traffic which occurs even when, e.g., a shared lock is uncontended. The code base is now at the point where we could probably drop in the new Intel transactional instructions and prefixes and get real gains (again, only applicable to multi-chip/multi-core systems, not simple 8-thread systems). It should be possible to bump fork/exec and VM fault performance on shared resources from their current very high levels right on through the stratosphere and into deep space. Maybe I'll make a GSOC project out of it next year.
The filesystem work on HAMMER2 (the filesystem successor to HAMMER1) continues to progress, but it wasn't ready for even an early alpha in this release. The core media formats are stable but the freemap and the higher-level abstraction layers still have a lot of work ahead of them.
In terms of performance... well, someone will have to re-run benchmarks instead of just re-posting old stuff from 5 years ago. Considering the SMP work I'd expect DFly to top out on most tests (though there's still always the issue of benchmark testers blindly running things without actually understanding the results they post). Database performance with postgresql still requires some work for large system configurations because of pmap replication: postgres fork()s and now uses mmap() instead of sysv-shm, so, e.g., a large 100GB+ shared memory cache configuration replicates a lot of page tables. We have a sysctl to enable page-table sharing across discrete fork()s but it isn't stable yet... with it, though, we get postgres performance on par with the best Linux results in large system configurations. So there are still a few degenerate cases here and there that aren't so much related to SMP as they are to memory resource use. But not much is left even there.
Honestly, Slashdot isn't the right place to post BSD stuff anymore. It's too full of immature posts and uninformed nonsense.
-Matt