Your argument only makes sense if you fix the target platform, compiler and compiler options for the comparison. In fact, it's trivially provably correct: If the compiler beats my assembly code, I can simply replace my own assembly code with the assembly code the compiler generated and force a tie.
However, that misses the point: I can write the fastest possible assembly for a given platform, but it might not be the fastest way to do something on a different (but compatible) platform. But the C code, without modification, could potentially beat my unmodified assembly code when compiled for that other platform. The compiler has the flexibility to tune its output for the target, while my assembly code is fixed for one target. And if you include platforms with a different underlying assembly language, the C code wins by default because my assembly code doesn't even run.
For example consider 4x4 matrix multiplication. Do you use nested loops or just unroll it manually? Compilers tend not to fully unroll all the nested loops. The compiler may do better scheduling on fully unrolled non-looping code.
Ironically, a vectorizing compiler would prefer you give it the loops instead. If you gave it any hints at all, they should be about pointer aliasing (e.g. the C99 restrict keyword), pointer alignment, and minimum trip counts if any of the loops have variable trip counts. Manually unrolling usually makes its job much, much harder.
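To make the aliasing hint concrete, here's a minimal sketch of the loop form with restrict on every pointer. The function name mat4_mul and the row-major float layout are my own assumptions for illustration, not anything from the parent post. The trip counts are all the constant 4, so there's nothing further to tell the compiler on that front.

```c
/* Hypothetical helper: c = a * b for row-major 4x4 float matrices.
   restrict is a promise from the caller that a, b and c never overlap,
   so stores to c cannot change values later read from a or b. */
void mat4_mul(const float *restrict a, const float *restrict b,
              float *restrict c)
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            /* Plain nested loops: unrolling and vectorizing are left
               to the compiler for whatever target it is building for. */
            for (int k = 0; k < 4; k++)
                sum += a[i * 4 + k] * b[k * 4 + j];
            c[i * 4 + j] = sum;
        }
    }
}
```

Whether a given compiler actually vectorizes this depends on the target and the flags, so it's worth looking at the generated assembly rather than taking my word for it.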
Do you create temporary variables to preload a row or column, or do you just access each variable in memory directly? The former may generate better code on a RISC architecture and the latter on a CISC architecture.
If you provide good pointer aliasing qualifiers, I'd hope both produce about the same regardless of CISC or RISC with modern compilers, instruction schedulers and register allocators.
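Here's a rough sketch of the two styles side by side, using a 4x4 matrix times a 4-element vector. Both function names are hypothetical. Without restrict, the compiler has to assume a store to out[i] might modify v, so the direct version may reload v on every iteration. With restrict it is free to keep v in registers, which is exactly what the preloaded version does by hand.

```c
/* Direct style: every operand is read from memory where it is used. */
void mat4_vec_direct(const float *restrict m, const float *restrict v,
                     float *restrict out)
{
    for (int i = 0; i < 4; i++)
        out[i] = m[i * 4 + 0] * v[0] + m[i * 4 + 1] * v[1]
               + m[i * 4 + 2] * v[2] + m[i * 4 + 3] * v[3];
}

/* Preloaded style: the vector is copied into temporaries up front. */
void mat4_vec_preload(const float *restrict m, const float *restrict v,
                      float *restrict out)
{
    const float v0 = v[0], v1 = v[1], v2 = v[2], v3 = v[3];
    for (int i = 0; i < 4; i++)
        out[i] = m[i * 4 + 0] * v0 + m[i * 4 + 1] * v1
               + m[i * 4 + 2] * v2 + m[i * 4 + 3] * v3;
}
```

I'd expect an optimizing compiler to emit very similar code for both, but that's a hope rather than a guarantee. Compile each with -S and compare before trusting it.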
These are the sort of things I think of when referring to helping the compiler, giving it hints. When that mythical smart compiler arrives that is able to figure out the preceding on its own, it will simply ignore your hints, and the hints will do no harm. Until this mythical compiler arrives, the hints may help.
When was the last time you used the register keyword in C and it had a meaningful effect? Depending on which aspect of the code you consider, the "mythical" compiler you refer to may be less mythical than you think.
Look up the history of the Stepanov Benchmark as it applies to C++ programs, for example. It was once a hot topic among C++ compiler writers, because it exposed how much C++ abstractions cost at run time. Now most compilers ace it, sometimes producing faster code with the C++ abstractions than the C baseline they're measured against.
Ok, maybe not 20 years old, but 17 years old. Software I wrote in 1996 is still used today to verify chips built in the team I'm in at work. And that code compiles just fine. I haven't developed actively on it in about 14 years. No substantial tweaks to keep it current, either. I don't think it will compile as a 64-bit executable, but the fact that even Firefox is available as a 32-bit executable by default tells me that that's not a "historical" mode.
I was speaking with a team at work. They're talking about finally replacing some 30+ year old code in their code base with more modular, modern code. Sure, the whole package around it has continued to evolve, but some pieces date back to the first Reagan Administration. High level languages made that sort of continuity possible.
Now granted, the team whose code I'm referring to is a compiler team. Maybe, just maybe, they put more faith in compiled high level languages than your average programmer.
LLVM started outside of Apple. Apple hired some key LLVM developers, and put several of their own on it too. They've kept it public because enough people outside Apple are still contributing, and sure, that's great. So far everyone benefits. If Apple decided to stop publishing their LLVM updates and took it private, FreeBSD would have to fork it or move to another compiler.
But none of that is specific to FreeBSD, and none of those fund core FreeBSD development (which could happen just as easily with GCC or another compiler if LLVM were unavailable). Your point, again?
I personally find it more valuable to periodically examine the compiler output, and understand if there are particular idioms in my code which lead to bad code generation. For example, when I'm constructing a particular set of abstractions (say, a C++ template library), can the compiler peel them back and give me efficient code?
I still write assembly code when I need to (especially on small embedded systems, or on the Intellivision), but most of the time I just don't have the time. Also, most code's performance just doesn't matter. My time truly is the limited resource. I've had to learn that perfect is the enemy of good, and so to pick my battles wisely.
MenuetOS is impressive, but I doubt I would ever be able to use it, because I won't ever have the subset of peripherals, motherboards, etc. required to run it. And, because its development is entirely in assembly code, I suspect hardware will continue to change faster than it does.
I remember the last time I bet on an assembly-written horse (WordPerfect). It was the fastest, it was solid, and it got trumped, hard, when the world changed faster than it could.
I'm not sure why you think x86 is at all opcode compatible with the 4004. It's not even opcode compatible with the 8085. (You could translate 8085 code to 8086 with a special translator, but it wasn't guaranteed to be perfect.) The 4004 had a very weird instruction set, actually. Probably had something to do with the fact that you had to manually manage the memory bus and chip selects from the CPU, as opposed to more generic memory busses found on, well, just about anything outside the microcontroller world that talks to a JEDEC memory.
Otherwise, most of the CPU complexity that currently shows up is due to the fact that the CPU speed far outstrips the memory bus speed, thus all of the concern about "local" memory caches and pipelined instruction ordering. If you could create a much faster memory bus, CPU designs could be simplified considerably from a software developer POV.
This is the infamous memory wall, and let's face it, no processor vendor has figured out how to bypass physics and "just make memory faster." The problem was identified as far back as the 1940s, long before Intel even existed or Shockley's team at Bell Labs had invented their transistor. Consider this quote from Von Neumann himself:
Ideally one would desire an indefinitely large memory capacity such that any particular ... word would be immediately available. ... We are ... forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.
A. W. Burks, H. H. Goldstine, and J. von Neumann
Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946)
But hey, I'm sure you've got some ideas. Why not get some VC money and make the next processor that will beat them all?
As I recall, FreeBSD provided some of the key underpinnings to Mac OS X and iOS. Surely Apple can spare some of its $90B back to the effort. $1M is a rounding error compared to $90B...
My cassettes all migrated to CDs, and then from there to digital audio.
So extrapolating from that it seems the end game for all evolution is becoming beings of pure energy, DRM optional.
Not trying to do the "one up" thing here, but IMHO, the end game for evolution would be to become beings of pure information. Energy and matter are merely vehicles to store and transfer information content. We would probably get equally frustrated with the limitations of existing as energy beings as we currently do with the flaccid biological bags that we exist in.
And your DRM comment is indeed something to ponder on - the artificial copy protection mechanisms that we have slapped on top of our existence - not just at physical levels but even in our minds.
This is as opposed to the physical +1 Insightful atomicxblue would be sending you in the mail on a normal day.
Do you think that keyboard you are holding is
Hmmm.
PS. To my Indian friends, can you please share with us how you guys can keep the budget so low?
duh, they obviously outsourced the work to ind-uhh... that is a good question.
Heh, that was quite funny!
There's very little I know about ISRO. But there are a few things that work well in India (as a government run entity) and ISRO is definitely one of them. You have to understand that for several decades, Indian organizations like ISRO had to innovate and invent even basic engineering stuff largely in isolation. The homegrown Param supercomputer was also a response to this - because most high technology items (even basic things like CPUs and interconnects) could not be imported as they were banned by the US.
As such, the frugality of organizations like ISRO is more of a byproduct of the severely constrained environment in which they grew up. So they learnt to make do with what they had, learnt to develop workarounds and became really innovative. Plus, some early successes enabled ISRO to acquire pride of place even in the mind-numbingly inefficient and corrupt bureaucracy. Due to this, they were able to largely avoid a lot of red tape that is endemic to any Indian government organization. They were able to get reasonable amounts of funding and were also able to attract some reasonable levels of talent.
In terms of talent, the situation is still quite sad as most scientists who work in ISRO either do it because of a true calling or because of patriotism or both. They still know they get paid peanuts compared to their American or Chinese counterparts. It is a near miracle that organizations like ISRO survived and even thrived in the morass that is the Indian Administrative Service - an ignoble legacy of the Brits, but something that was made a hundred times worse by the Indians themselves.
Why the needlessly stringent power draw? You can get passively cooled discrete GPUs or low-noise active cooling which would give you a major bump in performance. APUs won't be able to do 4K for a loooong time for anything but video.
You make a valid point - and I don't know *all* the options that exist.
It would actually be a very interesting exercise to do this kind of a comparison. Say, take some HTPC-like constraints such as space and heat, identify the options available - both CPU+discrete graphics and CPU+GPU integrated, and compare the options using price and performance.
Back to your point, it is not just power draw - space and cooling are also factors. A reasonably strong integrated CPU+GPU system lets you build a cabinet that can be very slim - say, something that resembles a compact Blu-ray player.
I would also imagine that an integrated solution like this will allow better airflow.
Finally there's price. Undoubtedly discrete graphics will always have the performance crown. However, if you think of Moore's law, CPUs have already reached the point of diminishing returns in terms of size of individual cores or even number of cores in a chip. From now on, IMHO, Moore's law will be all about integrating as much of the motherboard as possible into a single chip or package. And GPU is the most obvious starting point.
To put it another way, in terms of price-performance-heat, discrete GPUs will not be able to compete with a highly integrated solution - over time. They will keep getting pushed into smaller and smaller niches. An integrated solution will generally be cheaper and cooler for equivalent performance. It wasn't a viable solution in many cases until now only because the performance was sub-par - but Kaveri is the first viable chip that gives you enough horsepower to play last gen games at full HD with reasonable frame rates. In two years, Kaveri will be at 2 teraflops - the same as a PS4.
No amount of careful planning will ever replace dumb luck.