Re:People still use GCC? (Score 5, Informative)
I'm not the AC, but I'll try to share the knowledge.
I'm a kernel programmer and worked on a Linux-based, realtime, high-def, broadcast-quality H.264 video encoder that used a hybrid mix of multiple cores and FPGAs, so I'm fairly familiar with at least one use case.
OpenMP is used to parallelize workloads via pragmas in the source code. That is, take an app that is designed for a single CPU, add some pragmas and some OpenMP calls, and let the compiler parallelize it. It does this [mostly] by parallelizing loops that it finds.
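For those who haven't seen it, a minimal sketch of the idea in C [assuming a compiler with OpenMP support, e.g. gcc -fopenmp]:

    void scale(float *v, int n, float k)
    {
        /* The pragma asks the runtime to split the loop's
           iterations across the available cores. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            v[i] *= k;
    }

Remove the pragma and it's still a correct serial program, which is the main selling point.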
Parallelizing [simple] loops can be done in [at least] two ways [sketched below]:
(1) A single loop can be parallelized across multiple cores
(2) If a function does loop A followed by loop B, and loops A and B share no data, they can be done in parallel.
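Case (1) is the parallel-for sketch above. Case (2) maps onto OpenMP's sections construct; a hypothetical example [f and g are stand-ins for per-element work]:

    extern float f(float), g(float);   /* hypothetical per-element work */

    void run_disjoint(float *a, int n, float *b, int m)
    {
        /* Loops A and B share no data, so the runtime is free to
           run the two sections on different threads concurrently. */
        #pragma omp parallel sections
        {
            #pragma omp section
            for (int i = 0; i < n; i++)    /* loop A */
                a[i] = f(a[i]);

            #pragma omp section
            for (int j = 0; j < m; j++)    /* loop B */
                b[j] = g(b[j]);
        }
    }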
OpenMP assumes a shared memory architecture (e.g. all cores are on the same motherboard). Contrast this with MPI, which can go "off board" [via a network link]. There are hybrid implementations that use both in a complementary fashion.
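The hybrid pattern looks roughly like this [a sketch, not production code; assumes an MPI implementation plus OpenMP]: MPI splits the work across nodes, and OpenMP fans out across the cores within each node.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Each node takes a strided share of the iterations... */
        double local = 0.0, total = 0.0;
        /* ...and OpenMP splits that share across the node's cores. */
        #pragma omp parallel for reduction(+:local)
        for (int i = rank; i < 1000000; i += nranks)
            local += 1.0 / (1.0 + i);

        /* Partial results come back "off board" over the network. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }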
A good use case for this is weather prediction/simulation, which is highly compute intensive but doesn't have realtime requirements: we just want our final answer ASAP, and what the program does moment-to-moment doesn't matter. Another use case is protein folding.
But neither OpenMP nor MPI is well suited to a realtime situation that requires precise control over latency. Also, OpenMP doesn't support compare-and-swap. And it's prone to race conditions.
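The race-condition hazard is easy to hit. A classic example [hypothetical, but representative]:

    long count_hits(const int *samples, int n)
    {
        long hits = 0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            if (samples[i] > 0)
                hits++;   /* RACE: unsynchronized read-modify-write */
        return hits;
    }

A reduction(+:hits) clause would fix this particular pattern, but lock-free algorithms built on compare-and-swap are outside what the pragmas give you.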
Ideally, designing a given app from the ground up for parallelism is the better choice. If one does that, the fanciness of OpenMP isn't required. My last implementation of an OpenMP equivalent [that also incorporated what MPI does] was ~1000 lines of code, because the app was pre-split into threads set up in a pipeline. It supported a multi-master, distributed map/reduce equivalent using worker threads [still within the 1000 lines].
Consider the second loop-parallelization case. It's easy enough for a programmer to see that loop A and loop B are disjoint and put them in separate threads (e.g. A and B). But if one is aware of this, the split can be done even if loops A and B share some data, because one can control the synchronization between the threads precisely [see the sketch below]. Extend this to 40-50 threads that have a more complex dependency graph.
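Done by hand, that precise synchronization is just a mutex and condition variable per handoff point. A minimal sketch in pthreads [a hypothetical single-slot handoff; a real pipeline would use a deeper queue, and the slot would be initialized with PTHREAD_MUTEX_INITIALIZER etc.]:

    #include <pthread.h>
    #include <stdbool.h>

    /* One slot between pipeline stages A (producer) and B (consumer). */
    struct slot {
        pthread_mutex_t lock;
        pthread_cond_t  ready;
        void           *frame;
        bool            full;
    };

    void slot_put(struct slot *s, void *frame)   /* called by thread A */
    {
        pthread_mutex_lock(&s->lock);
        while (s->full)                          /* wait for B to drain */
            pthread_cond_wait(&s->ready, &s->lock);
        s->frame = frame;
        s->full  = true;
        pthread_cond_signal(&s->ready);          /* wake thread B */
        pthread_mutex_unlock(&s->lock);
    }

    void *slot_get(struct slot *s)               /* called by thread B */
    {
        pthread_mutex_lock(&s->lock);
        while (!s->full)                         /* wait for A to produce */
            pthread_cond_wait(&s->ready, &s->lock);
        void *frame = s->frame;
        s->full = false;
        pthread_cond_signal(&s->ready);          /* wake thread A */
        pthread_mutex_unlock(&s->lock);
        return frame;
    }

The point is that every place two threads can interact is visible and under the programmer's control, which is what a complex dependency graph needs.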
Note that latency means that a given thread A will deliver its results to thread B in a finite/precise/predictable/repeatable amount of time. In video processing, each stage must finish processing within the time allotted for a video frame [usually 1/30th of a second]. With extra buffering, that can be relaxed a bit, but the average must stay at 1/30th of a second and can't vary too widely (e.g. no frame could take [say] 1/2 second).
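The pacing itself is straightforward; a sketch of a per-stage frame loop [process_frame is a stand-in for the stage's real work; assumes POSIX clock_nanosleep]:

    #include <time.h>

    #define NSEC_PER_SEC 1000000000L
    #define FRAME_NSEC   (NSEC_PER_SEC / 30)   /* ~33.3 ms per frame */

    void stage_loop(void (*process_frame)(void))
    {
        struct timespec deadline;
        clock_gettime(CLOCK_MONOTONIC, &deadline);
        for (;;) {
            /* Advance the absolute deadline by one frame period;
               sleeping to an absolute time avoids cumulative drift. */
            deadline.tv_nsec += FRAME_NSEC;
            if (deadline.tv_nsec >= NSEC_PER_SEC) {
                deadline.tv_nsec -= NSEC_PER_SEC;
                deadline.tv_sec++;
            }
            process_frame();   /* must finish before the deadline */
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
                            &deadline, NULL);
        }
    }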
Thus the AC, although snide, is partially right. If I were doing an implementation, I believe the result would be better without OpenMP. But I've got 40+ years doing realtime systems; not everybody does. Most consumers of OpenMP [and/or MPI] are scientists/researchers who are [no doubt] experts in their field, but they're usually not expert-level programmers. And they usually don't have the restrictions imposed by a realtime system. Notable exceptions: programming for MRI/PET/etc. machines.