Comment Re:Performance Comparison? (Score 1) 158

TBB's not a replacement for OpenMP or MPI. If either works for you, you should use it. OpenMP is usually used as a shared-memory programming model and works well for Fortran and much C code. TBB is aimed at C++ and C programs, also shared memory. MPI is a distributed-memory programming model.

First and foremost, TBB is easy to use while still delivering high performance.

A critical item for high performance in parallel programs is scaling, and TBB helps you get better scaling more easily than you would with hand coding. OpenMP and MPI also generally encourage and lead to good scaling.

Another key on shared memory machines is managing caches well. TBB, again, does very well.

Benchmarks... it would be good to get ideas on what we should show. Here is what we have looked at / seen:
(1) Comparing with code written using pthreads/Windows threads, TBB code is much easier to write and debug. We've seen serious programmers who don't write a lot of parallel code fail to get the hand-threaded code to work, but get TBB to work. In each case I've seen, the programmer had to stop with hand threading because they had to go work on something else after their effort to add pthreads to a big program didn't work well enough. We've seen experienced programmers get good scaling with TBB the first time, something they have to spend time on with hand-coded threads.
(2) If a program can be written with OpenMP, we've seen comparable performance between OpenMP and TBB on first implementations, but further tuning can let OpenMP outperform TBB when OpenMP's dynamic scheduling can be turned off (use 'static' scheduling for a boost in speed). TBB is always dynamic and will be beaten in such cases. Of course, dynamic scheduling can be a huge win for problems that are even a little bit irregular. There is a small sketch of the two styles after this list.
(3) Distributed-memory code (using MPI or a 'cluster' version of OpenMP) usually outperforms everything else. This is because, as a developer, you need to work out how to write the program with minimal dependencies between the code running on each node of the machine. That work is not easy... but once done (if it can be done), the program has few synchronizations, no memory contention, etc. I don't recommend MPI for anyone coding for 2-4 cores and tackling parallelism for the first time.
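
To make the scheduling difference concrete, here is a minimal sketch (the scale() loop is a made-up example, not from any benchmark we've run) of the same regular loop written with TBB's parallel_for, which always partitions work dynamically among its workers, and with OpenMP using a static schedule:

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>

    // Made-up kernel: scale every element of a float array in place.
    void scale_tbb(float* a, size_t n, float s) {
        // TBB carves the range into chunks and hands them to its worker
        // threads dynamically as they become free.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
            [=](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    a[i] *= s;
            });
    }

    void scale_omp(float* a, size_t n, float s) {
        // For a perfectly regular loop, a static schedule skips the runtime
        // bookkeeping of dynamic scheduling - the speed boost noted in (2).
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)n; ++i)
            a[i] *= s;
    }

Both versions are only a few lines, which is the ease-of-use point in (1); the schedule(static) clause is the tuning knob mentioned in (2).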

Comment Re:Threading (Score 1) 90

Thread pools are indeed an efficient way of managing native threads: they cut down on manual thread creation and destruction, and handing concurrent tasks to a set of worker threads is also quite convenient. Libraries such as Intel(R) Threading Building Blocks create thread pools with an optimal number of worker threads, based on detection of the available processors/cores.
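
As a rough illustration of the idea (a minimal sketch in plain standard C++, not TBB's actual internals; SimplePool is just a made-up name), the pool is sized from the detected core count and fed tasks through a shared queue:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class SimplePool {
    public:
        SimplePool() : stop_(false) {
            // Size the pool from the number of hardware threads detected.
            unsigned n = std::thread::hardware_concurrency();
            if (n == 0) n = 2;  // detection can fail; fall back to something sane
            for (unsigned i = 0; i < n; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~SimplePool() {
            { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
            cv_.notify_all();
            for (auto& w : workers_) w.join();
        }
        void submit(std::function<void()> task) {
            { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
                    if (stop_ && tasks_.empty()) return;
                    task = std::move(tasks_.front());
                    tasks_.pop();
                }
                task();  // run the task outside the lock
            }
        }
        std::vector<std::thread> workers_;
        std::queue<std::function<void()>> tasks_;
        std::mutex m_;
        std::condition_variable cv_;
        bool stop_;
    };

The worker threads are created once and reused for every submitted task, which is exactly the per-task overhead a pool is meant to avoid paying.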

Comment Re:Theoretical question [9 month pregnancies] (Score 1) 90

The original question did not ask about optimality, only about parallelizability. The cited paper "Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms" considers only sorts that are optimal, in the sense (to quote the abstract) that "the product of its time and processor bounds is upper bounded by a linear function of the input size." The O(1) construction I stated is not bound by that constraint, though I'll be the first to admit that the constraint is a good one for practical parallelism.

It is possible to report the sorted permutation in O(1) time using the usual PRAM model. Have a shared location X. The processor that finds the sorted permutation reports that it "won" by writing its processor id to location X. Of course, the number of processors required is silly.
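
If it helps to see it spelled out, here is a toy sequential simulation of that kind of construction (I'm assuming the usual "one processor per candidate permutation" scheme; obviously nothing here is O(1) on a real machine), with a plain variable standing in for the shared location X:

    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> input = {3, 1, 2};     // example input
        std::vector<int> idx(input.size());
        std::iota(idx.begin(), idx.end(), 0);   // candidate permutation of indices

        long id = 0, X = -1;                    // X stands in for the shared PRAM location
        do {
            // Conceptually, "processor id" checks its own permutation at the
            // same time as all the others; here we just loop over them.
            bool sorted = true;
            for (std::size_t i = 1; i < idx.size(); ++i)
                if (input[idx[i - 1]] > input[idx[i]]) { sorted = false; break; }
            if (sorted) X = id;                 // the winner writes its id to X
            ++id;
        } while (std::next_permutation(idx.begin(), idx.end()));

        std::printf("winning permutation id: %ld\n", X);
        return 0;
    }

With one processor per permutation, the concurrent write to X is the only coordination needed, which is why the time bound is constant and the processor count is, as noted, silly.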

Coarse grain parallelism, as in searching for Mersenne Primes, is great when you can exploit it. But some programs really do need finer grain parallelism, because they must deliver results within a certain time. For example, video games require delivering frames at 30-60 frames per second, and responding to the player's inputs in a few frames. Using 60 processors to each work on a separate frame, where each processor takes a whole second to generate a frame, is not practical, because even though the throughput requirements are met, the latency requirements (with respect to user input) are not.

Comment Re:synchronization protects *DATA* not code (Score 1) 90

You're right about protecting data. The problem/confusion likely stems from these tutorials not understanding, or not emphasizing, the definition of a "critical region" of code: that is, code in which shared variables are accessed (usually modified).
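
For example (a minimal sketch, not taken from any of the tutorials in question), the critical region below is only the increment of the shared counter, and the lock exists to protect that data, not the surrounding code:

    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex m;
    long shared_count = 0;          // the shared DATA being protected

    void worker(int iterations) {
        for (int i = 0; i < iterations; ++i) {
            std::lock_guard<std::mutex> lock(m);
            ++shared_count;         // critical region: shared variable modified here
        }
    }

    int main() {
        std::vector<std::thread> ts;
        for (int i = 0; i < 4; ++i) ts.emplace_back(worker, 100000);
        for (auto& t : ts) t.join();
        std::printf("%ld\n", shared_count);   // prints 400000 every run
        return 0;
    }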

I, too, am suspicious of lock-free synchronizations. They seem interesting, but are still topics for research. Also, the complexity and subtleties of the code are higher than I might recommend, especially for new concurrent coders. Consider the derivation of a lockless solution to "The Critical Section Problem" from Chapter 3 of M. Ben-Ari's Principles of Concurrent and Distributed Programming, 2/e. Four attempts at solution are made and proved to be lacking before the presentation of Dekker's Algorithm.
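
For anyone curious what that end point looks like, here is a sketch of Dekker's Algorithm for two threads, written with C++ atomics (sequentially consistent by default) rather than Ben-Ari's pseudocode; it assumes C++17 or later to compile as written:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<bool> wants[2] = {false, false};
    std::atomic<int> turn{0};
    long counter = 0;               // shared data protected by the protocol

    void enter(int me) {
        int other = 1 - me;
        wants[me] = true;
        while (wants[other]) {
            if (turn != me) {
                wants[me] = false;          // back off
                while (turn != me) { }      // busy-wait for our turn
                wants[me] = true;
            }
        }
    }

    void leave(int me) {
        turn = 1 - me;
        wants[me] = false;
    }

    void worker(int me) {
        for (int i = 0; i < 100000; ++i) {
            enter(me);
            ++counter;                      // critical section
            leave(me);
        }
    }

    int main() {
        std::thread a(worker, 0), b(worker, 1);
        a.join();
        b.join();
        std::printf("%ld\n", counter);      // prints 200000
        return 0;
    }

Even this "simple" protocol only works because the atomics forbid the reordering a compiler or CPU would otherwise perform, which is exactly the kind of subtlety I'd rather new concurrent coders not have to debug.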

Lock-free data structures will need to demonstrate a large performance gain over more straightforward synchronization methods before they are accepted by the general population of programmers.
