Shobhan_Intel's Journal: Parallel Programming
When thinking about parallel programming, there are three fundamental responsibilities that need to be addressed:
1) Identify the parallelism
2) Synchronize thread executions
3) Distribute data (global or local)
What should the tasks of the programmer be and what should be taken care of by the programming methodology? What do you think is the best approach for parallel programming: a new language, language extensions, or libraries?
We're also interested in what level of experience you have with parallel and threaded programming.
- How long have you been doing it, or, if you've never done it, what is making you consider it now?
- Are there specific challenges that you have faced as you embarked upon your past or current parallel programming efforts?
- Are you thinking about scalable parallel programming?
We're very interested in your feedback, though we appreciate your patience as we try to answer your questions while doing our day jobs. We'll likely be inviting additional members of our team as well to participate in this discussion topic and are looking forward to an interesting discussion this week.
some quick thoughts (Score:3, Insightful)
For web servers, it's pretty easy to shuffle incoming requests to the different processors. Apache and IIS do this already as far as I can tell. Database servers (once the transactions are tuned) are usually disk-bound, as are file servers.
Games have a lot of stuff going on and they tend to be written with a tight event loop. The problem isn't parallelizing this stuff, but more that game developers don't want their game to degrade or stutter at all on single-CPU machines, so they code for the lowest common denominator.
Stuff that deals with media (photos, movies, music files) can do filtering or ripping on multiple cores pretty easily; just divide and conquer.
SETI, folding, etc scale up really nicely.
Small-scale parallelism is really difficult. Half a dozen companies with fine-grain parallelizing compilers have come and gone. Maybe INTC can buy up their old IP and make something useful, who knows.
Re: (Score:1)
[Disclaimer: I'm lead developer for Intel(R) Threading Building Blocks library [Intel(R) TBB], and so am biased.]
The point about degrading or stuttering on a single CPU is a good one. It raises issues of scheduling. A scheduler can be fair (give all tasks equal time) or unfair. On a single CPU, running a fair scheduler can invite stuttering problems as the CPU alternates between tasks. Our TBB scheduler avoids this problem by using a Cilk-style unfair (and non-preemptive) scheduler. On a single CP
Re: (Score:2)
Having two different scheduling algorithms (fair and unfair) yields optimal execution time, but really really sucks for debugging. Essentially, a program running under the unfair scheduler has a much lower probability of encountering deadlocks and race conditions than a program running under the fair scheduler.
Older hardware has fewer cores. Given a lengthy software lifecycle, a program can make it through much of the early debugging and test phases without any problems. Processor technology improves.
Re: (Score:1)
Alas, in many situations, fair scheduling can have such a high overhead that the resulting program is slower than a serial program. A simple example is walking a tree. The "fair" way to do this is parallel breadth-first search, which can be quite a memory hog. The unfair way to do this is parallel depth-first search.
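To make the depth-first point concrete, here is a minimal sketch of an unfair, depth-first parallel tree walk (using C++11 std::async; the Node type, visit(), and the depth cutoff are illustrative, not TBB's actual mechanism):

    #include <future>

    struct Node { Node* left = nullptr; Node* right = nullptr; int value = 0; };

    void visit(Node* n) { /* per-node work */ }

    void walk(Node* n, int depth) {
        if (!n) return;
        visit(n);
        if (depth > 3) {            // deep subtrees run serially, so only a
            walk(n->left, depth);   // handful of branches are in flight at once;
            walk(n->right, depth);  // a breadth-first walk would instead queue a
            return;                 // whole frontier of nodes in memory
        }
        auto l = std::async(std::launch::async, walk, n->left, depth + 1);
        walk(n->right, depth + 1);  // this thread keeps descending depth-first
        l.get();
    }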
Ensuring that program correctness is independent of schedule timing is important. For Intel(R) TBB, a good way to debug these issues is to set the number of worker threads ridiculously high.
Servers, Virtual Machines, and Games (Score:2)
lazy programmer (Score:4, Interesting)
The main problem for me with multithread application programming is synchronization. Perhaps I'm just being lazy, but I avoid mutexes like the plague. They are so hard to test and debug that it's not worth it.
What I like to do is start up multiple threads only where there is a significant chunk of divisible work to be done, then let the threads work on it individually and separately until they're all done, then use pthread_join() to synchronize. This is usually referred to as "data parallelism" in the sense that the data is divided up and each thread is basically doing the same thing but to a different chunk of data. This really stresses out paging and cache since there is no locality, but it saves me a lot of headaches and it means the app gets delivered on time.
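Something like this minimal sketch (plain pthreads, compiled as C++; the array size, thread count, and doubling work are placeholders):

    #include <pthread.h>

    const int kThreads = 4;
    const int kN = 1000000;
    static double data[kN];

    struct Chunk { int begin, end; };

    void* work(void* arg) {
        Chunk* c = static_cast<Chunk*>(arg);
        for (int i = c->begin; i < c->end; ++i)
            data[i] *= 2.0;              // same operation, different chunk
        return 0;
    }

    int main() {
        pthread_t tid[kThreads];
        Chunk c[kThreads];
        for (int t = 0; t < kThreads; ++t) {
            c[t].begin = t * (kN / kThreads);
            c[t].end   = (t + 1) * (kN / kThreads);
            pthread_create(&tid[t], 0, work, &c[t]);
        }
        for (int t = 0; t < kThreads; ++t)
            pthread_join(tid[t], 0);     // the only synchronization point
        return 0;
    }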
The second model that I've had luck with on MT applications is client/server inside the same process. I think it's actually called "producer/consumer" parallelism, whatever. Anyway, this is useful when some external thing like disk or network is the real bottleneck, and having one thread queue, sort, or arbitrate requests in a way that maximizes throughput can help a lot.
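Sketched with C++11 primitives (the queue contents and the "slow device" are stand-ins; error handling omitted):

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<int> requests;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void producer() {
        for (int i = 0; i < 100; ++i) {
            { std::lock_guard<std::mutex> g(m); requests.push(i); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> g(m); done = true; }
        cv.notify_all();
    }

    void consumer() {                    // the one thread that owns the slow device
        for (;;) {
            std::unique_lock<std::mutex> g(m);
            cv.wait(g, []{ return !requests.empty() || done; });
            if (requests.empty()) return;
            int r = requests.front(); requests.pop();
            g.unlock();
            // ... service request r against the disk or network here ...
        }
    }

    int main() {
        std::thread p(producer), c(consumer);
        p.join(); c.join();
        return 0;
    }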
Anyway, my point is any new language or library that helps MT programming really needs to remove the mutex burden from the programmer. Access to shared resources needs to be noted and automatically protected without the programmer having to declare a mutex and lock it in exactly the right way. The language or library should be smart enough to see that two different threads are accessing the same memory (or file or whatever) and automatically lock/unlock the resource based on how the code in the threads uses it. No, this is not easy, but it's what needs to be done.
Re: (Score:2, Interesting)
Completely automatic locking might be more immediately feasible with some sort of threaded keyword on variable declarations that would automatically lock an
Re: (Score:3, Interesting)
Such a simple system can lead to two problems right off the top, though. T
Re: (Score:1)
Re: (Score:2)
Such a simple system can lead to two problems right off the top, though. The first is performance. If you have three statements in a row that involve auto-lock variables, you will pay three separate lock/unlock overheads, one for each statement. If the programmer used explicit synchronization, only one lock/unlock is needed to protect all three statements.
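For example (std::mutex sketch; the auto-locking wrapper is hypothetical):

    #include <mutex>

    std::mutex m;
    int count = 0;

    // Hypothetical auto-locked variable: every access takes and releases the lock.
    void auto_locked_increment() {
        std::lock_guard<std::mutex> g(m);
        ++count;
    }

    void three_statements_auto() {
        auto_locked_increment();   // lock, ++, unlock
        auto_locked_increment();   // lock, ++, unlock
        auto_locked_increment();   // lock, ++, unlock
    }

    void three_statements_explicit() {
        std::lock_guard<std::mutex> g(m);  // one lock covers all three statements
        ++count;
        ++count;
        ++count;
    }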
I think there is a more general need for a solution to that problem. There are wrappers for SDL or OpenGL which, after inlining, can result in consecutive SDL_{Lock|Unlock}Surface or gl{Begin|End} blocks.
What would be really cool is a function attribute that tells GCC: where this function is called and a call to this other specified function immediately and unconditionally follows (with no function calls in between), you can optimize out both calls.
Like the following (a hypothetical sketch; no such GCC attribute actually exists):
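    /* Purely hypothetical: the attribute name and syntax are invented to
       illustrate the idea, and the SDL declarations are simplified. */
    struct SDL_Surface;
    extern "C" int  SDL_LockSurface(SDL_Surface*);
    extern "C" void SDL_UnlockSurface(SDL_Surface*)
        __attribute__((cancels_immediately_following(SDL_LockSurface)));

    void f(SDL_Surface* s) {
        SDL_UnlockSurface(s);  /* after inlining, this adjacent unlock/lock */
        SDL_LockSurface(s);    /* pair could be optimized out as a no-op   */
    }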
Re: (Score:2, Interesting)
[Disclaimer: I'm lead developer for Intel(R) Threading Building Blocks library [Intel(R) TBB], and so am biased. What follows is my own opinion.]
It's tricky to automate locking, because locks really protect program invariants, not memory locations. For example, the invariant might be "flag EMPTY is true if and only if list LIST is empty." Programmers are going to have to mark the invariant somehow. There's been some research on automatic locking (like http://berkeley.intel-research.net/dgay/pubs/06-po [intel-research.net]
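A sketch of that invariant in code (names illustrative): the lock has to cover both variables together, because the invariant spans them, which is exactly what per-location automatic locking can't see.

    #include <list>
    #include <mutex>

    struct Tracker {
        std::mutex m;
        std::list<int> list;
        bool empty;          // invariant: empty == list.empty()
        Tracker() : empty(true) {}
    };

    void push(Tracker& t, int x) {
        std::lock_guard<std::mutex> g(t.m);
        t.list.push_back(x);
        t.empty = false;     // updating both under one lock preserves the
    }                        // invariant; per-variable auto-locking could
                             // expose a torn intermediate state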
Re: (Score:2)
I find the debugging difficulty scales roughly linearly in terms of the degree of concurrency with a CSP approach, and roughly exponentially with a threading approach.
Re: (Score:1)
Re: (Score:2)
I'm not sure that I see the distinction you're trying to draw there. Execution pitfalls are essentially algorithm design problems that haven't been properly dealt with. Threads may seem easier to design because they're "just like" sequential code,
Re: (Score:1)
Yes, I was being a bit obtuse there, wasn't I? Well, I agree with everything that you've said here, except the last sentence.
It all stems from being a lazy programmer (and I've admitted to this in front of many different audiences). As you say, I see threaded coding as "just like" sequential code, except for the breaking up of work to threads and the needed synchronizations (locks). If I'm using an expl
Re: (Score:2)
I understand that there's a lot of overhead involved there, bu
Re: (Score:1)
Re: (Score:2)
You can turn to databases to see that locking can't be completely automatic. In a table, changing a row might have an isolated effect on just that row, or it may indirectly impact other rows, or, if the table has explicit/implicit foreign keys, it may impact other tables.
threaded keyword (Score:1)
Re: (Score:2, Informative)
Re: (Score:2)
>> the thread pool and use one thread per function call to
>> loop through the container, running up to your pre-defined
>> number of threads.
This is called vectorization and is usually available in High Performance Fortran compilers and some vendor-supplied C/C++ compilers too. It works really well if the hardware (e.g., Cray) has been optimized for this model. My only caution is that when you create or communicate with a thread, there is real overhead per call.
Re: (Score:1)
[Disclaimer: I'm lead developer for Intel(R) Threading Building Blocks library [Intel(R) TBB], and so am biased.]
The overhead per parallel call is definitely a problem. It's a fundamental "grainsize" problem in parallel programming. The individual chunks of work have to be big enough to amortize the parallel overheads. Of course, the same can be said for amortizing the cost of a virtual function call. The difference is that parallel overheads tend to be much higher. For example, TBB needs at least
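As an illustration of grainsize with Intel(R) TBB's parallel_for (a sketch against the 2007-era TBB API; the grainsize of 10000 is an arbitrary example, not a recommendation):

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <tbb/task_scheduler_init.h>
    #include <vector>

    struct Scale {
        std::vector<double>& v;
        explicit Scale(std::vector<double>& v_) : v(v_) {}
        void operator()(const tbb::blocked_range<size_t>& r) const {
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= 2.0;                   // the per-element work
        }
    };

    int main() {
        tbb::task_scheduler_init init;         // required by early TBB releases
        std::vector<double> v(1000000, 1.0);
        // Each chunk gets at least ~10000 elements, so the scheduling
        // overhead is amortized over a meaningful amount of work.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size(), 10000),
                          Scale(v));
        return 0;
    }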
Re: (Score:1)
But thanks a lot for the info. Based on it, I'll definitely be restructuring my program, instead of le
Re: (Score:1)
Growing Pains (Score:2)
What I would like to see would be "resource aware" libraries. A simple example would be a vector class. The vector.sort() would notice that 4 processors are available, and break the sorting apart into 4 threads.
Re: (Score:3, Interesting)
We ship a parallel sort with Intel(R) Threading Building Blocks (Intel(R) TBB). It even works on vectors.
The difficulty with hiding parallelism inside objects is that it is effective only for lengthy operations. E.g., our parallel quicksort requires 500 or more elements before it goes parallel, because we found it typically didn't gain much when sorting smaller sequences.
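The cutoff idea, sketched with C++11 std::async rather than TBB's actual implementation (a merge sort for brevity; the 500 threshold echoes the number above):

    #include <algorithm>
    #include <future>

    template <typename It>
    void parallel_sort(It first, It last) {
        if (last - first < 500) {          // below the cutoff, parallel overhead
            std::sort(first, last);        // would outweigh the gain: stay serial
            return;
        }
        It mid = first + (last - first) / 2;
        auto left = std::async(std::launch::async,
                               [=] { parallel_sort(first, mid); });
        parallel_sort(mid, last);          // sort the right half on this thread
        left.get();
        std::inplace_merge(first, mid, last);
    }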
In general, effective parallelism is going to be at fairly high levels, and thus the responsibility of the components (by ensuri
Comments from a Student (Score:2)
How about all three! For some tasks, such as numerics, perhaps a functional-style programming language might do best, whereas for, say, C++, libraries and language extensions could go a long way. Memory models are also a very important issue that's left out here!! I'm currently trying to wrap my head around functional programming; it's worlds different from the state-machine-based code that I currently w
Re: (Score:1)
Functional languages are a great way to write race-free parallel code, but so far have lost out because:
Maybe once 4 or more cores are the norm, more programmers will find that the sequential losses can be made up by the parallel gains.
We wrote Intel(R) Threading Building Blocks as a pure library so that developers cou
Multicore programming plus MPI (Score:1)
However, when there may be tens of cores on a chip, and these may be connected together into SMPs, and these into heterogeneous clusters, you clearly run into problems. Multilevel parallelism will be a hard programming task indeed. New languages and new paradigms are urgently needed - but these shouldn't be too hard to learn for programmers.
Re: (Score:1)
Re: (Score:2)
New languages and new paradigms are urgently needed - but these shouldn't be too hard to learn for programmers.
I'm a fan of SCOOP for Eiffel [se.ethz.ch] on that front: it gives you a decent CSP-based system, but does so in a way that fits very naturally into standard OO programming (there's very little to learn). Of course Eiffel is somewhat of a niche language, but the ideas that inform SCOOP could feasibly be transferred to other OO languages such as Java and C++.
For the record (Score:1, Offtopic)
Algorithm design tools? (Score:1)
1) Identify the parallelism
I don't "code" but I have recently done a lot of business process work, and before that more algorithm design work than I care to think about.
Most of the failures I have seen in parallel work haven't been the code, the hardware or the compiler.
It is solving the wrong problem or doing it ass backwards that is the problem.
Yes, identify the parallelism, but what tools do you propose to make available to show what is going on?
There are enoug
Re: (Score:1)
Now, you'll want to drive the parallelism as high in the call stack as you can. That is, if functionC is found to be the hotspot of the code, don't automatically look for parallelism at that level.
Re: (Score:1)
"I'm not aware of any tools that you can use to identify where independent operations are with your code. "
"Right now, a profiler and your brain (with a good understanding of the code and the algorithms) are the best tools you can use for identifying (and implementing) parallelism."
And isn't that the problem?
"I'm not aware of any tools that you can use to identify where independent operations are with your code. "
W
What do Beowulf coders do with multi-core nodes? (Score:2, Interesting)
I assume that MPI is going to be the dominant programming model for cluster programmers. So what do you do when the nodes on the cluster become multi-core? Do you just run your MPI applications with the same number of processes, but on fewer nodes? Do you run your MPI apps on the same number of nodes but use more processes? Running more processes on a mult
Thread Safety.. (Score:2)
If I'm going multi-core, multi-processor for low latency, I usually try to pass a numProcs argument so that the intra-machine communications are maximized. I am really leery of going multi-threaded in general; I think the extra complexity will add marginal gains over inter-machine MPI except in truly pathological cases.
However if you have a very capable thread based system that
Re: (Score:1)
However, if you've got a hotspot in the process execution, why not thread that portion of code between communication calls? Use something simple like OpenMP and thread ve
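A sketch of that hybrid style (MPI between nodes, an OpenMP pragma on the local hotspot; compile with your MPI wrapper compiler plus an OpenMP flag such as -fopenmp, and note the loop body is a placeholder):

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1000000;
        static double local[n];

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)   // thread the hotspot
        for (int i = 0; i < n; ++i) {               // between communication calls
            local[i] = rank + i * 0.5;              // placeholder computation
            sum += local[i];
        }

        double total = 0.0;
        MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }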
Threads are not the answer (Score:2)
Re: (Score:1)
Re: (Score:1)
Erlang (Score:1)
How well does Erlang integrate the ugly (non-functional) stuff? Windows System Calls? Data Compression code? Graphics Coding? Radix Sorting? Parsers? Matrix Math?
Functional languages are great, but I keep getting the impression that they can't do much of what is expected in a modern language. Being able to implement optimized complex inner loops is a core function of many programming language platforms. How well does Erlang deal with those ugly, but critical sections of code that make a large applicat
Re: (Score:1)
Game programmers don't seem to have any problems thinking up ways to make use of all the processing power they can get a hold of, and for the average user, games might be the only reason to want more than two or four cores. Personally, I'm more concerned about getting all that power packed into a cool user interface like beryl (if you d
Re: (Score:1)
Re: (Score:2)
I've used JCSP for a few experimental programs, but nothing really substantial. It works pretty well, but is encumbered (compared to a purpose-designed concurrent language) by some hefty syntactic overhead. There is (IIRC) a port of JCSP to the .NET framework, and a similar (but not identical) library for C++ called C++CSP. A few handy links:
Note t
Re: (Score:1)
Re: (Score:1)
The Lee paper I referred to above offers another option: coordination languages, which provide the concurrent glue between sequential programs written in existing mainstream languages.
This sounds a lot like the Bulk Synchronous Parallel model proposed a few years ago. Using The Google, I see that there is still some interest, some research, and even a book published on the topic.
I don't think the problem is with threads, specifically, but the low-level interface used today to create, coordinate, and use threads. Current threading APIs are almost like programming in Assembly language (or lower). Nor do I think the number of threads is a problem. However, each thread used and sync
Theoretical question [9 month pregnancies] (Score:2)
The graphical advertisement that ran with this story had text riffing on the old line that nine women can't make a baby in one month. I googled that quote, and found a hit at an Intel blog.
Re: (Score:1)
Theory says that any algorithm can be parallelized to run in O(1) time, given enough processors. Alas, I can't find the citation. I think it's a paper from the early 1980s. The proof is based on simulating a Turing machine computation in O(1) time. The trick is to simulate all possible Turing machine executions that might yield the right answer, and pick out the right one.
For example, to sort N items in constant time, generate all possible permutations, and choose the one that is sorted. It only takes N! processors.
coherent theoretical framework? (Score:2)
Alas, I can't find the citation. I think it's a paper from the early 1980s.
If you can remember the paper, I'd love to know about it.
the original exact answer given by the sequential program really is hard or impossible to deliver faster with parallel programming
When you try to pin people down, the answer is usually something like "this computation is believed to be difficult to parallelize" [given the constraint of k processors].
My question: Has anyone tried to take the thing beyond a vague sense
Re: (Score:2)
I saw something like this in the Journal of the ACM from the 60's or early 70's, I believe. I can't remember which year. It was full of math I didn't understand at the time (not that I would understand it any better now, but it's been a very long time...). Anyway, you have to be a member to access their library [acm.org], and I'm not a member, so I can't look it up for you.
A brief search turns up Structuring of Parallel Algorithms [acm.org] by P.A.Gilmore, 1968.
Hopefully this will narrow the GP's search :-). Enjoy.
Paper citation (Score:1)
Re: (Score:2)
> Theory says that any algorithm can be parallelized to run in O(1) time
No. "Work optimal" is a term for an optimal parallel algorithm, and it isn't always O(1). As I recall, the lower limit for sorting n items is lg n, and you can't improve on this no matter how many processors you have available. See "Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms" by Rajasekaran and Reif. The trouble with your example is that, yes, if each of those n! processors is programmed to do a diff
Re: (Score:1)
The original question did not ask about optimality, but only about parallelizability. The cited paper "Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms" considers only sorts that are optimal, in the sense (to quote the abstract) that "the product of its time and processor bounds is upper bounded by a linear function of the input size." The O(1) construction I stated is not bound by that constraint, though I'll be the first to admit that the constraint is a good one for practical parallel algorithms.
Parallel theory is still in infancy (Score:1)
I read "popular" math books w
What is with this ad? (Score:1)
What is with this ad? I'm used to running into posts that are mildly (or massively) offensive to women on slashdot. After all, the population is almost exclusively male, and it is to be expected. But to have something like this glaring at me from the front page is just a bit overboard, even for slashdot.
Most ads are designed to deliver the message: there is this product that is common but then there is our product which is way cool and better in every way.
Re: (Score:1)
I originally read this nine-women counter-example in Fred Brooks' "The Mythical Man-Month" (originally published in 1975). I suppose some other gestation period could have been used, but how many folks know offhand how long pregnancy lasts for reindeer or horses or giraffes or cats?
Look at Embedded Systems (Score:2)
I work with many embedded control systems, and parallel embedded monitoring and control systems. These applications tend to push the extremes: reliability, latency and cost.
Simple embedded systems, like Programmable Logic Controllers (PLCs), structure all their code in a very well-defined way. Often this is simply "one big loop" containing input, computation, then output. The multi-tasking PLCs define atomic copy instructions, such that all the input (or output) can be done in one synchronized instruction.
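The "one big loop" structure, sketched (the Image type, the control logic, and the io_* functions are illustrative placeholders for the synchronized hardware copies):

    struct Image { int sensors[16]; int actuators[16]; };

    // Stand-ins for the atomic copy instructions the hardware provides:
    void io_read_all(Image* in)        { /* one synchronized copy of every input  */ }
    void io_write_all(const Image* out){ /* one synchronized copy of every output */ }

    void compute(const Image* in, Image* out) {
        out->actuators[0] = in->sensors[0];   // placeholder control logic
    }

    void scan_loop() {
        Image in, out;
        for (;;) {
            io_read_all(&in);     // inputs are a stable snapshot for this pass
            compute(&in, &out);   // pure computation, no I/O races possible
            io_write_all(&out);   // outputs published all at once
        }
    }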
CSP for Embedded Systems (Score:1)
Re: (Score:1)
Research question (Score:2)
Re: (Score:2)
Hmmm... you seem to have missed the development that research funding from DARPA, NRL, ONR, ARL, NSF, etc. has gotten as scarce as hen's teeth since we started funneling it all to Halliburton. It's the rare school that will cough up its own $$$ for research equipment, you... big twerp.
Hardware transactional memory (Score:1)
Re: (Score:1)
Third party libraries and thread safety (Score:2, Interesting)
We have no ability to force our users to initialize our caches. In C/C++ code, C++ static initialization doesn't work reliably on every platform (e.g., HP-UX and Apple Mac
Re: (Score:1)
My understanding was that the volatile keyword would force reading of the actual memory location. Thus the compiler wouldn't attempt to either optimize or reorder accesses to memory locations corresponding to I/O. The following is the documentation from the Microsoft C++ compiler:
Objects declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object at the point it is requested, even if a previous instruction asked for a value from the same object. Also, the value of the object is written immediately on assignment. Also, when optimizing, the compiler must maintain ordering among references to volatile objects as well as references to other global objects. In particular, a write to a volatile object (volatile write) has Release semantics; a reference to a global or static object that occurs before a write to a volatile object in the instruction sequence will occur before that volatile write in the compiled binary. A read of a volatile object (volatile read) has Acquire semantics; a reference to a global or static object that occurs after a read of volatile memory in the instruction sequence will occur after that volatile read in the compiled binary. This allows volatile objects to be used for memory locks and releases in multithreaded applications.
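The handoff that quote describes, sketched with standard C++11 atomics (under MSVC's extended semantics a volatile bool would behave similarly, but portable code should spell the acquire/release out; names are illustrative):

    #include <atomic>

    int payload = 0;
    std::atomic<bool> ready(false);

    void writer() {
        payload = 42;                                   // ordinary write...
        ready.store(true, std::memory_order_release);   // ...published by the release
    }

    void reader() {
        while (!ready.load(std::memory_order_acquire))  // acquire pairs with the
            ;                                           // writer's release
        // payload is guaranteed to be 42 here
    }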
Re: (Score:1)
Re: (Score:1)
RCU arrays (Score:1)
(It would maintain a change list in some sort of sparse array structure.)
It would overload [] and assignment to make this change mostly transparent.
There could be a mutex locked method to make a public change.
That would help me a lot with image processing parallelism.
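A sketch of that idea, using a whole-array copy rather than the sparse change list, and named methods rather than operator overloads (shared_ptr handles the reclamation side of RCU here; names are illustrative):

    #include <memory>
    #include <mutex>
    #include <vector>

    class CowArray {
        std::shared_ptr<const std::vector<int> > data_;
        std::mutex write_m_;
    public:
        explicit CowArray(size_t n)
            : data_(std::make_shared<const std::vector<int> >(n)) {}

        // Readers take a cheap immutable snapshot and index it freely,
        // holding no lock while they work on it.
        std::shared_ptr<const std::vector<int> > snapshot() const {
            return std::atomic_load(&data_);
        }

        // The mutex-locked method for making a public change:
        // copy, modify, publish.
        void set(size_t i, int v) {
            std::lock_guard<std::mutex> g(write_m_);
            std::shared_ptr<std::vector<int> > copy =
                std::make_shared<std::vector<int> >(*data_);
            (*copy)[i] = v;
            std::atomic_store(&data_,
                std::shared_ptr<const std::vector<int> >(copy));
        }
    };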
synchronization protects *DATA* not code (Score:2)
In general, locks (mutexes, semaphores, etc.) are there to protect data, not code! This may be obvious to experienced developers, but it is really hard trying to get some people to wrap their brains around that concept. It doesn't help when some tutorials talk about "critical sections" of code. Sure, there are sections of code that are critical,
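One way to keep that straight in code is to bundle the lock with the data it guards (a minimal sketch; the Account type is illustrative):

    #include <mutex>

    struct Account {
        std::mutex m;        // protects the balance below (the data),
        long balance_cents;  // not any particular stretch of code
    };

    void deposit(Account& a, long amount) {
        std::lock_guard<std::mutex> g(a.m);
        a.balance_cents += amount;   // any code path touching balance_cents
    }                                // must hold a.m first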
Re: (Score:1)
I, too, am suspicious of lock-free synchronizations. They seem interesting, but are still topics for research. Also, the complexity and subtleties of the code are higher than I might recommend, especially for new concurrent coders. Consider the derivati
Re: (Score:1)
Add me to the suspicious crowd. Lockless algorithms are the domain of experts. It's been noted by others that if you find a lockless algorithm for a non-trivial data structure, it's a publishable result.
Lockless algorithms require extremely careful checking. Use of a model checker is essential. I've been happy with SPIN http://spinroot.com/spin/whatispin.html [spinroot.com] for doing this checking, but model checking is a lot of work with any tool.
In my experience, lockless algorithms have had poor performance
Threading (Score:2)
Until that happens, I'll confine myself to thread pools and worker threads.
Re: (Score:1)
I'm doing parallelization already with fork/wait (Score:2)
Re:I'm doing parallelization already with fork/wai (Score:2)
It is available now on mainframes (Score:1)
I have written numerous multi-activity programs, from async/sync comm, transaction, and batch, with and without a DBMS. I only used MASM (1100/2200), Fortran, COBOL, a small amount of C, and PLUS, an internal language used to write portions of the OS, which is also the Intermediate Language (IL)
Start with the tools (Score:1)