Journal Shobhan_Intel's Journal: Parallel Programming 90

Hello Slashdot community. Shobhan and Clay here to talk about parallel programming this week, a topic that is both interesting and familiar to us. As processors for server, desktop and mobile platforms move from dual to quad core and beyond, software must be parallelized to best benefit from the potential performance gains now possible. Software developers have successfully introduced parallelism to their software through multithreading using Intel resources such as software development tools and training. Let's get this discussion kicked off with some topics at the top of our minds.

When thinking about parallel programming, there are three fundamental responsibilities that need to be addressed:

1) Identify the parallelism
2) Synchronize thread executions
3) Distribute data (global or local)

What should the tasks of the programmer be and what should be taken care of by the programming methodology? What do you think is the best approach for parallel programming: a new language, language extensions, or libraries?

We're also interested in what level of experience you have with parallel and threaded programming.

  • How long have you been doing it, or, if you've never done it, what is making you consider it now?
  • Are there specific challenges that you have faced as you embarked upon your past or current parallel programming efforts?
  • Are you thinking about scalable parallel programming?

We're very interested in your feedback, though we appreciate your patience as we try to answer your questions while doing our day jobs. We'll likely be inviting additional members of our team as well to participate in this discussion topic and are looking forward to an interesting discussion this week.

This discussion has been archived. No new comments can be posted.

Parallel Programming

  • by fred fleenblat ( 463628 ) on Monday April 02, 2007 @02:25PM (#18577045) Homepage
    In the context of plain old office tasks and email/web surfing, use the extra cores for the OS and for background services, leaving one core 100% available for the foreground application. This is probably the most realistic use of multiple cores for most people. It kind of makes Intel an "enabler" of OSes that hog a lot of resources, but so be it.

    For web servers, it's pretty easy to shuffle incoming requests to the different processors. Apache and IIS do this already as far as I can tell. Database servers (once the transactions are tuned) are usually disk-bound, as are file servers.

    Games have a lot of stuff going on and they tend to be written with a tight event loop. The problem isn't parallelizing this stuff, but more that game developers don't want their game to degrade or stutter on single-CPU machines at all, so they code for the lowest common denominator.

    Stuff that deals with media like photos, movies, and music files can do filtering or ripping on multiple cores pretty easily - just divide and conquer.

    SETI, folding, etc scale up really nicely.

    Small-scale parallelism is really difficult. Half a dozen companies with fine-grain parallelizing compilers have come and gone. Maybe INTC can buy up their old IP and make something useful, who knows.
    • [Disclaimer: I'm lead developer for Intel(R) Threading Building Blocks library [Intel(R) TBB], and so am biased.]

      The point about degrading or stuttering on a single CPU is a good one. It raises issues of scheduling. A scheduler can be fair (give all tasks equal time) or unfair. On a single CPU, running a fair scheduler can invite stuttering problems as the CPU alternates between tasks. Our TBB scheduler avoids this problem by using a Cilk-style unfair (and non-preemptive) scheduler. On a single CP

      • Having two different scheduling algorithms (fair and unfair) yields optimal execution time, but really really sucks for debugging. Essentially, a program running under the unfair scheduler has a much lower probability of encountering deadlocks and race conditions than a program running under the fair scheduler.

        Older hardware has fewer cores. Given a lengthy software lifecycle, a program can make it through much of the early debugging and test phases without any problems. Processor technology improves.

        • Alas, in many situations, fair scheduling can have such a high overhead that the resulting program is slower than a serial program. A simple example is walking a tree. The "fair" way to do this is parallel breadth-first search, which can be quite a memory hog. The unfair way to do this is parallel depth-first search.

          Ensuring that program correctness is independent of schedule timing is important. For Intel(R) TBB, a good way to debug these issues is to set the number of worker threads ridiculously h

    • I just built a server to run virtual machines of my existing servers on. Picked up a Quad-core 5355 Xeon processor and the mobo will let me add a second CPU later when I need it. This sucker is running several servers that have a decent load on them and is doing a good job of it. I enabled each virtual machine to have two virtual processors too. Since most server applications are multi-threaded it seems a good idea to do so. It's going to make management a lot easier condensing everything to one physical ma
  • lazy programmer (Score:4, Interesting)

    by fred fleenblat ( 463628 ) on Monday April 02, 2007 @02:40PM (#18577275) Homepage
    Speaking more specifically...

    The main problem for me with multithreaded application programming is synchronization. Perhaps I'm just being lazy, but I avoid mutexes like the plague. They are so hard to test and debug that it's not worth it.

    What I like to do is start up multiple threads only where there is a significant chunk of divisible work to be done, then let the threads work on it individually and separately until they're all done, then use pthread_join() to synchronize. This is usually referred to as "data parallelism" in the sense that the data is divided up and each thread is basically doing the same thing but to a different chunk of data. This really stresses out paging and cache since there is no locality, but it saves me a lot of headaches and it means the app gets delivered on time.

    The second model that I've had luck with on MT applications is client/server inside the same process. I think it's actually called "producer/consumer" parallelism, whatever. Anyway, this is useful when some external thing like disk or network is the real bottleneck and having one thread queue, sort, or arbitrate requests in a way that maximizes throughput can help a lot.

    Anyway, my point is that any new language or library that helps MT programming really needs to remove the mutex burden from the programmer. Access to shared resources needs to be noted and automatically protected without the programmer having to declare a mutex and lock it in exactly the right way. The language or library should be smart enough to see that two different threads are accessing the same memory (or file or whatever) and automatically lock/unlock the resource based on how the code in threads uses it. No, this is not easy, but it's what needs to be done.
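    A minimal sketch of the fork/join "data parallelism" pattern described in this comment, using plain pthreads; the chunk layout, thread count, and process() function are hypothetical placeholders, and it assumes the vector has at least as many elements as threads:

      #include <pthread.h>
      #include <vector>

      // Hypothetical per-element work.
      static void process(double& x) { x *= 2.0; }

      struct Chunk { double* data; size_t count; };

      // Each thread works on its own chunk; no shared writes, so no mutex needed.
      static void* worker(void* arg) {
          Chunk* c = static_cast<Chunk*>(arg);
          for (size_t i = 0; i < c->count; ++i)
              process(c->data[i]);
          return 0;
      }

      void process_in_parallel(std::vector<double>& data, int nthreads) {
          std::vector<pthread_t> threads(nthreads);
          std::vector<Chunk> chunks(nthreads);
          size_t per = data.size() / nthreads;
          for (int t = 0; t < nthreads; ++t) {
              chunks[t].data  = &data[t * per];
              chunks[t].count = (t == nthreads - 1) ? data.size() - t * per : per;
              pthread_create(&threads[t], 0, worker, &chunks[t]);
          }
          for (int t = 0; t < nthreads; ++t)   // the pthread_join() barrier mentioned above
              pthread_join(threads[t], 0);
      }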
    • Re: (Score:2, Interesting)

      by jbo5112 ( 154963 )
      Boost threads [boost.org] (C++) provides a scoped_lock object that works with a mutex. When it's created with a specific mutex, the mutex gets locked. The mutex gets unlocked when you call unlock() or when the scoped_lock goes out of scope. It's not everything you hoped for, but it will at least unlock for you when you forget. In many cases this is sufficient.

      Completely automatic locking might be more immediately feasible with some sort of threaded keyword on variable declarations that would automatically lock an
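      A minimal sketch of the scoped-lock idiom described above, assuming Boost.Thread's boost::mutex; the counter name is hypothetical:

        #include <boost/thread/mutex.hpp>

        boost::mutex counter_mutex;
        long counter = 0;

        void increment() {
            // Locks counter_mutex on construction...
            boost::mutex::scoped_lock lock(counter_mutex);
            ++counter;
            // ...and unlocks automatically when 'lock' goes out of scope,
            // even if the code above throws.
        }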
      • Re: (Score:3, Interesting)

        by Clay_Intel ( 1080849 )
        Notwithstanding Arch's comments below about the feasibility of automatic locks, I would think that some keyword at declaration time is going to be the target to shoot for. There will be the overhead of some kind of protection needed regardless of whether a variable is being read or written. If the code can determine when serial execution is being done (versus concurrent execution) the overhead of synchronization could be skipped.

        Such a simple system can lead to two problems right off the top, though. T

        • by jbo5112 ( 154963 )
          On the performance issue, if it's unlocking and locking the same data, it would have to optimize it so that the system isn't unlocking the data in between steps that you need it to stay locked for. A fancy compiler might even be able to detect when variables are always used as a group and create one group lock for all of them, but that could also be left up to the programmer to catch and encapsulate them in an object or struct.
          Such a simple system can lead to two problems right off the top, though. The first is performance. If you have three statements in a row that involve auto-lock variables, you will execute three different lock/unlock overheads, one for each statement. If the programmer used explicit synchronization, only one lock/unlock is needed to protect all three statements.

          I think there is a more general need for a solution to that problem. There are wrappers for SDL or OpenGL which, after inlining, can result in consecutive SDL_{Lock|Unlock}Surface or gl{Begin|End} blocks.

          What would be really cool is a function attribute that tells GCC: when this function is called and a call to this other specified function immediately and unconditionally follows (with no function calls in between), you can optimize out both calls.

          Like:

          void glEnd () __attribute__ ((undone_by (glBegin)));

    • Re: (Score:2, Interesting)

      by Arch_Intel ( 1081675 )

      [Disclaimer: I'm lead developer for Intel(R) Threading Building Blocks library [Intel(R) TBB], and so am biased. What follows is my own opinion.]

      It's tricky to automate locking, because locks really protect program invariants, not memory locations. For example, the invariant might be "flag EMPTY is true if and only if list LIST is empty." Programmers are going to have to mark the invariant somehow. There's been some research on automatic locking (like http://berkeley.intel-research.net/dgay/pubs/06-po [intel-research.net]

    • Any language based on the CSP model (such as Erlang, Termite, etc) will do this. You don't need explicit locks, because you don't have shared data structures. If you want to share some data then you use a broker process which passes the object to whichever process requests it. It is still possible to introduce deadlocks, but it is much, much harder.

      I find the debugging difficulty scales roughly linearly in terms of the degree of concurrency with a CSP approach, and roughly exponentially with a threading
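      A rough C++ rendering of the channel idea (not Erlang): threads share nothing except a small message queue hidden behind send()/receive(), so user code never touches a lock. The Channel class, message type, and sentinel value are hypothetical, built on pthreads:

        #include <pthread.h>
        #include <queue>
        #include <cstdio>

        // A tiny blocking channel: the only shared state, hidden behind send()/receive().
        template <typename T>
        class Channel {
            std::queue<T> q;
            pthread_mutex_t m;
            pthread_cond_t  cv;
        public:
            Channel()  { pthread_mutex_init(&m, 0); pthread_cond_init(&cv, 0); }
            ~Channel() { pthread_cond_destroy(&cv); pthread_mutex_destroy(&m); }
            void send(const T& msg) {
                pthread_mutex_lock(&m);
                q.push(msg);
                pthread_cond_signal(&cv);
                pthread_mutex_unlock(&m);
            }
            T receive() {
                pthread_mutex_lock(&m);
                while (q.empty()) pthread_cond_wait(&cv, &m);
                T msg = q.front(); q.pop();
                pthread_mutex_unlock(&m);
                return msg;
            }
        };

        Channel<int> requests;

        static void* consumer(void*) {
            for (;;) {
                int job = requests.receive();   // blocks until a message arrives
                if (job < 0) break;             // sentinel: shut down
                std::printf("handled job %d\n", job);
            }
            return 0;
        }

        int main() {
            pthread_t t;
            pthread_create(&t, 0, consumer, 0);
            for (int i = 0; i < 5; ++i) requests.send(i);
            requests.send(-1);
            pthread_join(t, 0);
        }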

      • I read Hoare's book on CSP about 15 years ago, but I'm not able to recall any of the details. I've only just learned how to spell Erlang, so I can't comment about that directly. However, a message-based parallelism solves some problems and creates others. The synchronization of threads (processes) is less of a problem, but the efficient distribution of data can take the time of the developer, moreso in distributed systems. Running in shared-memory alleviates some of the problems, but cache issues start
        • I've done both threaded and message-passing parallelism. The former has more execution pitfalls to be overcome and the latter has more algorithm design work needed up front. Either choice will work, it just depends where you want to spend your development time.

          I'm not sure that I see the distinction you're trying to draw there. Execution pitfalls are essentially algorithm design problems that haven't been properly dealt with. Threads may seem easier to design because they're "just like" sequential code,

          • [Warning: I work for Intel. This comment contains a shameless plug for an Intel software product.]

            Yes, I was being a bit obtuse there, wasn't I? Well, I agree with everything that you've said here, except the last sentence.

            It all stems from being a lazy programmer (and I've admitted to this in front of many different audiences). As you say, I see threaded coding as "just like" sequential code, except for the breaking up of work to threads and the needed synchronizations (locks). If I'm using an expl

            • I don't have any experience with threading in C/C++, since I mostly deal with PERL these days. I've got to tell you, though, that it doesn't get much easier than PERL's iThreads. Essentially, when you start a thread, you by default copy your in-scope variables over, unless you've declared them as shared. The perl library has built-in objects for safe passing of information between threads, and locks automatically unlock when they go out of scope.

              I understand that there's a lot of overhead involved there, bu
              • What iThreads is doing seems interesting, especially the focus on designing the library in a way that reduces the likelihood of data races/deadlocks - at the potential expense of performance. One of the newer trends we're seeing is the emergence of threading for performance or data parallel programming beyond high performance computing. Those who see the value in threading for functionality (or when you just want asynchronous behavior) can effectively continue to use libraries, APIs and tools that have ex
    • by suv4x4 ( 956391 )
      The language or library should be smart enough to see that two different threads are accessing the same memory (or file or whatever) automatically lock/unlock the resource based on how the code in threads uses it. No, this is not easy, but it's what needs to be done.

      You can turn to databases to see that locking can't be completely automatic. In a table, changing a row might have isolated effect on just this row, or it may impact indirectly other rows, or if it has explicit/implicit foreign keys may impact o
  • I haven't done much parallel programming, just a few server apps, which lend themselves easily to parallel processing, but I've found Boost's [boost.org] threads library to work nicely for the time being. Currently, you are required to use external libraries for a consistent way of multi-threading your program in Visual C++ and GCC. A simple and possibly good solution might be to add a C/C++ keyword to a variable declaration to tell the compiler that an object should have an associated mutex, and of course actually creating
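    A library approximation of the "keyword on a variable declaration" idea from this comment: a hypothetical Guarded<T> template that pairs a value with its own mutex and only hands the value out through an RAII handle (sketched with Boost.Thread, matching the library mentioned above):

      #include <boost/thread/mutex.hpp>

      // Hypothetical wrapper: the value can only be touched while its mutex is held.
      template <typename T>
      class Guarded {
          T value;
          boost::mutex m;
      public:
          explicit Guarded(const T& init = T()) : value(init) {}

          // RAII handle: locks on construction, unlocks on destruction.
          class Access {
              Guarded& g;
              boost::mutex::scoped_lock lock;
          public:
              explicit Access(Guarded& guarded) : g(guarded), lock(guarded.m) {}
              T& operator*()  { return g.value; }
              T* operator->() { return &g.value; }
          };
      };

      Guarded<long> hits;

      void record_hit() {
          Guarded<long>::Access h(hits);   // lock taken here
          ++*h;                            // safe access
      }                                    // lock released here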
    • Re: (Score:2, Informative)

      Extending STL is a very interesting idea, we tried it ourselves - "Intel Threading Building Blocks" - see http://www.intel.com/software/products/tbb [intel.com] to download an evaluation copy. We're a few days away from updating to v1.1 (next week I think). This is our try at doing this - we are getting good feedback... it is a pretty big package of algorithms, containers, scheduling, mutual access (locks/atom) support, etc.. Yes - it's a product - I'm interested in feedback what we should do beyond this product. (Int
    • >> When you use the function for_each, it would reach into
      >> the thread pool and use one thread per function call to
      >> loop through the container, running up to your pre-defined
      >> number of threads.

      This is called vectorization and is usually available in high-performance Fortran compilers and some vendor-supplied C/C++ compilers too. It works really well if the hardware (e.g. Cray) has been optimized for this model. My only caution is that when you create or communicate with a thre
      • [Disclaimer: I'm lead developer for Intel(R) Threading Building Blocks library [Intel(R) TBB], and so am biased.]

        The overhead per parallel call is definitely a problem. It's a fundamental "grainsize" problem in parallel programming. The individual chunks of work have to be big enough to amortize the parallel overheads. Of course, the same can be said for amortizing the cost of a virtual function call. The difference is that parallel overheads tend to be much higher. For example, TBB needs at least
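        For reference, a minimal TBB parallel_for sketch written against the TBB API of this era, with the grainsize made explicit; the array name, the per-element work, and the value 10000 are just illustrative assumptions:

          #include "tbb/task_scheduler_init.h"
          #include "tbb/parallel_for.h"
          #include "tbb/blocked_range.h"

          struct ApplyFoo {
              float* a;
              void operator()(const tbb::blocked_range<size_t>& r) const {
                  for (size_t i = r.begin(); i != r.end(); ++i)
                      a[i] *= 2.0f;  // hypothetical per-element work
              }
          };

          void parallel_double(float* a, size_t n) {
              tbb::task_scheduler_init init;  // required by TBB of this era; newer TBB initializes automatically
              ApplyFoo body; body.a = a;
              // Grainsize of 10000 elements: ranges smaller than this are not split further,
              // so each piece of work is big enough to amortize the scheduling overhead.
              tbb::parallel_for(tbb::blocked_range<size_t>(0, n, 10000), body);
          }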

        • by jbo5112 ( 154963 )
          My thought isn't to create threads each time something like for_each is used, although that would be better than nothing, but my thought is to create threads with the container and use those threads to do the processing. If the container will be around for a while, it should be fairly easy to get 10,000+ instructions per chunk. Ideally the program would even share the same threads for all the containers.

          But thanks a lot for the info. Based on it, I'll definitely be restructuring my program, instead of le
  • First off, getting mutex-lock-based parallelism to work at any real complexity is a mess. Getting a message system going with MPI is still a mess for anything that has a non-uniform structure between nodes. Using Java grid computing is a monster to implement, and isn't appropriate for most applications.

    What I would like to see would be "Resource aware" libraries. A simple example would be vector class. The vector.sort() would notice that 4 processors are available, and break the sorting apart into 4 t

    • Re: (Score:3, Interesting)

      by Arch_Intel ( 1081675 )

      We ship a parallel sort with Intel(R) Threading Building Blocks (Intel (R) TBB). It even works on vectors.

      The difficulty with hiding parallelism inside objects is that it is effective only for lengthy operations. E.g., our parallel quicksort requires 500 or more elements before it goes parallel, because we found it typically didn't gain much when sorting smaller sequences.

      In general, effective parallelism is going to be at fairly high levels, and thus the responsibility of the components (by ensuri
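      For the curious, the call is a drop-in replacement for std::sort; a minimal usage sketch (the vector and its contents are just an example):

        #include "tbb/parallel_sort.h"
        #include <vector>

        void sort_scores(std::vector<double>& scores) {
            // Sorts in parallel for large inputs; for small ones the library
            // falls back to a serial sort, per the threshold discussed above.
            tbb::parallel_sort(scores.begin(), scores.end());
        }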

  • What do you think is the best approach for parallel programming: a new language, language extensions, or libraries?

    How about all three! For some tasks, such as numerics, perhaps a functional-style programming language might do best, whereas for, say, C++, libraries and language extensions could go a long way. Memory models are also a very important and often left-out issue here! I'm currently trying to wrap my head around functional programming; it's worlds different from the state-machine-based code that I currently w

    • Functional languages are a great way to write race-free parallel code, but so far have lost out because:

      • Their sequential performance tends to be slower, which can be a showstopper for vendors shipping code now.
      • They require rewriting code from scratch, which can be another showstopper.

      Maybe once 4 or more cores are the norm, more programmers will find that the sequential losses can be made up by the parallel gains.

      We wrote Intel(R) Threading Building Blocks as a pure library so that developers cou

  • I have been using MPI for parallel programming, and it actually works quite well, even on shared-memory systems. Although it is harder at the start than OpenMP, usually the overall effort is about the same, or even less due to more straightforward logic.

    However, when there may be tens of cores on a chip, and these may be connected together into SMPs, and these into heterogeneous clusters, you clearly run into problems. Multilevel parallelism will be a hard programming task indeed. New languages and new paradig

    • What I like about explicit message passing is that it forces the programmer to confront the communication issues up front, when writing the program, not when debugging the performance later. When I'm pushing multi-threaded shared-memory performance, I end up having to design in terms of message passing, where the messages are cache lines ping-ponging between processors.
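      One concrete form of that cache-line ping-ponging is false sharing: per-thread counters that happen to sit in the same line. A small sketch of the usual fix, padding each counter out to its own line (the 64-byte line size, and the assumption that the array itself starts on a line boundary, are assumptions, not universal facts):

        // Two threads updating adjacent longs share a cache line and fight over it.
        struct CountersBad {
            long thread0_count;
            long thread1_count;        // same cache line as thread0_count
        };

        // Padding each counter to a full (assumed) 64-byte line stops the ping-ponging.
        struct PaddedCounter {
            long count;
            char pad[64 - sizeof(long)];
        };

        struct CountersGood {
            PaddedCounter per_thread[2];   // each counter now lives in its own line
        };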
    • by Coryoth ( 254751 )

      New languages and new paradigms are urgently needed - but these shouldn't be too hard to learn for programmers.

      I'm a fan of SCOOP for Eiffel [se.ethz.ch] on that front: it gives you a decent CSP-based system, but does so in a way that fits very naturally into standard OO programming (there's very little to learn). Of course Eiffel is somewhat of a niche language, but the ideas that inform SCOOP could feasibly be transferred to other OO languages such as Java and C++.

  • For the record (Score:1, Offtopic)

    by eclectro ( 227083 )
    Many of us on Slashdot have experience with Beowulf clusters.
  • Are you sure this is the place to start?
    1) Identify the parallelism

    I don't "code" but I have recently done a lot of business process work, and before that more algorithm design work than I care to think about.

    Most of the failures I have seen in parallel work haven't been the code, the hardware or the compiler.

    It is solving the wrong problem or doing it ass backwards that is the problem.

    Yes, identify the parallelism, but what tools do you propose to make available to show what is going on?

    There are enoug
    • There's more art than science to this step and there is some debate about where to implement the parallelism. To get the biggest bang for the buck, you need to somehow parallelize those portions of the code that take the most execution time. Thus, a profiler is one of the first tools to be used.

      Now, you'll want to drive the parallelism as high as possible in the call stack as you can. That is, if functionC is found to be the hotspot of the code, don't automatically look for parallelism at that level.

      • by tqft ( 619476 )
        "Now, you'll want to drive the parallelism as high as possible in the call stack as you can. "

        "I'm not aware of any tools that you can use to identify where independent operations are with your code. "

        "Right now, a profiler and your brain (with a good understanding of the code and the algorithms) are the best tools you can use for identifying (and implementing) parallelism."

        And isn't that the problem?

        "I'm not aware of any tools that you can use to identify where independent operations are with your code. "

        W
  • I thought I'd post this as a separate comment, rather than as a reply to the Beowulf post ("For the record"), just to give it a bit more visibility.

    I assume that MPI is going to be the dominant programming model for cluster programmers. So what do you do when the nodes on the cluster become multi-core? Do you just run your MPI applications with the same number of processes, but on fewer nodes? Do you run your MPI apps on the same number of nodes but use more processes? Running more processes on a mult

    • I've heard of a few thread-safe MPI implementations. But it seems like a big chunk of complexity to add for a pretty small payoff.

      If I'm going multi-core multi-processor for low latency, I usually try to pass a numProcs argument so that the intra-machine communications are maximized. I am really leery of going multi-threaded in general; I think the extra complexity will add marginal gains over inter-machine MPI except in truly pathological cases.

      However if you have a very capable thread based system that

      • I wasn't thinking about thread-safety of MPI implementations. I've always advocated not using any MPI calls within threaded regions, so thread-safety of the MPI wasn't a real issue. I agree that the added complexity would be nowhere near the gains you might achieve and mixing these two parallel methods can lead to debugging nightmares.

        However, if you've got a hotspot in the process execution, why not thread that portion of code between communication calls? Use something simple like OpenMP and thread ve
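        A bare-bones sketch of that hybrid style, assuming an MPI library and an OpenMP-capable compiler; compute_chunk(), the loop bounds, and the work distribution are hypothetical:

          #include <mpi.h>
          #include <omp.h>

          // Hypothetical per-element work.
          static double compute_chunk(int i) { return i * 0.5; }

          int main(int argc, char** argv) {
              MPI_Init(&argc, &argv);                 // one MPI process per node (or per socket)
              int rank, nprocs;
              MPI_Comm_rank(MPI_COMM_WORLD, &rank);
              MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

              const int N = 1000000;
              double local_sum = 0.0;

              // Between MPI calls, thread the hotspot across the cores of this node.
              #pragma omp parallel for reduction(+:local_sum)
              for (int i = rank; i < N; i += nprocs)
                  local_sum += compute_chunk(i);

              double global_sum = 0.0;
              MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
              MPI_Finalize();
              return 0;
          }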

  • Really, the subject line says it all. By threads, I mean concurrent control flows with shared data structures. There are two reasons I think they're a bad idea:
    1. They're simply not scalable when it comes to large-scale concurrency. I don't mean that you can't create a lot of threads, I mean that the complexity of threaded execution simply becomes impossible for the programmer to manage when the thread-count gets high. Ed Lee at Berkeley has an excellent article on The Problem With Threads [berkeley.edu] that describes the
    • For some situations, threads seem to be the correct answer (such as for Shared Memory environments) - while not for others (possibly Distributed Memory environments). As even laptops now ship with dual core processors - you are less and less likely to find applications that would not see the value in performing tasks concurrently. Especially given the performance potential introduced by having more and more cores to work with. But who wants to deal with threads? What makes things easier for a developer
    • by jbo5112 ( 154963 )
      Would something like this work well on a small scale, say writing video games (or desktop apps other than media related ones) to make use of dual, quad, and eight core processors?

      Game programmers don't seem to have any problems thinking up ways to make use of all the processing power they can get a hold of, and for the average user, might be the only reason to want more than two or four cores. Personally, I'm more concerned about getting all that power packed into a cool user interface like beryl (if you d
    • JCSP - thanks for the pointer. I've not heard of it before. Have you used it much? I'll take a look. Hardware parallelism is best exploited by threads - specifically a thread per processor core. Software parallelism is NOT best expressed in threads. When Ed Lee made a very strong case against threads, I saw it as an attack on the evils of unstructured use of threads. Writing directly to raw thread interfaces (pthreads, Window threads) is writing to the "assembly language" of parallelism. It is very
      • I've used JCSP for a few experimental programs, but nothing really substantial. It works pretty well, but is encumbered (compared to a purpose-designed concurrent language) by some hefty syntactic overhead. There is (IIRC) a port of JCSP to the .NET framework, and a similar (but not identical) library for C++ called C++CSP. A few handy links:

        • A great beginner's intro to JCSP on IBM's developerWorks can be found here [ibm.com]
        • The JCSP project website at UoC Kent is here [kent.ac.uk]
        • The JCSP open source bundle is here [jcsp.org]

        Note t

    • The Lee paper I referred to above offers another option: coordination languages, which provide the concurrent glue between sequential programs written in existing mainstream languages.

      This sounds a lot like the Bulk Synchronous Parallel model proposed a few years ago. Using The Google, I see that there is still some interest, some research, and even a book published on the topic.

      I don't think the problem is with threads, specifically, but the low-level interface used today to create, coordinate, and use threads. Current threading APIs are almost like programming in Assembly language (or lower). Nor do I think the number of threads is a problem. However, each thread used and sync


  • The graphical advertisement that ran with this story:

    has the text

    One woman can have a baby in 9 months. But 9 women can't make a baby in one month. Not all algorithms can be parallelized.

    I googled that quote, and found a hit at an Intel blog:

    A Field of Nails
    By Clay Breshears
    November 20th, 2006
    http://softwareblogs.intel.com/2006/11/20/a-field-of-nails/ [intel.com]

    My favorite counter-example to illustrate this is a pregnant wo
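    A hedged aside: Amdahl's law is the usual way to put a number on how much an unavoidably serial part limits the speedup; a tiny sketch of the arithmetic (the 10% serial fraction is just an example, not a figure from the thread):

      #include <cstdio>

      // Amdahl's law: speedup = 1 / (serial_fraction + (1 - serial_fraction) / cores)
      static double amdahl_speedup(double serial_fraction, int cores) {
          return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
      }

      int main() {
          // With 10% of the work inherently serial, even infinite cores top out at 10x.
          std::printf("4 cores:  %.2fx\n", amdahl_speedup(0.10, 4));    // ~3.08x
          std::printf("16 cores: %.2fx\n", amdahl_speedup(0.10, 16));   // ~6.40x
      }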

    • Theory says that any algorithm can be parallelized to run in O(1) time, given enough processors. Alas, I can't find the citation. I think it's a paper from the early 1980s. The proof is based on simulating a Turing machine computation in O(1) time. The trick is to simulate all possible Turing machine executions that might yield the right answer, and pick out the right one.

      For example, to sort N items in constant time, generate all possible permutations, and choose the one that is sorted. It only


      • Alas, I can't find the citation. I think it's a paper from the early 1980s.

        If you can remember the paper, I'd love to know about it.

        the original exact answer given by the sequential program really is hard or impossible to deliver faster with parallel programming

        When you try to pin people down, the answer is usually something like "this computation is believed to be difficult to parallelize" [given the constraint of k processors].

        My question: Has anyone tried to take the thing beyond a vague sense
      • I saw something like this in the Journal of the ACM from the 60's or early 70's, I believe. I can't remember which year. Was full of math I didn't understand at the time (not that I would understand it any better now, but it's been a very long time...) Anyway, you have to be a member to access their library [acm.org], and I'm not a member so I can't look it up for you.

        A brief search turns up Structuring of Parallel Algorithms [acm.org] by P. A. Gilmore, 1968.

        Hopefully this will narrow the GP's search :-). Enjoy.

      • I believe Arch is thinking of Ian Parberry's paper Parallel speedup of sequential machines: a defense of parallel computation thesis [acm.org] from SIGACT, Summer 1986. As he said to me when he passed along the citation, requiring so many processors could be a boon for multi-core processors and Intel. :-)
      • > Theory says that any algorithm can be parallelized to run in O(1) time

        No. "Work optimal" is a term for an optimal parallel algorithm, and it isn't always O(1). As I recall, the lower limit for sorting n items is lg n, and you can't improve on this no matter how many processors you have available. See "Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms" by Rajasekaran and Reif. The trouble with your example is that, yes, if each of those n! processors is programmed to do a diff

        • The original question did not ask about optimality, but only about parallelizability. The cited paper "Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms" considers only sorts that are optimal, in the sense (to quote the abstract) that "the product of its time and processor bounds is upper bounded by a linear function of the input size." The O(1) construction I stated is not bound by that constraint, though I'll be the first to admit that the constraint is a good one for practical para

    • Turing machines have been around for just over 70 years, electronic computers for about 65 years, and parallel processing systems for less time than that. I think we're still in the infancy of theoretical computer science and am not surprised we don't have a theory about what can be made parallel in a practical sense. We're so used to seeing the pace of technical innovation and things change dramatically in the space of months, we begin to expect this from almost everything.

      I read "popular" math books w

    • I will probably be flamed for this post but oh well.

      What is with this ad? I'm used to running into posts that are mildly (or massively) offensive to women on Slashdot. After all, the population is almost exclusively male and it is to be expected. But to have something like this glaring at me from the front page is just a bit overboard, even for Slashdot.

      Most ads are designed to deliver the message: there is this product that is common but then there is our product which is way cool and better in every way.
  • I work with many embedded control systems, and parallel embedded monitoring and control systems. These applications tend to push the extremes: reliability, latency and cost.

    Simple embedded systems, like Programmable Logic Controllers (PLCs), structure all their code in a very defined structure. Often this is simply "one big loop" containing input, computation, then output. The multi-tasking PLCs define atomic copy instructions, such that all the input (or output) can be done in one synchronized instru

  • I'm starting in on my PhD thesis selection, and expect parallel computation to be the topic area. Does Intel offer any support (hardware, funding, etc.) for such research?
  • Once that's implemented, parallelism will fall in place.
    • It is an interesting area for sure - a world without locks is a good world if it works. But even if it were to work perfectly - doesn't it still make you program at too low a level? Still managing threads and still needing to declare transactions explicitly?
  • Our project's problems have mostly involved ensuring thread safety. We don't spawn threads, but if a user calls our code cold, it has to work safely and as fast as possible. A lot of our thread safety concerns involve creating caches of data that must be computed at runtime, but the caches usually don't change very much.

    We have no way to force our users to initialize our caches. In C/C++ code, C++ static initialization doesn't reliably work on every platform (e.g. HP-UX and Apple Ma
    • Isn't the point of the volatile keyword that every memory access is forced to read all the way from memory with no reordering? I.e., won't the following statement work?

      volatile int cacheinitialized;

      My understanding was that the volatile keyword would force reading of the actual memory location. Thus the compiler wouldn't attempt to either optimize or reorder accesses to memory locations corresponding to I/O. The following is the documentation from the Microsoft C++ compiler:

      Objects declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object at the point it is requested, even if a previous instruction asked for a value from the same object. Also, the value of the object is written immediately on assignment. Also, when optimizing, the compiler must maintain ordering among references to volatile objects as well as references to other global objects. In particular, a write to a volatile object (volatile write) has Release semantics; a reference to a global or static object that occurs before a write to a volatile object in the instruction sequence will occur before that volatile write in the compiled binary. A read of a volatile object (volatile read) has Acquire semantics; a reference to a global or static object that occurs after a read of volatile memory in the instruction sequence will occur after that volatile read in the compiled binary. This allows volatile objects to be used for memory locks and releases in multithreaded applications.
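      Note that the acquire/release behavior quoted above is a Microsoft compiler guarantee; standard C/C++ volatile by itself promises neither atomicity nor cross-thread ordering. A more portable sketch of one-time cache initialization uses pthread_once; the init_cache() and lookup() functions here are hypothetical:

        #include <pthread.h>

        static pthread_once_t cache_once = PTHREAD_ONCE_INIT;

        // Hypothetical expensive one-time setup.
        static void init_cache() {
            /* build the runtime cache here */
        }

        void lookup() {
            // Every thread can call this; init_cache() runs exactly once,
            // and all callers see its effects afterwards.
            pthread_once(&cache_once, init_cache);
            /* ... use the cache ... */
        }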

    • Multi-process vs multi-threading considerations do indeed rely on factors such as coarse-grain vs fine-grain jobs - whether your application is more dependent upon communication between nodes, or whether the application mostly relies on computations on a single node which requires sharing of memory use. The overhead of new process creation is definitely a consideration as well. Race conditions tend to be an inhibitor to considering multithreading and Intel Thread Checker has certainly proved to be a usef
    • And to address your comment - "We can NOT use other threading libraries due to the lack of portability. Our product has to work on everything from Palm OS to z/OS (IBM mainframes) and everything in between. We have to use the least common denominator between all operating systems that our code runs on." There's a good discussion starting on the "threaded keyword" post that you might want to provide some feedback on
  • I'd like to see a template implemented which is the container for an array, that keeps the changes private.
    (It would maintain a change list in some sort of sparse array structure.)
    It would overload [] and assignment to make this change mostly transparent.
    There could be a mutex locked method to make a public change.

    That would help me a lot with image processing parallelism.
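    A rough sketch of what such a wrapper might look like, with hypothetical names: a std::map stands in for the sparse change list, a Boost mutex guards the shared array, and read()/write() stand in for the overloaded [] and assignment the comment asks for. Readers of the shared image that run concurrently with commit() would still need their own coordination:

      #include <boost/thread/mutex.hpp>
      #include <map>
      #include <vector>

      // Hypothetical: reads fall through to the shared image, writes stay private
      // until commit() publishes them under the lock.
      class PrivateView {
          std::vector<int>&  shared;          // the shared image data
          boost::mutex&      shared_mutex;    // guards the shared image
          std::map<size_t, int> changes;      // sparse list of local edits
      public:
          PrivateView(std::vector<int>& img, boost::mutex& m) : shared(img), shared_mutex(m) {}

          int read(size_t i) const {
              std::map<size_t, int>::const_iterator it = changes.find(i);
              return it != changes.end() ? it->second : shared[i];
          }
          void write(size_t i, int value) { changes[i] = value; }   // private, lock-free

          void commit() {                                           // the one locked step
              boost::mutex::scoped_lock lock(shared_mutex);
              for (std::map<size_t, int>::iterator it = changes.begin(); it != changes.end(); ++it)
                  shared[it->first] = it->second;
              changes.clear();
          }
      };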
  • One of the biggest problems I've seen when going over parallel code written by others is that they don't understand what synchronization primitives are there for.

    In general, locks (mutexes, semaphores, etc.) are there to protect data, not code! This may be obvious to experienced developers, but it is really hard trying to get some people to wrap their brains around that concept. It doesn't help when some tutorials talk about "critical sections" of code. Sure, there are sections of code that are critical,
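    One way to make "locks protect data" concrete is to keep the mutex next to the data it guards, so the association is visible in the type; a small sketch with a hypothetical account struct, using Boost.Thread:

      #include <boost/thread/mutex.hpp>

      // The mutex lives with the data it protects, not with the code that happens to touch it.
      struct Account {
          boost::mutex lock;     // guards 'balance' below and nothing else
          long         balance;
      };

      void deposit(Account& a, long amount) {
          boost::mutex::scoped_lock hold(a.lock);   // any code path touching balance takes a.lock
          a.balance += amount;
      }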
    • You're right about protecting data. The problem/confusion is likely stemming from these tutorials not understanding or not emphasizing the definition of "critical region" of code. That is, code in which shared variables are accessed (usually modified).

      I, too, am suspicious of lock-free synchronizations. They seem interesting, but are still topics for research. Also, the complexity and subtleties of the code are higher than I might recommend, especially for new concurrent coders. Consider the derivati

      • Add me to the suspicious crowd. Lockless algorithms are the domain of experts. It's been noted by others that if you find a lockless algorithm for a non-trivial data structure, it's a publishable result.

        Lockless algorithms require extremely careful checking. Use of a model checker is essential. I've been happy with SPIN http://spinroot.com/spin/whatispin.html [spinroot.com] for doing this checking, but model checking is a lot of work with any tool.

        In my experience, lockless algorithms have had poor performance

  • I have avoided threading for the same reason I have avoided C++. It's a broken platform. It doesn't make you smarter after you learn about it - it makes you realize how broken the system is. It's more of a patchwork than a real solution. There needs to be a paradigm shift (a la Actor Oriented Programming).

    Until that happens, I'll confine myself to threadpools and a worker thread.
    • Thread pools are indeed an efficient way of managing native threads - they involve less manual thread creation and destruction, and assigning worker threads to manage concurrent task execution is also quite convenient. Libraries such as Intel(R) Threading Building Blocks create thread pools with an optimal number of worker threads, based upon detection of available processors/cores.
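      A stripped-down sketch of the pattern: a fixed pool of pthreads pulling work items off one queue. The Task type and pool size are hypothetical, and real libraries like TBB layer work stealing and core detection on top of this:

        #include <pthread.h>
        #include <queue>
        #include <vector>

        typedef void (*Task)(void*);                 // hypothetical work item: function + argument
        struct WorkItem { Task fn; void* arg; };

        class ThreadPool {
            std::vector<pthread_t>  workers;
            std::queue<WorkItem>    work;
            pthread_mutex_t         m;
            pthread_cond_t          cv;
            bool                    shutting_down;

            static void* run(void* self) {
                ThreadPool* p = static_cast<ThreadPool*>(self);
                for (;;) {
                    pthread_mutex_lock(&p->m);
                    while (p->work.empty() && !p->shutting_down)
                        pthread_cond_wait(&p->cv, &p->m);
                    if (p->work.empty()) { pthread_mutex_unlock(&p->m); return 0; }
                    WorkItem item = p->work.front(); p->work.pop();
                    pthread_mutex_unlock(&p->m);
                    item.fn(item.arg);               // run the task outside the lock
                }
            }
        public:
            explicit ThreadPool(int nthreads) : workers(nthreads), shutting_down(false) {
                pthread_mutex_init(&m, 0);
                pthread_cond_init(&cv, 0);
                for (int i = 0; i < nthreads; ++i)
                    pthread_create(&workers[i], 0, run, this);
            }
            void submit(Task fn, void* arg) {
                WorkItem item = { fn, arg };
                pthread_mutex_lock(&m);
                work.push(item);
                pthread_cond_signal(&cv);
                pthread_mutex_unlock(&m);
            }
            ~ThreadPool() {                          // drain remaining work, then join
                pthread_mutex_lock(&m);
                shutting_down = true;
                pthread_cond_broadcast(&cv);
                pthread_mutex_unlock(&m);
                for (size_t i = 0; i < workers.size(); ++i)
                    pthread_join(workers[i], 0);
                pthread_cond_destroy(&cv);
                pthread_mutex_destroy(&m);
            }
        };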
  • I'm doing some parallelization already by using fork/wait in a module that spawns five children to do rsync operations to a remote server. The operations send documents of varying sizes over a slow link to a server in Mumbai, India, and having five operations running simultaneously instead of a single, faster operation means that a single large document doesn't clog up the channel. (The motivation for this was numerous support calls at 2am to tell me that the queue was 'frozen' -- it wasn't, it was just sen
    • I agree. For the most part you can use multiple processes or threads to create parallelism. The majority of problems don't need to be so tight that they require special coding styles. Just don't write apps as one huge ball of code. It's easier to debug several small programs than a single massive app anyway. Much easier to use the OS scheduler than to write your own. Most of us aren't writing stuff that needs to be done in C/Asm, let alone that needs that tight of parallelism.
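      A minimal sketch of that multi-process style on POSIX: fork a fixed number of children, give each a share of the work, and wait for them all. The do_transfer() function stands in for the rsync call described above and is hypothetical:

        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>
        #include <cstdlib>

        // Hypothetical: push one slice of the queue to the remote server.
        static int do_transfer(int slice) { return slice >= 0 ? 0 : 1; }

        int main() {
            const int NCHILDREN = 5;
            for (int i = 0; i < NCHILDREN; ++i) {
                pid_t pid = fork();
                if (pid == 0) {                      // child: handle slice i, then exit
                    std::exit(do_transfer(i));
                }
                // parent: keep forking; the OS scheduler runs the children concurrently
            }
            int status;
            while (wait(&status) > 0)                // reap all children
                ;
            return 0;
        }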
  • I worked on the Sperry/Unisys 1100/2200 systems as an internalist and systems programmer. (I also worked on practically all of our product lines, and all of our software for each, usually up to the systems programmer level.)

    I have written numerous multi-activity programs, from async/sync comm, transaction, and batch, with and without a DBMS. I only used MASM (1100/2200), Fortran, COBOL, a small amount of C, and PLUS - an internal language used to write portions of the OS, which is also the Intermediate Language (IL)
  • How about an intuitive multithreaded debugger? Something that actually uses a GUI for something useful; pop up one debugger window per thread and be able to step through each thread independently? Tricky work, but it'll take some of the pain out of debugging multithreaded software.
