Do I have to learn CUDA/OpenCL — which seems a daunting task to me — or is there a simpler way?
You do NOT have to learn CUDA or OpenCL. You can use libraries or compilers. GPU libraries tend to give better performance than GPU compilers (e.g. OpenACC) and tend to be able to handle more algorithms. That is because compilers are simply not smart enough to do things as well as expert programmers who meticulously hand-tune kernels and put them in libraries. Any number of libraries are available. There are many poorly supported libraries out there, so you may have to search around to find good ones. I suggest one below.
What, currently, is the most painless way to start playing with GPU programming? Surely there must be a 'relatively painless' way out there, with which one can begin to learn how to harness the GPU?
My colleagues and I at AccelerEyes have dedicated the last 6 years of our lives to trying to help people find exactly what you're looking for: a "relatively painless" way to harness the GPU. The result is our ArrayFire library for CUDA or OpenCL. I know it's uncool to toot one's own horn, but the GPU computing community is small enough that people know each other and we're all working together to build out the ecosystem. There are many different contributions to GPU computing by many different groups. Our group's specialty in the ecosystem has always been the "relatively painless" contribution coupled with great performance. People like our stuff because we do nothing but work on squeezing out the most performance possible. Then we wrap up those kernels into convenient library calls that can be plugged into your code like math functions, with much less burden than writing the CUDA or OpenCL from scratch.
Happy to answer any further questions you may have about specific libraries, compilers, or GPU programming approaches. We eat, drink, and breathe everything CUDA/OpenCL.
BTW, we also encourage learning expert CUDA/OpenCL development. It is tough, no doubt about that. It is time-consuming and for many developers is not worth the added development complexity and lengthened development time. It sounds like you are probably in the boat of not caring about becoming an expert in low-level details, rather just wanting to get better performance to achieve a goal and be done with it. Is that correct?
Perhaps a Visual Programming Language or 'VPL' that lets you connect boxes/nodes and access the GPU very simply?
LabVIEW does not have good support for GPUs. Many ArrayFire users are building custom LabVIEW blocks so that they can program the GPUs more simply. I can connect you to some of those users if you wish (just shoot me a note at email@example.com).
I'm unaware of another graphical box/nodes package that supports GPUs.
While I'm at it, I know this post is going to be read by many expert CUDA/OpenCL developers out there. If you're interested in writing CUDA/OpenCL code daily, we're hiring (see my email above).
The pain is not in compiling GPU code; rather, the pain is in writing good GPU code. The major difference between NVIDIA and AMD (and the major edge NVIDIA has over AMD) is not as much the compiler as it is the libraries.
Of course, I'm biased, because I work at AccelerEyes and we do GPU consulting with our freely available, but not open source, ArrayFire GPU library, which has both CUDA and OpenCL versions.
1) Expert convolutions on the GPU (that work well for both separable/non-separable cases, arbitrary input matrix sizes, and arbitrary kernel sizes) are extremely difficult. I don't think you can beat our implementation. If you can, I will try to entice you away from other pursuits in life.
2) CONV2 (i.e. convolutions) are very useful in many applications and often make more sense than pursuing some other sort of arithmetic expression. I do agree with your statement, though, that algorithm/implementation choice is critical and is a decision that should come before optimization efforts. I just think convolutions are an essential tool to which many problems are best boiled down.
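For anyone wondering what the separable/non-separable distinction in point 1 means in practice, here is a minimal CPU-side sketch in plain Python. It is not our GPU implementation and says nothing about how Jacket does it; the function names `conv2_full` and `conv2_separable` are made up for illustration. It only shows that when a 2D kernel is the outer product of a column vector and a row vector, one 2D convolution can be replaced by two 1D passes.

```python
def conv2_full(image, kernel):
    """Direct 2D 'full' convolution on lists of lists (illustrative, CPU only)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * (iw + kw - 1) for _ in range(ih + kh - 1)]
    for i in range(ih):
        for j in range(iw):
            for u in range(kh):
                for v in range(kw):
                    # Each input pixel contributes to kh*kw output pixels.
                    out[i + u][j + v] += image[i][j] * kernel[u][v]
    return out


def conv2_separable(image, col, row):
    """Same result as conv2_full when kernel == outer(col, row)."""
    # Vertical pass: treat col as a (kh x 1) kernel.
    tmp = conv2_full(image, [[c] for c in col])
    # Horizontal pass: treat row as a (1 x kw) kernel.
    return conv2_full(tmp, [list(row)])


if __name__ == "__main__":
    image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    col, row = [1.0, 2.0, 1.0], [1.0, 0.0, -1.0]
    kernel = [[c * r for r in row] for c in col]  # outer product
    print(conv2_full(image, kernel) == conv2_separable(image, col, row))
```

The separable path does roughly kh + kw multiply-adds per pixel instead of kh * kw. A kernel that cannot be factored this way has to take the direct path, which is one reason a single implementation that handles both cases well is hard to write.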
Jacket is meant to be a luxury as was mentioned elsewhere... providing a faster, better approach to what you could try to reinvent by hand if you had infinite energy.
The Canny Edge benchmark is a full blown application (of which Canny Edge detection is the major component). The image sizes that are processed are listed in the graph, but there are tons of images being processed in the course of running the full application. We should make that clearer on our website... thanks for pointing that out.
People have been saying that open source would swamp Jacket since we launched in 2007. The reality is that it is too stinking hard to build good stuff open source (i.e. where the developers aren't paid), when there isn't an enormous user community to fuel the effort in intangible benefits back to the contributors. Otherwise, we'd open source Jacket and try to live off the service contracts like every other open source project.
So we end up pricing the software in line with what people are used to paying for add-ons to MATLAB. And Jacket is great, so we end up doing really well with this model.
While GPU computing in MATLAB is too small a niche, M-programming in general is ripe for open sourcing. Octave has never gained any steam and has been around so long that it is stale. Scilab seems good but is stuck in Europe. We would be thrilled to participate with the community in building something that delivers more promise overall. What is certain is that MathWorks has a greater stranglehold on science/engineering than Microsoft does on operating systems.
For a full explanation of why I say "fake", read, http://www.accelereyes.com/products/compare
For a brief explanation of why I say "fake" GPU support, consider the question: what does supporting GPUs mean? If you can run an FFT, are you content? Or do you want to use INV, SVD, EIG, RAND, and so on? Jacket has 10X the functionality of PCT-GPU.
Why else is the PCT-GPU implementation weak? It is so poorly constructed (shoehorned into their legacy Java system) that using the GPU through PCT-GPU is rarely more beneficial than using the CPU. It takes 600 cycles to load-then-store global memory on the GPU (required in each kernel call). The main innovation that led us to build Jacket is the ability to generate as few kernels as possible, eliminating as many 600-cycle round trips as possible. For example, Jacket's runtime system may launch only one kernel for every 20 lines of code. PCT-GPU, on the other hand, is limited to launching a GPU kernel for every basic function call.
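To make the kernel-count argument concrete, here is a toy model in plain Python. It is only a sketch of the general idea of fusing element-wise operations, not Jacket's runtime and not real GPU code. The `Stats` class and the `launch`, `run_unfused`, and `run_fused` functions are hypothetical names, and each "launch" simply counts one load and one store per element as a stand-in for a global-memory round trip.

```python
class Stats:
    """Counts simulated kernel launches and global-memory accesses."""

    def __init__(self):
        self.launches = 0
        self.loads = 0
        self.stores = 0


def launch(fn, data, stats):
    """Simulate one element-wise kernel: load, compute, store per element."""
    stats.launches += 1
    out = []
    for v in data:
        stats.loads += 1      # read the input element from "global memory"
        out.append(fn(v))
        stats.stores += 1     # write the result back to "global memory"
    return out


def run_unfused(x, stats):
    """One kernel per basic operation: three full passes over memory."""
    y = launch(lambda v: 2.0 * v, x, stats)
    z = launch(lambda v: v + 1.0, y, stats)
    return launch(lambda v: v * v, z, stats)


def fused_op(v):
    """The same three operations combined; intermediates never leave registers."""
    t = 2.0 * v + 1.0
    return t * t


def run_fused(x, stats):
    """A single kernel for the whole expression: one pass over memory."""
    return launch(fused_op, x, stats)


if __name__ == "__main__":
    x = [0.0, 1.0, 2.5]
    a, b = Stats(), Stats()
    print(run_unfused(x, a) == run_fused(x, b))
    print(a.launches, a.loads, a.stores)
    print(b.launches, b.loads, b.stores)
```

In this model the fused version produces the same values with one third of the launches and one third of the memory traffic, and the saving grows with the number of operations that get combined.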
Jacket also has a GFOR loop, which is the only parallel FOR-loop for GPUs: http://wiki.accelereyes.com/wiki/index.php/GFOR_Usage
I'm not aware of any MATLAB programmer that has had a good experience with PCT-GPU.
Finally, because I'm so thrilled at this getting slashdotted (despite it being a link promoting PCT-GPU), I'd be happy to offer free 3-month Jacket subscriptions to anyone who emails me in the next 48 hours with the word "slashdot" in the subject, at firstname.lastname@example.org
PS: Roblimo, if we can get some blurb love in your summary on the main slashdot.org page, it would really mean a ton to all our guys that have worked on this project for the last 4 years!