Communicating data between threads in a foolproof way. Writing a threaded program is easy if the program is simple. You can even get a bit more performance out of a program using multiple threads if you use locking. But once you use locking, you open the door to deadlock, and anywhere you forget a lock you get race conditions and other nightmares.
Extending this to something like a game engine is much harder. Say we split our physics and rendering into two threads. How does the physics thread update the render thread? We could just lock the whole scene graph, but then we don't get much of a performance increase, if any. We could then use two buffers: the renderer renders the data from one while the physics thread updates the other. When we are ready to update the frame, we just swap the buffers. That leaves us with some input lag, and there are still complications. What happens if we add an AI thread? How does it add data to the buffer in a way that doesn't conflict with the physics thread?
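To make the two-buffer idea concrete, here is a minimal sketch. The names `Transform`, `DoubleBuffer` and `run_frames` are made up for illustration, and I am assuming both threads meet at a frame boundary before the swap. Spawning a thread per frame is only there to keep the example short; a real engine would keep long-lived threads and a barrier.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-object state, just enough to show the idea.
struct Transform { float x = 0.0f; float y = 0.0f; };

class DoubleBuffer {
public:
    explicit DoubleBuffer(std::size_t count) {
        buffers_[0].resize(count);
        buffers_[1].resize(count);
    }
    // Last published frame: read-only, safe to share between threads.
    const std::vector<Transform>& front() const { return buffers_[front_]; }
    // Scratch frame: only the physics thread may touch this between swaps.
    std::vector<Transform>& back() { return buffers_[1 - front_]; }
    // Assumption: called only at the frame boundary, when both threads are idle.
    void swap() { front_ = 1 - front_; }

private:
    std::array<std::vector<Transform>, 2> buffers_;
    int front_ = 0;
};

// Runs `frames` frames and returns the x value the renderer saw on the last one.
float run_frames(DoubleBuffer& scene, int frames) {
    assert(!scene.front().empty());
    float last_seen = 0.0f;
    for (int f = 0; f < frames; ++f) {
        // Physics builds the next frame in the back buffer on its own thread.
        // The back buffer holds stale data, so every element is overwritten.
        std::thread physics([&scene] {
            const std::vector<Transform>& prev = scene.front();
            std::vector<Transform>& next = scene.back();
            for (std::size_t i = 0; i < next.size(); ++i) {
                next[i].x = prev[i].x + 1.0f;
                next[i].y = prev[i].y;
            }
        });
        // Meanwhile the "renderer" (this thread) only reads the front buffer.
        last_seen = scene.front()[0].x;
        physics.join();  // frame boundary: nobody is reading or writing now
        scene.swap();    // publish the new frame
    }
    return last_seen;
}
```

After three frames the published state is at x = 3 while the renderer last drew x = 2, which is exactly the one-frame lag mentioned above. Note that nothing here helps a second writer such as an AI thread: two threads writing the same back buffer would still need to coordinate.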
We could use lock-free lists, which are very hard to get right. Some implementations that I have seen even end up locking the heap, which is exactly what we want to avoid. And even then we end up with some issues.
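For a feel of what a lock-free list looks like, here is a rough sketch of about the simplest kind I know of: many threads push, and one consumer detaches the whole list at once. `LockFreeInbox` and `drain` are names I made up, and this is not a general-purpose list. Popping single nodes is where the hard problems (ABA, safe memory reclamation) start, and this sketch sidesteps them on purpose.

```cpp
#include <atomic>
#include <cassert>
#include <utility>

template <typename T>
class LockFreeInbox {
public:
    LockFreeInbox() = default;
    LockFreeInbox(const LockFreeInbox&) = delete;
    LockFreeInbox& operator=(const LockFreeInbox&) = delete;

    // Free any nodes that were never drained.
    ~LockFreeInbox() { drain([](const T&) {}); }

    // Any number of threads may call push() concurrently.
    void push(T value) {
        // Note: `new` goes to the heap, and the allocator may take a lock
        // internally, which is the problem described above.
        Node* node = new Node{std::move(value),
                              head_.load(std::memory_order_relaxed)};
        // Try to install `node` as the new head. On failure, node->next is
        // refreshed with the current head and we simply retry.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    // Consumer side: detach the whole list in one atomic step, then walk it
    // privately. Items are visited newest first. Returns the item count.
    template <typename Fn>
    int drain(Fn&& visit) {
        Node* node = head_.exchange(nullptr, std::memory_order_acquire);
        int count = 0;
        while (node != nullptr) {
            visit(node->value);
            Node* next = node->next;
            delete node;
            node = next;
            ++count;
        }
        return count;
    }

private:
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};
};
```

In the game-engine example, the AI and physics threads could each `push` their updates and the owner of the buffer could `drain` them once per frame. Even this toy version allocates on every push, so avoiding the heap would need something like a preallocated node pool, which brings its own complications.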
Don't get me started on debugging threaded applications. You find that while it works fine on one and two cores, 0.1% of the time on a quad core there is a deadlock.
So, to sum it up: anyone can write a threaded application where it is easy to split the tasks. If you are designing it from the ground up, it is even easier. If you need to write performance-critical, maintainable code that involves a lot of communication, it suddenly gets much harder.