Software is malleable in that whatever is on the inside can be safely changed through refactoring to meet your new design goals. And yes, you have to adhere to strong design principles: the open/closed principle helps ensure that you can safely migrate to a new API while still supporting your old clients; the interface segregation principle helps ensure that your clients are always getting the right service without confusion; and you have to commit to serious code coverage metrics for your automated tests. That means you don't even write an exception handler unless you have a unit test that proves it properly catches the exception.
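To show what I mean by that last point, here's a minimal pytest-style sketch (the function name and fallback behavior are hypothetical, just for illustration):

```python
def parse_quantity(raw: str) -> int:
    """Hypothetical handler: bad input degrades to a safe default of zero."""
    try:
        return int(raw)
    except ValueError:
        return 0

def test_handler_catches_bad_input():
    # This is the test that earns you the right to write the except branch:
    # it proves the handler actually fires and degrades gracefully.
    assert parse_quantity("not-a-number") == 0

def test_happy_path_still_works():
    assert parse_quantity("42") == 42
```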
And developers absolutely cannot work in a vacuum, or be incompetent - there's no room for them. So when they're writing the negative tests, they're expected to be smart enough to understand the permutations and the boundaries in the requirements they're implementing. But high complexity means lots of paths through the code, which means lots of tests, and the need for tests that are practically achievable gives the developer a real incentive to keep code complexity down. That's a feat they continually accomplish through the refactoring step of TDD. That way, instead of writing fifty tests, perhaps they can split the code into five modules and write ten tests. Not coincidentally, this activity continues to improve the modularity, reusability, and maintainability of the module. So it improves the code's design after it's written (an activity that still wasn't needed up front). As a bonus, you get to execute the automated tests again and again, so future maintainers benefit by knowing they haven't broken your module. TDD is really a design methodology, not a test strategy.
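To make the arithmetic concrete, here's a toy sketch (hypothetical names): when two independent decisions live in one function, their paths multiply and so do the tests; pull them apart and the counts add instead.

```python
def validate_code(code: str) -> bool:
    # One concern, tested on its own: 5 code states -> 5 tests.
    return code.isalnum() and len(code) == 8

def format_receipt_line(description: str, cents: int) -> str:
    # The other concern, tested on its own: 10 formats -> 10 tests.
    return f"{description}: ${cents / 100:.2f}"

# Tangled into one function, the branches combine: 5 x 10 = 50 tests.
# Split like this, they don't: 5 + 10 = 15 tests.

def test_validate_rejects_short_code():
    assert not validate_code("abc")

def test_format_renders_dollars_and_cents():
    assert format_receipt_line("widget", 1999) == "widget: $19.99"
```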
And I know that you're using CAPTCHAs as a clever example (how can you prove that you wrote a transformation so complex that you can't Turing test it?), but the real answer there is that it depends on what code you're testing. Are you testing the code that processes the outcome for a true or false response? Are you testing the user interface that lets users type letters into a text box? Those tests aren't especially hard to automate. But when you're talking about the specifics of "is this CAPTCHA producing a human-interpretable output?" then you're talking about usability testing, which is expensive, manual, and slow. It's a task you'd perform after changing the CAPTCHA generation routines, but one you wouldn't be able to automate. So I'd manually test only after changing the generation routines, and I wouldn't alter those routines without scheduling more user testing.
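For instance, the outcome-processing piece is trivial to automate (hypothetical names, just to show which seam the automated tests cover while the usability testing stays manual):

```python
def handle_captcha_outcome(passed: bool) -> str:
    # Pure logic around the verification result; no human eye required.
    return "proceed" if passed else "retry"

def test_pass_lets_the_user_proceed():
    assert handle_captcha_outcome(True) == "proceed"

def test_failure_sends_the_user_back():
    assert handle_captcha_outcome(False) == "retry"
```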
(If I ever had to write a CAPTCHA for real, I'd probably try to parameterize it and allow the admins to tweak the image generation without my having to further change and test the code. So if the admin figures out how to tweak it to black-on-black text, and preventing low-contrast color schemes wasn't identified in the original effort, the admin could still untweak it. And yes, that should generate a bug report, even though it would be recoverable.)
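Roughly what I mean by parameterizing it, as a sketch (all names are hypothetical, and the contrast check is just an illustrative guard, not a real readability metric):

```python
from dataclasses import dataclass

@dataclass
class CaptchaStyle:
    foreground: tuple[int, int, int]  # admin-tweakable, no code change needed
    background: tuple[int, int, int]
    noise_level: float

def contrast_warning(style: CaptchaStyle, threshold: int = 100) -> bool:
    """Flag (don't block) near-invisible schemes like black-on-black,
    so the tweak stays recoverable but still generates a bug report."""
    diff = sum(abs(f - b) for f, b in zip(style.foreground, style.background))
    return diff < threshold

# Black on near-black: file the bug, let the admin untweak it.
style = CaptchaStyle(foreground=(0, 0, 0), background=(10, 10, 10), noise_level=0.3)
assert contrast_warning(style)
```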
But in terms of difficult-to-test code, teams that do this kind of development work well will often have different suites of tests for different situations. Etsy does this really well by splitting tests into various categories: slow, flaky, network, trunk, sleep, database, etc. They always run all trunk tests on every build, but a developer only executes the network tests when working on something that exercises actual network communication. See http://codeascraft.com/2011/04... for their really inspiring blog.
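pytest's markers give you a cheap version of the same split (the suite names here are made up; register them under `markers` in pytest.ini so pytest doesn't warn about unknown marks):

```python
import pytest

@pytest.mark.network
def test_service_health_endpoint():
    # Opt-in suite: run with `pytest -m network` only when touching networking code.
    import urllib.request
    with urllib.request.urlopen("http://localhost:8080/health") as resp:  # hypothetical service
        assert resp.status == 200

@pytest.mark.slow
def test_nightly_batch_report():
    assert sum(range(10_000)) == 49_995_000

def test_pure_parsing_logic():
    # Untagged: the always-run suite, like Etsy's "trunk" tests on every build.
    assert "a,b,c".split(",") == ["a", "b", "c"]
```

Then every build runs `pytest -m "not network and not slow"`, and you pull in the expensive suites only when the work actually warrants it.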