The research we were doing was in fact prompted by the well-documented success of neural networks in other nonlinear problems. One of the very first good examples of an applied adaptive neural network was in the standard modem of the time, which used a very small neural network to optimize the equalizer settings on each end.
Neural nets appear to have a lot more success constructing nonlinear maps from subsets of R^n to R^m with n and m relatively small. Vision is not such a case, as the input dimension n is very large. Once n and m get large, you will require an exponentially large number of training samples, with an increased risk of falling into local minima (mitigated by simulated annealing or tunneling). In addition, if there is any inherent linearity in the problem, an old-school Kalman filter may be less sexy but more useful.
Many of the success stories of neural nets are really of the "Stone Soup" variety, in which the neural network is the "Stone" and the meat-and-potatoes real work is in how to preprocess the data to reduce the dimensions n and m. One of the most amazing (non-neural) pattern-recognition apps that I have seen recently is the Shazam technology, which can identify a recorded song from a 30-second (noisy) snippet. Their dimension-reducing logic involves hashes of spectrogram peak pairs. No neural nets to be seen, but absolutely brilliant, and it points to ways that similar things could be done in the visual domain.
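To make the dimension-reduction point concrete, here is a toy sketch of the peak-pair hashing idea (my own illustrative code, not Shazam's actual algorithm; the function name, the fan-out parameter, and the bit layout are all assumptions):

```python
def fingerprint(spectrogram_peaks, fan_out=5):
    """Hash pairs of spectrogram peaks into compact fingerprints.

    Hypothetical sketch of the Shazam-style idea: each peak is paired
    with a few peaks that follow it in time, and the triple
    (freq1, freq2, time_delta) is packed into one integer.

    spectrogram_peaks: list of (time_bin, freq_bin) tuples, time-sorted.
    """
    hashes = []
    for i, (t1, f1) in enumerate(spectrogram_peaks):
        for t2, f2 in spectrogram_peaks[i + 1 : i + 1 + fan_out]:
            dt = t2 - t1
            # Pack (f1, f2, dt) into a single integer key. Matching a
            # noisy snippet against a song database then reduces to
            # hash-table lookups on these small keys, instead of a
            # search over the raw high-dimensional audio.
            hashes.append(((f1 & 0x3FF) << 20) | ((f2 & 0x3FF) << 10) | (dt & 0x3FF))
    return hashes
```

The point is that millions of samples of raw audio collapse into a handful of small integer keys, which is exactly the kind of n-reduction the "Stone Soup" preprocessing does.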
I spent a lot of time on this project: writing a lot of neural net simulations (supervised and unsupervised learning, back-prop, Hopfield nets), reproducing a lot of Terry Sejnowski's and Grossman's work, and taking trips over to Caltech to see what David Van Essen and his crew were doing with their analysis of the visual cortex of monkey brains, trying to understand how "wetware" neural nets can so quickly identify features in a visual field, resolve 3D information from left-right pairs, and the like. For the most part, all of the neural net models were really programmable dynamical systems, and the big trick was to find ways (steepest descent, simulated annealing, Lagrangian analysis) of computing a set of synaptic parameters whose response minimizes an error function. That, and figuring out the "right" error function and training sets (assuming you are doing supervised learning).
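The "minimize an error function over synaptic parameters" recipe can be shown in a few lines. Here is a minimal steepest-descent example for a single sigmoid neuron learning the OR function (my own illustrative names and hyperparameters, not code from the project):

```python
import math
import random

def train_neuron(samples, lr=0.5, epochs=2000):
    """Steepest descent on a squared error E = (y - target)^2 / 2 for
    one sigmoid neuron: pick an error function, then repeatedly move
    the synaptic parameters (w0, w1, b) down its gradient."""
    w0, w1, b = (random.uniform(-1, 1) for _ in range(3))
    for _ in range(epochs):
        for (x0, x1), target in samples:
            y = 1.0 / (1.0 + math.exp(-(w0 * x0 + w1 * x1 + b)))
            # Chain rule: dE/dnet = (y - target) * y * (1 - y)
            delta = (y - target) * y * (1.0 - y)
            w0 -= lr * delta * x0
            w1 -= lr * delta * x1
            b -= lr * delta
    return w0, w1, b

# Supervised training set for logical OR
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
```

The same gradient logic, pushed backward through multiple layers, is back-prop; the hard part in practice was exactly what is said above, choosing the error function and the training set, not the descent itself.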
Bottom line: not much came of all this, beyond a few research grants and published papers. The one thing that we do know now is that real biological neural networks do not learn by backward-error propagation. If they did, the learning would be so slow that we would all still be no smarter than trilobites, if that. Most learning does appear to be "connectionist": it is stored in the synaptic connections between nodes, and those connections are strengthened when the nodes they connect often fire simultaneously. There is now some evidence of "grandmother cells", hypothetical cells that fire when, e.g., your grandmother comes into the room. But other than that, most of the real magic of biological vision appears to be in the pre-processing stages of the retinal signals, which are hard-coded layer upon layer of edge detectors, color mapping, and some really amazing slicing, dicing, and discrete Fourier transforms of the original data into pieces small enough and simple enough that the cognitive part of the brain can make sense of the information.
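That "strengthened when they fire simultaneously" rule is Hebbian learning, and it fits in a few lines. A toy sketch of the connectionist idea (my own function and parameter names, not a model of any specific biological circuit):

```python
def hebbian_update(weights, activations, lr=0.1):
    """One Hebbian step on a fully connected net: strengthen the
    connection between two nodes in proportion to how strongly they
    fire together ("cells that fire together wire together").

    weights: n x n list-of-lists of synaptic strengths.
    activations: length-n list of node firing levels.
    """
    n = len(activations)
    for i in range(n):
        for j in range(n):
            if i != j:
                # No error signal, no backward pass: purely local,
                # correlation-driven strengthening.
                weights[i][j] += lr * activations[i] * activations[j]
    return weights
```

Note what is missing compared to back-prop: there is no global error function and no backward pass, only local correlations, which is part of why it is biologically plausible where back-prop is not.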
It's pretty easy to train a small back-prop net to "spot" a silhouette of a cat and distinguish it from a triangle and a doughnut. It is not so easy to write a network simulation that can pick out a small cat in a busy urban scene, curled up beneath a dumpster, batting at a mouse....
To do that, you need a dog.