Every time I see this topic appear on Slashdot (Last time) I think:
You're putting a neural network (NN) through a classification process where it is fed this image as a "fixed input", one whose constituent elements never change, and asking it to classify it correctly the same way a human would. The problem with this comparison is that the human eye does not see a "constant" input stream; the eye captures a stream of images, each slightly skewed as your head moves and the image changes slightly. From this stream of slightly different images, the human identifies an object.
However, in this research, time and again a "team" demonstrates a "fault" in a NN by feeding it a single, unvarying image and calling the result a "deep flaw in the image processing network", and I can't shake the feeling that they're doing it wrong.
To your topic, though: you'd better hope your car is not taking a single still image and acting on it. You'd better hope it is taking a stream of images and making decisions from those, which is a completely different class of problem from this one.
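To make the single-frame vs. stream distinction concrete, here's a toy sketch (everything in it is hypothetical: a fixed linear rule standing in for a trained network, Gaussian jitter standing in for head/camera motion). The point is just that a decision voted over many slightly perturbed views behaves differently from a decision on one fixed input: an input nudged barely across the decision boundary fools a single frame every time, but over a stream of jittered frames the votes split, exposing the input as ambiguous rather than confidently misclassified.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for a trained network: a fixed linear decision rule.
W = np.array([1.0, -1.0])

def classify_frame(x):
    """Classify one 'frame' -- a single fixed input, as in the papers."""
    return int(W @ x > 0)

def classify_stream(x, n_frames=200, jitter=0.2):
    """Classify a stream of slightly perturbed copies of x and report
    the fraction of frames voting for class 1, a crude stand-in for
    the eye seeing many slightly different views of the same scene."""
    frames = x + rng.normal(scale=jitter, size=(n_frames, x.size))
    votes = (frames @ W > 0).astype(int)
    return votes.mean()

clean = np.array([1.0, 0.3])         # comfortably class 1 (score +0.7)
adversarial = np.array([0.3, 0.35])  # nudged just past the boundary (score -0.05)

print(classify_frame(clean))         # 1
print(classify_frame(adversarial))   # 0: the single-image "fault"
print(classify_stream(clean))        # near 1.0: the stream is confident
print(classify_stream(adversarial))  # near 0.5: the stream exposes the ambiguity
```

A real perception stack is far more than a majority vote, of course, but even this crude aggregation shows why "the NN misclassified one carefully chosen still" and "the NN misclassified a moving scene" are different claims.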