Ah, right, that paper. I don't think they'd use the word "continuous" the way they did if they thought about it for a bit either. They use it as a vague throwaway in the abstract and then never again. Also "fairly discontinuous" is silly. It either is or it isn't, nothing in between. That's kind of the defining property of a discontinuity.
What they actually mean is this:
Our main result is that for deep neural networks, the smoothness assumption that underlies many kernel methods does not hold. Specifically, we show that by using a simple optimization procedure, we are able to find adversarial examples, which are obtained by imperceptibly small perturbations to a correctly classified input image, so that it is no longer classified correctly.
Deep neural networks can be "highly nonlinear" (which they note). The output of nonlinear systems can vary a lot in response to small changes in the input. In fact, one way of defining smoothness is as an upper bound on local nonlinearity, so this result shouldn't be at all surprising. However, discontinuities are only a subset of things that are nonlinear and not smooth. In fact, the optimization they use to generate their adversarial examples depends on the neural network being continuous (down to the inherent discretization of the datatype).
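To make the "not smooth, but still continuous" distinction concrete, here is a toy sketch. It is not the paper's optimization procedure and not a neural network: `model` is a made-up one-dimensional function, and `STEEPNESS`, `step`, and the starting point are values I picked for illustration. The point is only that a gradient-guided search finds a small input change that flips the output, and that the search works because the function is continuous and differentiable.

```python
import math

STEEPNESS = 200.0  # assumed value; larger means less smooth, but still continuous


def model(x):
    """A steep but continuous stand-in for a network output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-STEEPNESS * (x - 0.5)))


def model_grad(x):
    """Derivative of model(x); it exists everywhere because model is differentiable."""
    p = model(x)
    return STEEPNESS * p * (1.0 - p)


def find_adversarial(x, step=1e-4, max_steps=1000):
    """Take small steps against the gradient until the output drops below 0.5."""
    x_adv = x
    for _ in range(max_steps):
        if model(x_adv) < 0.5:
            break
        # In one dimension the gradient only tells us which direction to move.
        x_adv -= step * math.copysign(1.0, model_grad(x_adv))
    return x_adv


x = 0.52                      # confidently on the "high" side
x_adv = find_adversarial(x)   # nearby input on the other side of 0.5

print(model(x), model(x_adv), abs(x - x_adv))
```

The output swings from close to 1 to below 0.5 for an input change of about 0.02, yet there is no jump anywhere in `model`. If there were a genuine discontinuity, the gradient would carry no information about where the flip happens and this kind of search would have nothing to follow.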
Unfortunately a lot of other people only read the abstract of this paper and completely missed this (the very next paragraph after the one I quoted above):
In some sense, what we describe is a way to traverse the manifold represented by the network in an efficient way (by optimization) and finding adversarial examples in the input space. The adversarial examples represent low-probability (high-dimensional) “pockets” in the manifold, which are hard to efficiently find by simply randomly sampling the input around a given example. Already, a variety of recent state of the art computer vision models employ input deformations during training for increasing the robustness and convergence speed of the models [9, 13]. These deformations are, however, statistically inefficient, for a given example: they are highly correlated and are drawn from the same distribution throughout the entire training of the model. We propose a scheme to make this process adaptive in a way that exploits the model and its deficiencies in modeling the local space around the training data.
The concept of adversarial examples isn't about "lol, neural networks are dumb and easy to fool." It's about efficiently generating supplemental training data that makes the model more robust.
Szegedy et al. also only study one specific type of model, although with some internal variations, and there is a discontinuity in that model. It has nothing to do with the neural network though. It's the very last step, where you decide that, for example, a vector of probabilities like [0.3, 0.29999, 0.30001, 0.1] == [0,0,1,0]. It's called thresholding, and we sometimes do it because it's necessary to make a decision, but we also do it a lot because humans don't deal well with uncertainty.
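A minimal sketch of that last step, using the probability vector from above. The helper name `one_hot_argmax` and the nudged vector `after` are my own, chosen for illustration; the thresholding here is a plain argmax.

```python
def one_hot_argmax(probs):
    """The thresholding step: keep only the position of the largest entry."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return [1 if i == best else 0 for i in range(len(probs))]


before = [0.3, 0.29999, 0.30001, 0.1]
after = [0.30001, 0.29999, 0.3, 0.1]  # hypothetical nudge of 0.00001 in two entries

# Largest change in any single probability between the two vectors.
shift = max(abs(a - b) for a, b in zip(before, after))

decision_before = one_hot_argmax(before)
decision_after = one_hot_argmax(after)

print(shift, decision_before, decision_after)
```

The probabilities barely move, but the decision jumps from the third class to the first. The jump lives entirely in `one_hot_argmax`, not in whatever produced the probabilities.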
I think you can find a similar adversarial-plus-thresholding example for the human brain in optical illusions like Rubin's Vase. You see a face or a vase, switching back and forth as you imperceptibly change your focus on different parts of the completely static image. You also don't see both at the same time; it's pretty strongly either/or. "Fairly discontinuous" if you will.