I think I understand... vaguely. To simplify, you're saying it's been trained on a specific dataset, and it chooses whichever image in the dataset the input is most like.
A bit.
It's easier to imagine in 2D. Imagine you have a bunch of height/weight measurements and a label telling you whether each person is overweight. Plot them on a graph, and you will see that in one corner people are generally overweight and in another corner they are not.
If you have a new pair of measurements come along with no label, you could just find the closest height/weight pair and use that. That is in fact a nearest neighbour classifier. It works, except that you need to keep all the original data around.
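To make that concrete, here is a minimal sketch of a nearest neighbour classifier. The measurements and labels are made up purely for illustration, and the names `TRAINING_DATA` and `nearest_neighbour` are my own, not from any library.

```python
import math

# Made-up (height_cm, weight_kg, overweight) examples, purely illustrative.
TRAINING_DATA = [
    (160.0, 50.0, False),
    (170.0, 65.0, False),
    (180.0, 75.0, False),
    (160.0, 85.0, True),
    (170.0, 95.0, True),
    (180.0, 110.0, True),
]


def nearest_neighbour(height, weight, data=TRAINING_DATA):
    """Return the label of the stored point closest to (height, weight)."""
    closest = min(
        data,
        key=lambda p: math.hypot(p[0] - height, p[1] - weight),
    )
    return closest[2]


print(nearest_neighbour(172.0, 68.0))  # closest stored point is (170, 65)
print(nearest_neighbour(168.0, 92.0))  # closest stored point is (170, 95)
```

Note that every prediction scans the whole of `TRAINING_DATA`, which is the "keep all the original data around" cost.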
If you imagine taking 1000 points along each of the two axes (1,000,000 in total), you could classify each of them according to the label of its nearest measurement. If you do that, you can see that there is more or less a line separating the two groups.
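Here is a rough sketch of that experiment, using a 20 x 20 grid instead of 1000 x 1000 so it fits on a screen. The data points, the axis ranges, and the names `POINTS`, `classify` and `label_grid` are all assumptions made up for the illustration.

```python
import math

# Made-up (height_cm, weight_kg, overweight) examples, purely illustrative.
POINTS = [
    (160.0, 50.0, False),
    (170.0, 65.0, False),
    (180.0, 75.0, False),
    (160.0, 85.0, True),
    (170.0, 95.0, True),
    (180.0, 110.0, True),
]


def classify(height, weight):
    """Label a point with the label of its nearest stored measurement."""
    nearest = min(POINTS, key=lambda p: math.hypot(p[0] - height, p[1] - weight))
    return nearest[2]


def label_grid(steps=20, h_range=(150.0, 190.0), w_range=(40.0, 120.0)):
    """Classify a steps x steps grid; rows run from low weight to high weight."""
    rows = []
    for j in range(steps):
        weight = w_range[0] + (w_range[1] - w_range[0]) * j / (steps - 1)
        row = []
        for i in range(steps):
            height = h_range[0] + (h_range[1] - h_range[0]) * i / (steps - 1)
            row.append(classify(height, weight))
        rows.append(row)
    return rows


grid = label_grid()
# Print with the highest weight at the top: '#' = overweight, '.' = not.
for row in reversed(grid):
    print("".join("#" if label else "." for label in row))
```

The edge between the `#` region and the `.` region in the printout is the separating line.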
Machine learning is generally the process of finding that line, or an approximation to it somehow.
The DNNs don't find the nearest neighbour explicitly: they just tell you which side of the line a given input is on. They also have a bunch of domain-specific knowledge built in, because we know something about the shape of the line, which helps find it. For example, objects in images may be scaled up or down in size or distorted in a variety of ways.
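As a toy version of "which side of the line", here is a sketch in the 2D height/weight picture. The slope and intercept are hand-picked for illustration (not learned, and not medical advice), and `SLOPE`, `INTERCEPT` and `side_of_line` are hypothetical names. A real network learns a far more complicated boundary, but the idea of storing the boundary rather than the data is the same.

```python
# Hand-picked, purely illustrative boundary: weight = SLOPE * height + INTERCEPT.
SLOPE = 0.75       # kg per cm (assumed for the example)
INTERCEPT = -50.0  # kg (assumed for the example)


def side_of_line(height, weight):
    """Return True if the point lies above the line (the 'overweight' side)."""
    return weight > SLOPE * height + INTERCEPT


print(side_of_line(170.0, 65.0))  # below the line at height 170 (threshold 77.5)
print(side_of_line(170.0, 95.0))  # above the line at height 170
```

Only two numbers are stored here; none of the original measurements are needed at prediction time.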
Is that about the gist? I'm probably not going to understand things about higher dimensions without a lot of additional information.
The answer is in fact tied to dimensionality. In the 2D example, you can cover the whole space with 1,000,000 points. To do the same in 3D, you need 1,000,000,000, and every extra dimension multiplies the count by another 1000. Beyond that the numbers rapidly become completely infeasible.
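The arithmetic is easy to check. This is a small sketch where `grid_points` is a name I made up, and the 784-dimension case assumes a hypothetical 28 x 28 greyscale image with one dimension per pixel.

```python
def grid_points(points_per_axis, dimensions):
    """Number of grid points needed to cover a space at a fixed resolution."""
    return points_per_axis ** dimensions


# 784 = 28 * 28, one dimension per pixel of a small greyscale image (assumed example).
for d in (2, 3, 10, 784):
    n = grid_points(1000, d)
    if d <= 3:
        print(f"{d}D: {n:,} points")
    else:
        # Too long to print in full, so report the number of digits instead.
        print(f"{d}D: a number with {len(str(n))} digits")
```

Each extra dimension multiplies the count by 1000, so the grid trick that works in 2D stops being an option almost immediately.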