It's unlikely our language processing is over-engineered, as we certainly lost some other forms of neural processing in the trade-off (check out this episode of Mind Field, which covers this in some depth: https://www.youtube.com/watch?...). Most animals already have some language and tone interpreters built in, so language interpretation need not evolve at exactly the same time as the ability to accidentally make sounds that can more uniquely communicate your intention. However, one can imagine the increased survival benefit of being able to quickly learn the sound mannerisms of the ruling family and neighboring tribes, and thus those already having superior interpretation skills were likely selected for via evolution, in much the way we selected for similar genes in border collies (see https://www.youtube.com/watch?...). Moreover, it seems that the ability for complex thought (and thus the ability to further extend language) is largely an artifact of having been taught to name more things during development (https://www.youtube.com/watch?v=gMqZR3pqMjg). Similarly, our ability to perceive things can be limited by whether we were taught a name for them (https://www.youtube.com/watch?v=mgxyfqHRPoE).
Most research I've read compels me to believe that everything in the brain seems to be stored by association, which is both efficient and allows for some redundancy at the same time. There is no single node responsible for the perception of a "zebra" or "tiger" (as the "Jennifer Aniston Neuron" paper would lead you to believe); rather, provocation of a single neuron can lead to recall of the major chords associated with that neuron, much like playing a 'C' on a piano might make one think of the C major chord, unless one is feeling sad, in which case one might first think of C minor. (Emotional processing is why there will likely be various resolutions of a brain back-up, the first one just being the map of default connectivity, and the others how that map changes with emotion.) The same redundancy exists in lower-level processing like retinal computation (see https://www.youtube.com/watch?... for a primer), likely because the interconnectivity allows for better denoising of the system. Thus, while a single neuronal node (like the Aniston neuron) might be able to stimulate the recall/perception of a 'stripe' or 'loop', in undamaged HUMAN circuits perception never really works this way.
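If a concrete analogy helps, the "chord" idea above is roughly what a toy Hopfield-style associative memory does: the stored pattern lives in the weights between many units, and activating a corrupted fragment of it pulls the rest back. This is purely an illustrative sketch (the pattern names, sizes, and parameters are all made up), not a model of actual cortical wiring:

```python
import numpy as np

rng = np.random.default_rng(42)

def store(patterns):
    """Hebbian outer-product storage over +/-1 patterns (no self-connections)."""
    n = patterns.shape[1]
    w = np.zeros((n, n))
    for p in patterns:
        w += np.outer(p, p)
    np.fill_diagonal(w, 0)
    return w / len(patterns)

def recall(w, cue, steps=20):
    """Let the cue 'harmonize' with the stored associations until it settles."""
    s = cue.astype(float).copy()
    for _ in range(steps):
        s = np.sign(w @ s)
        s[s == 0] = 1.0
    return s

# Two distributed "memories" across 32 units (think 'zebra' and 'tiger'):
zebra = rng.choice([-1, 1], size=32)
tiger = rng.choice([-1, 1], size=32)
w = store(np.stack([zebra, tiger]))

# Cue with a corrupted fragment of 'zebra' (a quarter of the notes flipped):
cue = zebra.copy()
flip = rng.choice(32, size=8, replace=False)
cue[flip] *= -1

# The recalled state typically snaps back to the full 'zebra' pattern.
print("overlap with zebra:", int(recall(w, cue) @ zebra), "/ 32")
print("overlap with tiger:", int(recall(w, cue) @ tiger), "/ 32")
```

The point of the toy is that no single weight "is" the zebra: the pattern is spread across the whole network, and a partial or noisy cue still reconstructs it, which is the redundancy being described above.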
Reading is not always the best way to convey higher-level concepts. While reading does provide a more rapid way to convey information, because your brain isn't distracted by the overhead of needing to interpret body language and tone, one's ability to recall information is more directly correlated with having an actual or imagined experience (especially an emotional one), which is why we have lab classes in science and need to actually write some code in order to learn how to program.
Your final question seems to be based on the false premise that we don't use the auditory cortex when reading, so I cannot address it directly. Many of the same auditory processes are still involved in reading (https://www.sciencedirect.com/science/article/pii/S0960982213000055). Similarly, while the brain is very plastic, it tends to keep the processing of like things together in similar regions across humans; our visual cortex and navigational processing, for instance, seem to happen in the same areas, but if vision is removed, some of this area can be repurposed for storage (most likely of where something was, or what it felt like). Moreover, there seems to be much more evidence that perception and memory storage happen at the level of network groups rather than individual nodes. Memories of perceptions are stored/experienced like chords played on a piano; only when we try to recall do we run into a scenario where a single neuron/note can be linked to what one might describe as a memory, because the brain automatically tries to play the appropriate chord to harmonize with it. This is the basis of why it's so easy to implant false memories.
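Stretching the same toy one step further (same caveats: purely illustrative, not a claim about real circuits), if you cue such an associative memory with a mixture, it still settles onto one of its clean stored patterns. What comes back is a reconstruction rather than a record, which is one very loose way to picture how a suggestion can end up retrieved as if it were the original memory:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 32
zebra = rng.choice([-1, 1], size=n)
tiger = rng.choice([-1, 1], size=n)

# Same Hebbian storage as in the earlier sketch, written inline.
w = (np.outer(zebra, zebra) + np.outer(tiger, tiger)) / 2.0
np.fill_diagonal(w, 0)

# A misleading cue: mostly 'zebra', but with a block of 'tiger' notes mixed in.
cue = zebra.copy()
cue[:10] = tiger[:10]

s = cue.astype(float)
for _ in range(20):            # let the network "harmonize" the cue
    s = np.sign(w @ s)
    s[s == 0] = 1.0

# Typically the settled state is a clean stored pattern, not the mixture we cued.
print("matches zebra exactly:", np.array_equal(s, zebra))
print("matches the cue exactly:", np.array_equal(s, cue))
```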