To me the crux comes down to the experiential history any consciousness has as a reference in a conversation. Remove any one sense from a person and then try to have a conversation with them in text, and there are noticeable differences. A chatbot has had every sense removed except some strange ability to "see text in an otherwise silent, dark experience," which leaves it severely handicapped as a participant. Contextual clues aren't just a decorative influence on meaningful dialog; they're the essence of it.
So until we get a "bot" that can use some form of vision, hearing, and touch - and possibly smell and taste - to fill its "memory" with the massive web of associations we humans rely on, it'll never do much. We're left with a machine guessing at the layers of meaning involved and following massive piles of rules to mimic the text of real communication. It cannot easily make the jumps across semantic concepts in jokes like "How does a fish smell? With its nose, dummy!" or in phrases as simple as "See what I mean?", "I heard you were taking a vacation," "Check out this vid, it touches on the finer points about AI," or "Over here, the weather is great." The list is endless, and subtly woven into all conversations.
Interestingly, a machine that could take input like our own senses wouldn't need to be limited to just those five. It could have broader-bandwidth input for light and sound, and get into perceiving radio waves, echolocation, etc. Of course, it would have to talk to us in "human context," understanding that a time-related phrase like "a little while" is based on human perception, the locale, and so on. We might also have to get used to a single bot with multiple physical presences, one that "lived" (had sensory input from) several locations across the globe at once but knew to focus on our location when chatting with us.
Some have proposed a precursor to such a machine: use machine-aided design to build the bot itself. If a computer could design the optimal "drivers" for stereoscopic vision (layers of them - for color, contrast, movement, etc.) through iterative evolutionary means - where multiple candidate designs for, say, contrast compete against a fitness test - we might get a machine that accepts input from devices and stores/searches it far more effectively. Right now, we throw a lot of guesses around and just employ massive processing power. Of course, this iterative design process would need to be built into the bot permanently, so that it kept improving without so much tinkering.
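To make the evolutionary idea concrete, here is a toy sketch (my own illustration, not anyone's proposed design): a population of candidate 3-tap "contrast drivers" competes on a fitness test - how strongly each responds to a step edge in a synthetic signal - with the weakest designs replaced each generation by mutated copies of the survivors.

```python
import random

# Synthetic test signal: a single step edge, the kind of feature a
# "contrast driver" should respond to strongly.
SIGNAL = [0.0] * 10 + [1.0] * 10

def fitness(kernel):
    # Fitness test: strongest absolute response of this kernel anywhere
    # along the signal, with a small penalty on weight magnitude so the
    # trivial "just make the weights huge" solution doesn't win.
    responses = [
        abs(sum(w * SIGNAL[i + j] for j, w in enumerate(kernel)))
        for i in range(len(SIGNAL) - len(kernel) + 1)
    ]
    return max(responses) - 0.1 * sum(w * w for w in kernel)

def mutate(kernel):
    # Small random perturbation of each weight.
    return [w + random.gauss(0, 0.2) for w in kernel]

def evolve(generations=200, pop_size=20):
    random.seed(0)  # reproducible run for this illustration
    pop = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]  # selection: top half survives
        pop = survivors + [
            mutate(random.choice(survivors))
            for _ in range(pop_size - len(survivors))
        ]
    return max(pop, key=fitness)

best = evolve()
print("best kernel:", best, "fitness:", fitness(best))
```

A real version would evolve far richer structures than three weights, and the fitness test would run against real sensor input, but the loop - compete, select, mutate, repeat - is the same one the paragraph above describes.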